social network-based recommender systemspzs.dstu.dp.ua/datamining/social/bibl/social_network...d....

Daniel Schall

Social Network-Based Recommender Systems

Social Network-Based Recommender Systems

Daniel Schall

Social Network-BasedRecommender Systems

123

Daniel SchallSiemens Corporate TechnologyWien, Austria

ISBN 978-3-319-22734-4 ISBN 978-3-319-22735-1 (eBook)DOI 10.1007/978-3-319-22735-1

Library of Congress Control Number: 2015951351

Springer Cham Heidelberg New York Dordrecht London© Springer International Publishing Switzerland 2015This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part ofthe material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,broadcasting, reproduction on microfilms or in any other physical way, and transmission or informationstorage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodologynow known or hereafter developed.The use of general descriptive names, registered names, trademarks, service marks, etc. in this publicationdoes not imply, even in the absence of a specific statement, that such names are exempt from the relevantprotective laws and regulations and therefore free for general use.The publisher, the authors and the editors are safe to assume that the advice and information in this bookare believed to be true and accurate at the date of publication. Neither the publisher nor the authors orthe editors give a warranty, express or implied, with respect to the material contained herein or for anyerrors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

www.springer.com

www.springer.com

To my son Kilian

Preface

People increasingly use social networks to manage various aspects of their livessuch as communication, collaboration, and information sharing. A user’s networkof friends may offer a wide range of important benefits such as receiving onlinehelp and support and the ability to exploit professional opportunities. One of themost profound properties of social networks is their dynamic nature governed bypeople constantly joining and leaving the social networks. The circle of friends mayfrequently change when people establish friendship through social links or whentheir interest in a social relationship ends and the link is removed.

This book introduces novel techniques and algorithms for social network-basedrecommender systems. Here, concepts such as link prediction using graph patterns,following recommendation based on user authority, strategic partner selectionin collaborative systems, and network formation based on “social brokers” arepresented. In this book, well-established graph models such as the notion of hubsand authorities provide the basis for authority-based recommendation and aresystematically extended towards a unified Hyperlink Induced Topic Search (HITS)and personalized PageRank model. Detailed experiments using various real-worlddatasets and systematic evaluation of recommendation results proof the applicabilityof the presented concepts.

Vienna, Austria Daniel SchallJune 2015

vii

Acknowledgements

This book provides a detailed summary and new viewpoints on the author’s researchin the field of social network analysis and link formation techniques. Daniel Schallreceived his Ph.D. degree in computer science from the Vienna University ofTechnology in 2009.

He started his research career at Siemens Corporate Research in Princeton, NJ,USA, in 2003, where he was employed as a technical associate. In 2006, he beganhis doctoral studies at the Vienna University of Technology. At the Vienna Univer-sity of Technology, he was involved in a number of European funded FP6 and FP7projects both as a project manager and key researcher. Daniel was a principal inves-tigator of crowdsourcing and social computing activities at the Distributed SystemsGroup. The main results of his research in the context of crowdsourcing and HumanProvided Services were published in the book Service-Oriented Crowdsourcing:Architecture, Protocols and Algorithms. Also, he published more than 40 scientificpapers at top-ranked international conferences and more than 20 scientific journalpapers in highly ranked journals and renowned magazines including the Journalof Informetrics, Decision Support Systems, Data and Knowledge Engineering,Information Systems, IEEE Transactions on Services Computing, IEEE Computer,IEEE Internet Computing, and Social Network Analysis and Mining.

Dr. Daniel Schall is currently employed as a senior key expert research scientistat Siemens Corporate Technology in Vienna, Austria.

ix

Contents

1 Overview Social Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Recommendations in Social Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Recommendation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.2 Follow Recommendation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.3 Partner Recommendation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2.4 Broker Recommendation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Research Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4 Book Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Link Prediction for Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Friendship Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 Background in Link Prediction .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.3 Software Framework .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.4 Predicting Friendship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4.1 Similarity-Based Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4.2 Triad Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4.3 Triadic Closeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.4.4 Triadic Closeness Example.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5 Data Collection .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.6.2 Prediction Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Follow Recommendation in Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.1 Social Collaboration Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.2 Background in Online Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.3 Recommendation Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

xi

xii Contents

3.4 Follow Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.4.1 Authority-Based Recommendation Model . . . . . . . . . . . . . . . . . . . . 373.4.2 Personalized Authority Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.4.3 Weights and Personalization Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.5 Data Collection .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.5.1 GitHub Community.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.5.2 User Activity by Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.5.3 Programming Languages .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473.5.4 Follower Graph and Reciprocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503.6.1 Recommendations Without Personalization .. . . . . . . . . . . . . . . . . . 513.6.2 Personalized Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


4 Partner Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594.1 Social Network-Based Collaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594.2 Background in Strategic Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.3 Reputation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.3.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.3.2 Hubs and Authorities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634.3.3 Query-Sensitive Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644.3.4 Time-Aware Authority. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.4 Structural Importance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674.5 Framework and Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.5.1 Software Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.5.2 Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.6.1 Description of Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.6.2 Top-k Rank Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774.6.3 Statistical Comparison .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83


5 Social Broker Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955.1 Virtual Organizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955.2 Background in Distributed Organizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 975.3 Hybrid Compute Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.3.1 Human-Provided Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995.3.2 Emergence and Evolution of Social Trust . . . . . . . . . . . . . . . . . . . . . 100

5.4 Expert Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.4.1 Collaboration Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.4.2 Brokerage and Composition.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035.4.3 Broker Patterns and Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.5 SBQL Syntax .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055.5.1 Connecting Predefined Communities . . . . . . . . . . . . . . . . . . . . . . . . . . 106

Contents xiii

5.5.2 Finding Communities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085.5.3 Finding Exclusive Brokers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

5.6 Broker Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1095.6.1 Community Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1095.6.2 Trust Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115.6.3 Broker Importance .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.7.2 Performance Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1165.7.3 Ranking Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1185.7.4 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Chapter 1Overview Social Recommender Systems

Abstract This chapter gives an introduction to social network-based recommendersystems. The main recommendation techniques as presented in this book includinglink prediction, follow recommendation, partner recommendation using reputationevaluation, and social broker recommendation are highlighted.

1.1 Recommendations in Social Networks

In recent years considerable attention has been devoted to the analysis of socialnetworks structures. Social networks typically consist of nodes representing peopleor other entities embedded in a social context and edges representing interaction,collaboration, or some other form of linkage between entities. Examples of socialnetworks include personal social networks such as Facebook, Twitter, Google Plusor professional social networks such as LinkedIn. Other social network basedsystems are, for example, the set of organizations collaborating in the contextof research projects or organizations forming professional virtual communities.The availability of large, detailed social network datasets has stimulated extensiveresearch of their basic properties. Social networks are highly dynamic and grow andchange quickly over time through the addition of new edges or removal of existingedges.

Link formation techniques support the discovery and establishment of socialrelations in social network based systems. The applications include:

• Establishment of New Social Relations. The emergence of new personal relationsis actively facilitated through link prediction techniques.

• Supporting the Formation of Expert Communities. Following recommendationsare based on community expertise and increase the cohesiveness of onlinecommunities.

• Strategic Formation of Teams. The automatic discovery of community reputationand structural holes helps in establishing competitive compositions of researchorganizations.

• Bridging Online Communities. Fragmented communities are connected throughsocial brokers who act as intermediaries between communities and strengtheninformation exchange.

© Springer International Publishing Switzerland 2015D. Schall, Social Network-Based Recommender Systems,DOI 10.1007/978-3-319-22735-1_1

1

2 1 Overview Social Recommender Systems

The next section provides an introduction to main link formation techniques insocial network based systems as detailed in this book.

1.2 Recommendation Techniques

Figure 1.1 gives an overview of the presented recommendation techniques. Thetechniques are structured into peer based recommendations (upper row in Fig. 1.1)and group based recommendations (lower row in Fig. 1.1). Each recommendationtechnique will be introduced in the following sections.

Pee

rG

rou

p

?

Link prediction for

directed graphs

?

Follow recommendation in

communities

Partner

recommendation

?

?

?

Social broker

recommendation

Fig. 1.1 Recommendation techniques overview

1.2 Recommendation Techniques 3

1.2.1 Link Prediction

In today’s online social networks it becomes essential to help newcomers as well asexisting community members to find new social contacts. In scientific literature thisrecommendation task is known as link prediction [6]. Link prediction has importantpractical applications in social network platforms. It allows social network platformproviders to recommend friends to their users. Another application is to infermissing links in partially observed networks.

The meaning of a recommendation varies depending on the concrete socialnetwork platform. In platforms such as Facebook a link between two people isestablished if both persons agree to have a friendship relation. The resulting networkis thus undirected because both persons share mutual friendship. Another exampleof a social network is Twitter. In Twitter a link between two persons is established ifa user is interested in news updates of another user. The link is thus directed becausethere is no mutual agreement needed to establish a link. The resulting network isdirected and is also called follower network. User can follow an arbitrary numberof other users to receive news or activity updates. The shortcoming of many ofthe existing link prediction methods is that they mostly focus on undirected graphsonly [7, 14].

1.2.2 Follow Recommendation

Open source development allows a large number of people to reuse and contributesource code to the community. Social networking features open opportunities forinformation discovery, social collaborations, and improved recommendations ofpotential collaborators. Online community and development platforms rely on socialnetwork features to increase awareness and attention among community membersfor improved collaborations.

In networks such as LinkedIn or Facebook friendship is represented as recipro-cated links in an undirected graph. Services such as Twitter and recently GitHub arebased on a directed network approach. A directed network approach allows users tofollow other users based on their interest without requiring them to reciprocate therelationship. In traditional social networks, some users may be followed by manypeople without following many peers themselves (“stars” or “celebrities”). Is thisalso the case for online social collaboration networks such as GitHub? People inGitHub are mostly followed because they work on interesting projects. Thus, thisdifference between conventional social networks and online social collaborationnetworks requires a novel “who to follow” recommendation approach [13].


1.2.3 Partner Recommendation

Scientific collaborations commonly take place in a global and competitive environ-ment. Coalitions and project consortia are formed among universities, companiesand research institutes to apply for research grants and to perform jointly collab-orative projects. In such a competitive environment, individual institutes may bestrategic partners or competitors. Measures to determine partner importance havepractical applications such as comparison and rating of competitors, reputationevaluation or performance evaluation of companies and institutes [10].

However, the success of research and innovation is based on the right balancebetween cooperation and competition. Hence, formation of coalitions and consortiais influenced by partner reputation, institutional constraints, and mechanism of self-organization. Scientific collaboration can be analyzed at the level of researchersthrough co-authorship and citation networks or at the level of organizations orresearch institutions [5]. The former has been widely studied by existing researchwhile the latter lacks a principled approach for selecting and aggregating rankingcriteria that may be influenced by context [11].

1.2.4 Broker Recommendation

The rapid advancement of ICT-enabled infrastructure has fundamentally changedhow businesses and companies operate. Global markets and the requirementfor rapid innovation demand for alliances between individual companies. A vir-tual organization can be defined as follows [4]: an inter-organizational virtualorganization is a temporary network organization, consisting of independent enter-prises (organizations, companies, institutions, or specialized individuals) thatcome together swiftly to exploit an apparent market opportunity. As such, virtualorganizations act in all appearances as a single organizational unit.

Principles found in social network theory are promising candidate techniquesto assist in the formation process and to support flexible and evolving interactionpatterns in cross-organizational environments. In social networks, relations andinteractions typically emerge freely and independently without restricted pathsand boundaries. Research in social sciences has shown that the resulting socialnetwork structures allow for relatively short paths of information propagation [16].While this is true for autonomously forming social networks, the boundaries ofcollaborative networks are typically restricted due to organizational units andfragmented areas of expertise. This demands for novel formation patterns such asbrokers [1, 12].

1.4 Book Outline 5

1.3 Research Datasets

Most experiments performed in this research are based on real datasets. The maindatasets used in this research are all publicly available and include:

• A Twitter dataset has been obtained from [15] and has been used to performvarious experiments in the context of link prediction.

• A Google Plus dataset has been obtained from [15] and has also been used toperform link prediction.

• A GitHub dataset has been obtained from [2] and event information from [3].The resulting dataset has been used to perform link prediction, and followrecommendations.

• A statistical report of research activities in the European’s ICT program has beenobtained from [8] and used to perform experiments in the context of strategicpartner selection and social broker discovery.

The rich set of data collections provides a solid basis for performing comprehen-sive experiments to derive insights and key findings. In the following, an outline ofthe book is given.

1.4 Book Outline

This book is organized as follows. In Chap. 2 we introduce link prediction methodsand metrics for directed graphs. We compare well known similarity metrics andtheir suitability for link prediction in directed social networks. Chapter 3 introducesour “who to follow” recommendation model. Link analysis techniques such asPageRank and HITS provide the basis for a novel “who to follow” recommendationmodel. In Chap. 4 we present a novel approach for measuring and combingvarious criteria for partner importance evaluation. The presented approach is costsensitive, aware of temporal and context-based partner authority, and takes structuralinformation with regards to structural holes into account. Chapter 5 focuses on thenotion of brokers who act as intermediaries between segregated communities. Weintroduce a broker discovery and ranking approach utilizing a link-based brokerimportance model. Finally, the book is concluded in Chap. 6.

The work presented in this book is based on the author’s research performedover the last years. Prior work of the author includes research in crowdsourcing,online communities and social network analysis (see [9]). This work seamlesslygoes into similar fields of research and expands more deeply in social networks andlink formation techniques in social network based systems.

Some material has been adapted for this book based on the author’s journalpublications [11–14]. All material has been revisited, additional experiments havebeen performed, and further enhancements in the concepts and implementation havebeen done. Among others, new concepts include machine learning extensions in the


link prediction framework, the notion of an app and service marketplace in hybridcompute environments, and corporate policies in the context of partner discoveryand selection.

References

1. R. S. Burt. Structural Holes and Good Ideas. The American Journal of Sociology,110(2):349–399, 2004.

2. GitHub. Online: www.github.com (last access June 2015).3. I. Grigorik. Online: www.githubarchive.org (last access June 2015).4. E. C. Kasper-Fuehrera and N. M. Ashkanasy. Communicating trustworthiness and building

trust in interorganizational virtual organizations. Journal of Management, 27(3):235–254, June2001.

5. N. Lavrac, P. Ljubic, T. Urbani, G. Papa, M. Jermol, and S. Bollhalter. Trust modeling fornetworked organizations using reputation and collaboration estimates. IEEE Transactions onSystems, Man, and Cybernetics, Part C, 37(3):429–439, 2007.

6. D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Pro-ceedings of the twelfth international conference on Information and knowledge management,CIKM ’03, pages 556–559, New York, NY, USA, 2003. ACM.

7. L. Lu and T. Zhou. Link prediction in complex networks: A survey. Physica A: StatisticalMechanics and its Applications, 390(6):1150–1170, 2011.

8. F. Munisteri. ICT statistical report for annual monitoring 2011. http://ec.europa.eu/digital-agenda/sites/digital-agenda/files/stream_2012_0.pdf, Feb. 2012.

9. D. Schall. Service Oriented Crowdsourcing: Architecture, Protocols and Algorithms. SpringerBriefs in Computer Science. Springer New York, New York, NY, USA, 2012.

10. D. Schall. Measuring contextual partner importance in scientific collaboration networks.J. Informetrics, 7(3):730–736, 2013.

11. D. Schall. A multi-criteria ranking framework for partner selection in scientific collaborationenvironments. Decision Support Systems, 2013.

12. D. Schall. Socially-based brokerage and composition in virtual communities. IJNVO,12(3):251–281, 2013.

13. D. Schall. Who to follow recommendation in large-scale online development communities.Information and Software Technology, 2013.

14. D. Schall. Link prediction in directed social networks. Social Network Analysis and Mining,2014.

15. Stanford. Online: http://snap.stanford.edu/data/index.html (last access June 2015).16. D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature,

393(6684):440–442, June 1998.

www.github.com

www.githubarchive.org

http://ec.europa.eu/digital-agenda/sites/digital-agenda/files/stream_2012_0.pdf


http://snap.stanford.edu/data/index.html

Chapter 2Link Prediction for Directed Graphs

Abstract In this chapter we introduce link prediction methods and metrics fordirected graphs. We compare well known similarity metrics and their suitabilityfor link prediction in directed social networks. We advance existing techniques andpropose mining of subgraph patterns that are used to predict links in networks suchas GitHub, GooglePlus, and Twitter. Our results show that the proposed metricsand techniques yield more accurate predictions when compared with metrics notaccounting for the directed nature of the underlying networks.

2.1 Friendship Recommendation

Social networks have become ubiquitous in our everyday activities. People usesocial networks to communicate, collaborate, and share information. One of themost profound properties of social networks is their dynamic nature. People joinand leave social networks. Also, the circle of friends may frequently change whenpeople establish friendship through social links or when their interest in a socialrelationship ends and the link is removed. Due to the large number of users beingpart of today’s online communities, it becomes increasingly cumbersome to findnew contacts and friends. Many social network platform providers assist their usersin establishing new social relations by making recommendations. The meaning ofa recommendation varies depending on the concrete social network platform. Inplatforms such as Facebook [11] a link between two people is established if bothpersons agree to have a friendship relation. The resulting network is thus undirectedbecause both persons share mutual friendship. Another example of a social networkis Twitter [43]. In Twitter a link between two persons is established if a user isinterested in news updates of another user. The link is thus directed because thereis no mutual agreement needed to establish a link. The resulting network is directedand is also called follower network. User can follow an arbitrary number of otherusers to receive news or activity updates. The “follow” feature is in widespreaduse in social networking services such as Twitter [43], Facebook [11], or Google-Plus [14]. In Facebook, the users can follow (or unfollow) news updates of theirfriends. The social (undirected) link between friends is maintained irrespective ofthe follow relationship. Recently collaborative online platforms such as GitHub [12]


7

8 2 Link Prediction for Directed Graphs

offer also social network features (e.g., following). GitHub is an online “coding”community. Users contribute code and share repositories with the community. The“follow” feature in GitHub allows users to keep track of updates regarding varioussoftware development activities such as coding or bug-fixing.

All of the before mentioned platforms have a large number of users and benefitfrom “follow” recommendations. Such recommendations can be formulated as alink prediction task. Link prediction in a directed follower network has the purposeto give recommendations who a given user should follow. Despite of some existingliterature in the area of link prediction on Twitter (for example, see [8, 33]), orprediction of positive/negative edges [23], there is relatively little work on linkprediction in directed social networks. As reported in a recent survey [26], theexisting studies on link prediction overwhelmingly focus on undirected networks.This work specifically addresses the link prediction problem in directed socialnetworks.

The essential approach we follow in this work is to measure the similaritybetween a pair of nodes using structural information. We assume that a socialnetwork is modeled as a directed graph G(V,E) where vertices V in the graphdepict people and edges E between vertices relations between people. Structuralinformation is, for example, the number of common neighbors between two nodes.Generally speaking, similarity of nodes can be measured as the number of commonfeatures. The goal of link prediction is to predict whether a link between two userswill be established or if a link in a partially observed network is missing. The lattercase is a common problem when the social network is obtained through crawling.

Here we provide the following key contributions:

• We propose link prediction techniques using graph patterns (triads). Predictionsare then given based on the probability that a given type of triad pattern will beclosed.

• Here we introduce a metric called Triadic Closeness. The application of themetric is discussed and evaluated.

• We have designed and implemented a link prediction framework. The mainelements of the framework are discussed.

• We present an analysis of our link prediction techniques. We use three differentsocial networks including GitHub, GooglePlus and Twitter to validate theproposed approach.

This chapter is structured as follows: Sect. 2.2 discusses related work in thecontext of social networks and link prediction. Standard similarity metrics areintroduced which will be compared against our proposed approach. In Sect. 2.3 thelink prediction framework is introduced followed by a discussion on our predictionapproach Sect. 2.4. The data collection is introduced in Sect. 2.5. The results andexperiments are discussed in Sect. 2.6. The conclusion with outlook to further workis given in Sect. 2.7.

2.2 Background in Link Prediction 9

2.2 Background in Link Prediction

We discuss related work by (1) highlighting literature and related approaches withrespect to link prediction and node similarity indices and (2) local structures andpatterns in social networks. Similarity indices are structured in local, global, andsemi-local indices.

• Local Similarity Indices. A wide range of similarity metrics exist that can beused to predict links based on “local” information [1, 2, 10, 22, 31, 32, 34, 46].Local information is typically obtained by comparing degree of overlap oftwo individual friendship networks. Liben-Nowell et al. [24] systematicallycompared a number of local similarity indices in many real networks. Thesemetrics focus on undirected graphs without considering directed relations. Theadvantage of local indices is that they can be computed for large-scale networksand do not require a huge amount of computational resources.

• Global Similarity Indices. Global metrics take the properties of the wholesocial network into account. The Katz index [20] is based on the ensembleof all paths between two nodes in the network. The index is computed as thesum over the collection of all paths and is exponentially damped by lengthto give the shorter paths more weight. Another class of metrics are randomwalk techniques. Well-known algorithms such as PageRank [30] can be usedto compute global importance metrics. The prediction is than based on thenode’s PageRank importance score. PageRank can be personalized to performranking with respect to a certain “contexts” or topics [36]. A direct applicationof personalized PageRank are Supervised Random Walks (SRW). In [5], SRWwere proposed to recommend links in networks such as Facebook. SimRank [18]is based on the idea that two nodes are similar if they are related to similar nodes.

Global similarity indices naturally require information regarding the wholetopology of the social network. Indeed, this information may not be availabledue to, for example, partially observed networks or in cases where the platformis decentralized. Another important aspect is performance and resource con-sumption. The calculation of global indices may be very time-consuming andfor large-scale networks the computation may not be feasible.

• Semi-Local Indices. Instead of taking the whole topology into account, semi-local indices omit information that makes little contribution to improve theprediction algorithm’s accuracy. The Local Path Index [28, 46], for example,provides a trade-off between computational complexity and accuracy. LocalRandom Walks [5, 25] follow a similar idea by omitting information from verydistant neighbors in the network.

Further methods for link prediction include hierarchical models [9], stochasticblock models [3, 16, 45], probabilistic models [26], and methods consideringpositive/negative links [42] (see [26] for details on models and methods).

• Patterns, Triads and Motifs. The approach as proposed in this work takesthe directed nature of follower networks into account. Triadic closure in socialnetworks is the hypothesis that the formation of an edge between u and vis strongly dependent upon the degree of overlap of u’s and v’s individual


friendship networks (for example, see [39, 44]). However, an important theoryin this context is the “strength of weak ties” [15] stressing the cohesive powerof seemingly less important ties. Holland and Leinhardt [17] developed manyimportant theories about social relations and how to detect structure in directednetworks. As a more general conceptual framework, network motifs [4, 29]represent elementary elements in complex networks. Complex networks includesocial, technological, or biological networks. We build upon these ideas andpropose link prediction considering triad patterns in social networks.

Finally, previously we stressed the importance of social networks and formationsof social groups (teams) in the context of collaborative environments and novelcrowdsourcing environments [35, 37, 38]. We foresee important applications of linkprediction in these areas.

2.3 Software Framework

One of the goals of the present work has been to design a modular and reusableframework for link prediction in social networks. The framework must be able tohandle different social networks that can range from a few thousands to millions ofnodes in the social graph. This section gives an overview of the link predictionframework architecture and a description of the evaluation methodology. Theframework is extensible with regards to prediction metrics and algorithms.

The main elements and layers are depicted by Fig. 2.1. The overall frameworkis segmented into a multilayer architecture consisting of three layers: (1) DataLayer, (2) Prediction Layer, and (3) Presentation Layer. Each layer has distinctresponsibilities and each block within a given layer provides a well-defined set ofinterfaces. The central goal of the system is to provide an extensible framework thatcan be enhanced with new metrics and prediction methods. In addition, it should beeasy to add new social networks (datasets) and to compare the results of differentexperiments and algorithms. The architectural layers are described in the following.

Data Layer The bottom-most layer is concerned with low-level data handlingand persistence management. With regards to the basic link prediction task, theData Layer passes an instance of a social network graph to the upper layer. Ourframework has the ability to access data from (a) the Flat File Store and (b) theDatabase. The Flat File Store is used for simple social network data files that aresmall to medium in their size (e.g., 103–105 nodes) and is read-only. The Parserreads and interprets files stored in the Flat File Store. The Parser performs the pre-stage processing for the Graph Mapper, which creates the social network graphincluding node and edge attributes (if available). The Database is able to managelarge social networks (large networks may consist of up to 5×107 nodes1) that may

1The upper bound for which the Data Layer has been tested was a network consisting ofapproximately 5×107 nodes and 1.5×109 edges.

2.3 Software Framework 11

Data Layer

File System

Prediction Layer

Presentation Layer

Parser Graph Mapper Import Library Update Handler Query Handler

Flat File Store Database

Link Predictor Experiment Manager

Evaluation Metrics

Iterative AlgorithmsSimilarity Metrics

Similarity Library

Triad Library

Graph Visualization Metric Visualization Matlab Export

Machine Learning

Fig. 2.1 Link prediction framework overview

also include details regarding user profiles and user activity. The Query Handlerinterfaces with the database to retrieve the social networks graph structure, userprofile details, other social network related information and also information relatedto patterns and experiments. The Update Handler is responsible for persistencemanagement and writes data to the database. The Import Library allows externalsocial networks and networks stored in the Flat File Store to be migrated to theDatabase. Migration from the Flat File Store is needed if performance is insufficient(read operations) or if the management of social network (meta)data through the FlatFile Store becomes impractical. Another source of information are external APIs ofsocial network or community platforms. GitHub, for example, provides an API [13]to retrieve the follower network. The Import Library provides a rate-aware APIinvocation scheduler to retrieve large social networks respecting the social networkproviders’ API policies (e.g., number of invocations per hour).

Prediction Layer The middle layer groups the logic for metric calculation, patternmining (triad detection), prediction, and prediction result evaluation. The SimilarityLibrary is responsible for calculating various similarity indices including local


Table 2.1 Similarity-based metrics

Metric Definition Description

Common Neighbors (CN) |Γ (u)∩Γ (v)| Intersection set size of jointneighbors between nodes uand v

Salton Index (SA)|Γ (u)∩Γ (v)|√

ku × kvThe degree of u and v isdepicted by ku and kv

respectively. In literature, theSalton index [34] is alsocalled the cosine similarity

Jaccard Index (JA)|Γ (u)∩Γ (v)||Γ (u)∪Γ (v)| Jaccard similarity index with

Γ (u) = /0 and Γ (v) = /0

Sørensen Index (SO)2|Γ (u)∩Γ (v)|

ku + kvThe Sørensen index [40] ismainly used for ecologicaldata. The index is identical tothe Dice’s coefficient

Hub Promoted Index (HP)|Γ (u)∩Γ (v)|

min(ku,kv)The links adjacent to hubsare likely to be assignedhigher scores since thedenominator min(ku,kv) isdetermined by the lowerdegree [31]

Hub Depressed Index (HD)|Γ (u)∩Γ (v)|max(ku,kv)

Similar to HPI but with theopposite effect with regardsto adjacent hub links

Leicht-Holme-Newman Index (LHN)|Γ (u)∩Γ (v)|

ku × kvThe denominator ku × kv isproportional to the expectednumber of commonneighbors [22]

Adamic-Adar Index (AA) ∑z∈Γ (u)∩Γ (v)

1log(kz)

The index assigns theless-connected neighborsmore weight than CN [1]

Resource Allocation Index (RA) ∑z∈Γ (u)∩Γ (v)

1kz

Similar to AA, RA depressesthe contribution of thehigh-degree commonneighbors [46]

similarity indices (see Table 2.1), global similarity indices, and semi-local indices.The system block Similarity Metrics computes local similarity indices (includingTC) and semi-local indices including Local Path Index (see [26]) and the ShortestPath Index (i.e., the average Dijkstra Shortest Path Index between, say, u and v’sneighbors Γ (v) to measure similarity between u and v). Iterative Algorithms havebeen designed to calculate metrics such as SimRank [18] and PageRank variantssuch as personalized PageRank [19, 30, 36]. Note, the evaluation of global and semi-local indices is not within the scope of this work. Here we focus on local indices in

2.3 Software Framework 13

conjunction with triad patterns. The relationship between (local) triad patterns andglobal or semi-local indices is a whole new subject of investigation itself. Next tothe Similarity Library, the Triad Library provides the capabilities for triad patterndetection and caching. Clearly, scanning the entire social network graph for triadpatterns is a time consuming task with complexity O(m) where m is the number ofedges in the graph (see [6] for related triad detection algorithms).

Thus, triad pattern mining is usually performed once for a given social networkand subsequent metric calculations use the precomputed triad frequencies.

The Link Predictor is the main component that performs the prediction task. Thiscan be done to predict future links, which do not yet exist, or predict links, whichare “missing” (unobserved). Here we focus on the latter case where we assume thatcertain links are missing between pairs of nodes. To test the metrics’ accuracy, wedivide the set of edges E randomly into the prediction (or training) set EP and thevalidation set EV . The prediction algorithm basis its calculation upon the predictiongraph GP(V,EP) whereas the accuracy of the prediction results is determined byinspecting the missing (randomly removed) edges in EV . Note, no information fromthe set EV is allowed to be used for prediction so that EP∪EV = E and EP∩EV = /0.Furthermore, to speed up computation of prediction results, the Link Predictor canperform node sampling to calculate predictions for a subset of node pairs insteadof calculating predictions for all node pairs in the entire graph. For that purposethe predictor samples a set of random nodes UP with k > 0 and divides the set intotwo subsets UR and UT with UR ∪UT = UP, UR ∩UT = /0, and UP ⊂ U. The setUP contains all nodes that are used for link prediction, the set UR contains the rootnodes (source vertices of predicted links) and UT contains the target nodes (targetvertices of predicted links). The set EV contains only directed links whose sourcevertex is in UR and whose target vertex is in UT . For one given experiment the samenode set UP and edge set EV is used to be able to compare the results of differentmetrics among each other.

The basic steps of the Link Predictor are straightforward. Algorithm 1 shows thesteps:

Algorithm 1 Link prediction algorithm1: input: G(U,EP),UR,UT ,EV

2: for each User u ∈ UR do3: for each User v ∈ UT do4: // True Answer5: answer ← HasEdge(u,v,EV )6: // Calculate Similarity Scores7: for each Metric m ∈ M do8: suv ← CalculateScore(u,v,m,G)9: // Save Result to Experiment Database

10: SaveResult(u,v,m, suv,answer)11: end for12: end for13: end for


The predictor loops through UR and UT and calculates similarity scores suv usingeach metric m ∈ M provided by the Similarity Library. M can be configured dynam-ically by enabling/disabling the desired metrics to be used in each experiment. Theprediction result suv and the actual true answer 0,1 (HasEdge is a binary classifierthat determines whether or not the set EV contains a directed edge between u and v)are saved in the experiment database.

The Machine Learning component provides an additional approach to linkprediction by learning ensembles of decision trees. This technique is called randomforest2 for classifying if there will be link between two nodes from an input vectorof node features (e.g., number of common friends, age, interests, etc.). A detaileddiscussion on random forest based link prediction is not provided in this book.

To compare the results of different similarity algorithms, the Evaluation Metricscomponent provides standardized comparison methods. In particular, we compareresults using the Receiver Operating Characteristic (ROC) curve. ROC curves arecommonly used in the machine learning community for the link prediction task[2, 9]. ROC curves are created by plotting the true positive rate over the false positiverate. The area under the ROC curve (AUC) [7] can be interpreted as the probabilitythat a randomly chosen missing link (i.e., a link in EV ) is given a higher score thana randomly chosen non-existing link [26].

Higher AUC values, which are in the range [0,1], indicate better predictionperformance. Another common metric to measure a prediction algorithm’s accuracyis HitRatio (or recall). Generally, HitRatio is defined as the ratio of selected relevantitems to the number of relevant items. For example, the HitRatio is typicallymeasured at a threshold HitRatio@n where n is the number of selected items. In thiswork, we focus on both ROC curves and AUC as well as HitRatio for experimentevaluation and metric comparison. The Experiment Manager saves and retrievesexperiments results from the Database and computes aggregates of results.

Presentation Layer The frontend of the prediction framework is a presentationlayer that has visualization and export capabilities. The Graph Visualization allowsto view typical graph properties by mapping node/edge features into a visualrepresentation. To do so, the correspondence between discrete or continuous valuesand visual properties (color, node size, etc.) needs to be established. The MetricVisualization is the most important tool for evaluating the results of the similarityalgorithms and link predictor. ROC curves help to identify which methods andparameter settings are best suited for a given type of social network. The variousnetwork idiosyncrasies such as average degree 〈k〉 may demand for metric tuning.A detailed discussion regarding metric accuracy will follow in Sect. 2.6. The MetricVisualization helps to understand the accuracy and suitability of different metrics.The Matlab Export allows to export experiment results to a Matlab compatibleformat to utilize various Matlab toolboxes.

2Web page: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

2.4 Predicting Friendship 15

2.4 Predicting Friendship

2.4.1 Similarity-Based Metrics

The focus of this work are local similarity indices and their extension towardslink prediction using graph patterns. Table 2.1 lists a set of well-known similaritymetrics. The shared feature of the metrics is that computation of similarity is basedon the set of joint neighbors. These metrics provide the basis for the definition ofour Triadic Closeness (TC) similarity metric. The definition of the TC metric willbe provided in the following. Furthermore, the metrics in Table 2.1 will be usedin a comparative study to test the effectiveness of the proposed technique. Table 2.1provides a mathematical definition along with a brief description of the given metric.Given node u, the set of neighbors is depicted by Γ (u). The degree of node u isdepicted as ku = |Γ (u)|.

These metrics have the drawback that they do not account for the directed natureof follower networks. In other words, the metrics in Table 2.1 do not allow fordifferentiation whether a link will be established from, say, u to v or from v to u. Asa next step we introduce patterns to account for directed links in social networks.

2.4.2 Triad Patterns

In social network theory a basic unit of analysis is a dyad. In undirected networks, adyad is a pair of nodes who may share a social relation with one another. In directednetworks, a dyad consists of a pair of nodes who may share a social relation throughmutual links, an unreciprocated relation, or no relation. Unreciprocated means thatone node is interested in the other node but not vice versa. A triad is a set of threeparties, which consists of three dyads. A triad is “closed” if all nodes are linked witheach other in some manner. A closed triad is also called triangle.

Figure 2.2 shows triad patterns of the actors u, z, and v. Edges are directedbecause our aim is to model patterns in directed social networks (e.g., followernetworks). All patterns are open triads with z being the common neighbor of u and v.The questions with regards to link prediction can be stated as follows:

• What is the likelihood that u will establish a link towards v?• In a partially observed network, is there a missing link pointing from u to v?

In Fig. 2.2 all possible connectivity configurations between u, z, and v are shownwith the condition that u and v are not directly connected. In this work, open triadsare labeled as T0X where X is the running index with X = [1,9]. The pattern T01shows the case where u and z as well as v and z are mutually connected. Accordingto the theory of triadic closure, the chances are high that u will also connect tov (i.e., z’s friends will likely become u’s friends). In a follower network, the pairu and z and v and z would mutually follow each other. In T02, only u and z are


z

u v

z

u v

z

u v

z

u v

z

u v

z

u v

z

u v

z

u v

z

u v

a b c

d e f

g h i

Fig. 2.2 Triad graph patterns. (a) T01. (b) T02. (c) T03. (d) T04. (e) T05. (f) T06. (g) T07.(h) T08. (i) T09

Fig. 2.3 Closed triads T1X.(a) T11. (b) T19 z

u v . . .

z

u v

a b

mutually connected to each other. The node v is followed by z but the relationshipis not reciprocated. T03, T05, and T07 depict complementary cases where a mutualrelation among one dyad exists. The other cases depicted by T04, T06, T08, and T09show patterns without mutual relations among the dyads. The goal of link predictionis to determine which of the triads are or will be closed (i.e., becoming a triangle).A triad can be closed as follows if u establishes a link to v, v establishes a link to u,or if u and v establish a link mutually.

The following figures show closed triads based on T01 and T09 (the first and thelast pattern of Fig. 2.2 are shown for brevity). Figure 2.3 shows the patterns wherethe triads are closed from u to v. The patterns are labelled similarly as in Fig. 2.2 butwith a base offset of 10. Thus, the label of triads that are closed via u to v is T1Xwith X = [1,9].



u v . . .

z

u v

a b


u v

. . .

z

u v

a b

In the same manner, the label of triads that are closed via v to u is T2X withX = [1,9] (see Fig. 2.4).

Finally, if the triads are closed by mutually connected u and v (see Fig. 2.5) thelabels T3X with X = [1,9] are applied. To summarize our discussions regardingtriad patterns, triads that are relevant for link prediction may have 36 differentconfigurations with regards to how nodes are connected to each other throughdirected links. Open triads have 2 connected dyads and have 2 to 4 links. Closedtriads have 3 connected dyads and have 3 to 6 links.

The next step is to introduce a novel metric to calculate a score for link predictionbased on the presented triad patterns.

2.4.3 Triadic Closeness

When considering a given pattern T0X (open triads as depicted by Fig. 2.2) thebasic question is which of those patterns are likely closed triads (in the case ofmissing links) or which of those patterns will likely be closed in the future. Herewe introduce Triadic Closeness (TC) to measure how close a pair of disconnectednodes are in terms of how the pair is connected through triads. TC is based on thefollowing basic idea:

Triadic Closeness ∝Number of closed triads

Number of potentially closed triads

Triadic Closeness is thereby based on the ratio of the number of closed triadsversus the number of potentially closed triads. Indeed, the chance that a given T0Xtriad will be closed depends on the actual social network and is most likely not thesame for all follower networks. We define the TC score of the pair u and v in adirected graph G as follows:


TCuv = ∑z∈Γ (u)∩Γ (v)

wP(u,v,z)×w(z) (2.1)

The score is calculated over all common neighbors. For a given neighbor z,the product is calculated by the triad weight wP(u,v,z) times the neighbor specificweight w(z). The triad weight wP(u,v,z) is defined as follows:

wP(u,v,z) =F(T(u,v,z)+ 10)+F(T(u,v,z)+ 30)

F(T(u,v,z))(2.2)

The function T(u,v,z) retrieves the triad pattern ID that matches the triad u, v, andz. The term (T(u,v,z)+10) simply means that the ID of the closed triad counterpartof T(u,v,z) is obtained (closed via u to v). Similarly, (T(u,v,z)+30) gets the closedcounterpart triad ID wherein T(u,v,z) is closed through mutual links between uand v. The function F(·) retrieves the frequency of the given triad pattern. Priorto performing the calculation of wP(u,v,z), the frequencies of triads in a particularsocial network are computed by an algorithm and saved in a database. Afterwards,F(·) simply retrieves the triad frequency from a database (zero if the given triad wasnot detected in the graph).

The neighbor specific weight w(z) can be tuned to account for the characteristicsof specific social networks. For the basic case with w(z) = 1, TCuv is only based onwP(u,v,z). We define the weight as w(z) = 1

kzto give less connected neighbors more

weight and thus TCuv becomes:

TCuv = ∑z∈Γ (u)∩Γ (v)

wP(u,v,z)× 1kz

(2.3)

Neighbors that are unique to only a few users are weighted more with w(z) = 1kz

than popular neighbors. Popular neighbors are those with a high degree kz andespecially in networks such as Twitter these neighbors may be celebrities that maynot have great significance for the triadic closure process. From a technical pointof view, TCuv’s behavior is comparable to Adamic-Adar Index (AA) [1] or theResource Allocation Index (RA) [46] (see also Table 2.1). Indeed, the indices lackthe notion of patterns and have been designed with undirected friendship networksin mind.

2.4.4 Triadic Closeness Example

To give a concrete example, consider the artificial network as depicted by Fig. 2.6and suppose triadic closeness TCgh shall be calculated between the nodes g and h.

The triad frequencies are listed in Table 2.2. Algorithm 2 shows the steps forcounting the frequency of patterns in graph G.


Fig. 2.6 Example network

f

g h

d ea

b c

Table 2.2 Triads in examplenetwork

ID Pattern Frequency ↓1 u ↔ z ↔ v 10

31 u ↔ z ↔ v ↔ u 6

2 u ↔ z → v 2

3 u → z ↔ v 2

4 u → z → v 2

5 u ↔ z ← v 2

7 u ← z ↔ v 2

8 u ← z ← v 2

12 u ↔ z → v ← u 2

27 u ← z ↔ v → u 2

36 u → z ← v ↔ u 2

11 u ↔ z ↔ v ← u 1

21 u ↔ z ↔ v → u 1

32 u ↔ z → v ↔ u 1

33 u → z ↔ v ↔ u 1

35 u ↔ z ← v ↔ u 1

37 u ← z ↔ v ↔ u 1

As shown in Fig. 2.6, the node f connects g and h via triad T01. Related to T01for calculation are T11 and T31. The weight wP(u,v,z) for the network in Fig. 2.6is given as wP(u,v,z) = 6+1

10 = 0.7. Thus, TCgh is given as TCgh = 0.7× 14 = 0.175.

Using the triadic closeness concept, with a probability of 0.17 T01 will be closedfrom g to h.


Algorithm 2 Pattern counting algorithm1: input: directed graph G2: for each Vertex u ∈ G do3: for each Vertex z ∈ getNeighbors(G,u) do4: for each Vertex v ∈ getNeighbors(G, z) do5: if equals(v,u) then6: continue7: end if8: id ← getPatternId(G,u, z,v)9: count(id) // increment count by 1

10: end for11: end for12: end for

Fig. 2.7 Indegreedistribution of GitHub

100

100 101 102 103 104

102

104

106

Number of Followers

Num

ber

of U

sers

N(k) ~ k −1.49

2.5 Data Collection

The following datasets have been used for testing the link prediction techniques.

• GitHub. The first network is based on GitHub’s follower network [12]. Thegraph was imported in our prediction framework through the GitHub API [13] inDecember 2012. The basic network characteristic in terms of follower (indegree)distribution is depicted by Fig. 2.7.

The plot shows a power-law distribution with the basic property N(k)∼ k−1.49

where N(k) is the number of nodes with indegree k. The follower graph counts1,105,150 users and 1,898,034 following relations (edges). Nearly 70 % of users(767,975) have no followers (zero indegree) and again about 70 % of users(769,283) do not follow any other user (zero outdegree).

2.5 Data Collection 21

Fig. 2.8 Indegreedistribution of GooglePlus

100 101 102 103 104 105100

101

102

103

104

Number of Followers

Num

ber

of U

sers

• GooglePlus. The second network represents a subset of nodes and edges fromGooglePlus [14]. We have obtained the network (plain text files) from theStanford Large Network Dataset Collection [41]. A description of the networkis also given in [27]. The degree distribution is depicted by Fig. 2.8. Thenetwork consists of 107,614 nodes and 13,673,453 edges. At a technical level,the network is managed within the prediction framework’s Flat File Store.The average degree 〈k〉 = 127 in this network is much higher than in theGitHub-based follower network, which has only 〈k〉= 1.7.

A possible explanation for the high differences in the average degree 〈k〉is the primary purpose of the platforms. GitHub is a platform for hosting andsharing source code repositories and the “follow” feature is by many people usedto follow top-developers. In GooglePlus, many people follow other people theypersonally know and use the platform to maintain social relations.

• Twitter. The third network is based on a subset of nodes and edges of Twit-ter [43]. The network was also obtained from the Stanford Large Network DatasetCollection [41] and is also managed within the Flat File Store. The networkcounts 81,306 nodes and 1,768,149 edges, thereby making it the smallest networkin our experiments. The degree distribution is depicted by Fig. 2.9. The averagedegree is given as 〈k〉 = 21.7. In addition, we show the degree of a muchlarger Twitter-based network in Fig. 2.9 in the top-right corner to show how thepresented network subset relates to the large network. The large network wasobtained in July 2009 by [21] and counts roughly 5× 107 nodes and 1.5× 109

edges.Both networks follow a similar distributional shape. In our experiments,

only the smaller network has been used. In combination with the other largernetworks (GitHub and GooglePlus), the smaller Twitter-based network providesa sufficient basis to compare link prediction metrics.


Fig. 2.9 Indegreedistribution of Twitter

100 101 102 103 104100

101

102

103

104

Number of Followers

Num

ber

of U

sers

100

101 102 103 104 105 106 107100

102

104

106

108

Number of Followers

Num

ber

of U

sers

b

a

2.6 Evaluation

We obtained three directed social follower networks to compare different metricsand to validate the suitability of TC. The networks (the whole follower network orsubsets thereof) include GitHub [12], GooglePlus [14], and Twitter [43].

2.6.1 Configuration

Here we discuss the link prediction configuration settings that were used toperform experiments. Experiments have been performed using the three previouslyintroduced datasets: GitHub (GH), GooglePlus (GP), and Twitter (TW). Table 2.3lists the prediction user set size |UP|, the validation set size |EV | and the root setsize |UR|.

For GitHub, the prediction user set UP consists of 10 % of the users which areconnected through 148,796 edges. From those edges we sampled 29,759 randomedges (20 %) and added them to EV . The root set UR is populated with 100 nodes.The size of the prediction target set UT can be easily calculated as UT =UP−UR. InGooglePlus, we use the entire user base for UP, which also results in approximatelythe same size of UP as for GitHub’s prediction user set. EV consists of 2,709,731edges (20 %) and the root set UR consists also of 100 nodes. The Twitter-baseddataset has the smallest number of nodes and, in the same manner as for GooglePlus,we also select all nodes for UP. EV has 333,577 edges (again 20 %) and the sameroot set size as for GitHub and GooglePlus is applied.

Using the configuration settings in Table 2.3, we obtain the graph GP(U,EP)upon which triad pattern mining and prediction is performed. The relative frequencyof each triad pattern in GP is listed in Table 2.4. The total number of triadpatters in each network is 423,034,532 (GitHub), 14,402,457,330 (GooglePlus), and

2.6 Evaluation 23

Table 2.3 Configurationsettings for link prediction

Configuration GH GP TW

|UP| 110,515 107,614 81,306

|EV | 29,759 2,709,731 333,577

|UR| 100 100 100

Table 2.4 Ratio of triads indifferent social networks

ID Pattern GH (%) GP (%) TW (%)

9 u ← z → v 53.23 9.97 7.04

6 u → z ← v 39.09 18.79 39.96

8 u ← z ← v 1.50 12.71 5.95

4 u → z → v 1.37 12.71 5.95

7 u ← z ↔ v 1.22 5.69 4.36

2 u ↔ z → v 1.14 5.69 4.36

5 u ↔ z ← v 0.63 3.93 6.21

3 u → z ↔ v 0.63 3.93 6.21

1 u ↔ z ↔ v 0.23 1.44 4.07

28 u ← z ← v → u 0.10 2.74 0.86

29 u ← z → v → u 0.10 2.74 0.86

19 u ← z → v ← u 0.10 2.74 0.86

26 u → z ← v → u 0.10 2.74 0.86

16 u → z ← v ← u 0.09 2.74 0.86

14 u → z → v ← u 0.09 2.74 0.86

25 u ↔ z ← v → u 0.05 0.62 0.70

39 u ← z → v ↔ u 0.05 0.62 0.70

13 u → z ↔ v ← u 0.04 0.62 0.70

27 u ← z ↔ v → u 0.04 1.41 0.87

12 u ↔ z → v ← u 0.04 1.41 0.87

36 u → z ← v ↔ u 0.04 1.41 0.87

31 u ↔ z ↔ v ↔ u 0.03 0.29 0.97

11 u ↔ z ↔ v ← u 0.01 0.21 0.53

33 u → z ↔ v ↔ u 0.01 0.21 0.53

21 u ↔ z ↔ v → u 0.01 0.21 0.53

37 u ← z ↔ v ↔ u 0.01 0.21 0.53

32 u ↔ z → v ↔ u 0.01 0.21 0.53

35 u ↔ z ← v ↔ u 0.01 0.21 0.53

17 u ← z ↔ v ← u 0.00 0.15 0.27

23 u → z ↔ v → u 0.00 0.15 0.27

22 u ↔ z → v → u 0.00 0.15 0.27

38 u ← z ← v ↔ u 0.00 0.15 0.27

15 u ↔ z ← v ← u 0.00 0.15 0.27

34 u → z → v ↔ u 0.00 0.15 0.27

18 u ← z ← v ← u 0.00 0.10 0.13

24 u → z → v → u 0.00 0.10 0.13


296,075,286 (Twitter). The patterns at the top in Table 2.4 are open triads T01–T09followed by closed triads T11–T36. The patterns are sorted in descending order bythe column GH.

The most common triad pattern in GitHub with 53.23 % is where both u and v arefollowed by z but neither u nor v follow z. This pattern has a relative frequency of7.04 % in Twitter, thereby being the second most common pattern in Twitter, and arelative frequency of 9.97 % in GooglePlus, thereby being the fourth most commonpattern in GooglePlus. The second most common pattern in GitHub, and the mostcommon pattern in GooglePlus and Twitter, is the pattern where both u and v followz but z follows neither of them. Triadic Closeness is defined as the likelihood thata given open triad (T01–T09) will be closed in a given social network. Thus, thetriad patterns T01–T09 are seen in relation with the closed triads to determine thecloseness between two nodes.

In the following the prediction results are presented by plotting ROC curves andcalculating AUC for each metric. TC is based on the frequencies of respective triadsin Table 2.4.

2.6.2 Prediction Results

In all our experiments, we use the metrics as defined in Table 2.1 and the introducedtriadic closeness TC. The configuration settings for the experiments are given inTable 2.3. As a general note, an AUC value above 0.5 indicates that a predictionalgorithm performs better than pure chance. Higher AUC indicates better predictionaccuracy. With regards to ROC curves, we group metrics into a single class if theirAUC values are identical.

GitHub As a first step we present the prediction results of GitHub-based experi-ments. The ROC curves are depicted by Fig. 2.10. The metrics and class correspon-dence is established in Table 2.5. We provide the AUC values along with the curvesin Fig. 2.10 and also in Table 2.5 for easier readability.

In GitHub, TC corresponds to C1 and an AUC value of 0.98. Thus, TCoutperforms the other methods and delivers the highest accuracy. RA delivers alsovery good results with an AUC of 0.97. Metrics below 0.5 such as SA, LHN, JA, SO,and HD are not suitable prediction methods for the GitHub-based social networks.The HitRatio at different thresholds is shown in Table 2.6. The HitRatio until HitRa-tio@50 is identical for CN, AA and TC. For HitRatio@100 and HitRatio@1000 TCoutperforms other methods. RA performs best at HitRatio@5000.

GooglePlus Next, we present the prediction results of GooglePlus-based experi-ments. The ROC curves are presented in Fig. 2.11 and the class correspondence isgiven in Table 2.7. TC has an AUC value of 0.96 followed by RA and AA with 0.94.Also, the simple CN metric delivers quite good results with an AUC of 0.93. LHNperform worst but still has an AUC above 0.5 thereby delivering acceptable resultsbut with low accuracy.

2.6 Evaluation 25

Fig. 2.10 ROC curves forGitHub-based results

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

C1: AUC=0.98C2: AUC=0.97C3: AUC=0.93C4: AUC=0.71C5: AUC=0.54C6: AUC=0.28C7: AUC=0.26C8: AUC=0.24

Table 2.5 AUC Classes forGitHub-based results

Class Definition AUC

C1 TC 0.98

C2 RA 0.97

C3 AA 0.93

C4 CN 0.71

C5 HP 0.54

C6 SA 0.28

C7 LHN 0.26

C8 JA, SO, HD 0.24

The HitRatio for GooglePlus-based results is shown in Table 2.8. TC performsbest at all thresholds and delivers the best results compared with the other methods.RA performs slightly better than AA in terms of HitRatio. LHN performed worstin terms of AUC but slightly better than HP with regards to HitRatio. Overall, onlyTC, RA and AA are suitable methods to perform link prediction in GooglePlus. Allother methods have no correct results at HitRatio@30.

Twitter Finally, we present the prediction results of Twitter-based experiments.The ROC curves are presented in Fig. 2.12 and the class correspondence is given inTable 2.9.

Again, TC performs best with an AUC value of 0.97. Second ranked are againRA and AA with an AUC of 0.96. Note in this context that all three metrics, TC,RA, and AA, give higher weights to those neighbors who have a lower degree (i.e.,“hub-depressed” behavior). However, by considering triad patterns TC outperformsall other metrics and delivers the most accurate results. Again, LHN ranks last withan AUC of 0.71. None of the metrics have an AUC lower than 0.5. Here the lowest


Table 2.6 HitRatio (%) for GitHub-based results

HitRatio HitRatio HitRatio HitRatio HitRatio HitRatio

Metric @10 @30 @50 @100 @1000 @5000

CN 12.5 12.5 12.5 12.5 12.5 50.0

SA 0.0 0.0 0.0 0.0 0.0 0.0

JA 0.0 0.0 0.0 0.0 0.0 0.0

SO 0.0 0.0 0.0 0.0 0.0 0.0

HP 0.0 0.0 0.0 0.0 0.0 0.0

HD 0.0 0.0 0.0 0.0 0.0 0.0

LHN 0.0 0.0 0.0 0.0 0.0 0.0

AA 12.5 12.5 12.5 12.5 37.5 62.5

RA 0.0 0.0 12.5 12.5 37.5 87.5

TC 12.5 12.5 12.5 37.5 62.5 75.0

Fig. 2.11 ROC curves forGooglePlus-based results

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate


AUC is 0.71 making all metrics suitable methods for link prediction. As mentionedbefore, this was not the case for GitHub where many metrics perform below an AUCof 0.5.

The HitRatio for Twitter-based results is shown in Table 2.10. TC performs bestuntil HitRatio@100. For HitRatio@1000 and HitRatio@5000 RA performs slightlybetter. However, TC still performs best with respect to AUC and true positive rate.Also, the simple common neighbor (CN) methods provides acceptable results interms of HitRatio. LHN performs worst with regards to HitRatio (only 5.2 % atHitRatio@5000) and also with regards to AUC.

2.6 Evaluation 27

Table 2.7 AUC Classes forGooglePlus-based results


C1 TC 0.96

C2 RA, AA 0.94

C3 CN 0.93

C4 SA 0.90

C5 JA, SO 0.88

C6 HD 0.87

C7 HP 0.82

C8 LHN 0.58

Table 2.8 HitRatio (%) for GooglePlus-based results

HitRatio HitRatio HitRatio HitRatio@ HitRatio@ HitRatio

Metric @10 @30 @50 @100 @1000 @5000

CN 0.0 0.0 0.1 0.3 2.3 8.1

SA 0.0 0.0 0.0 0.0 0.1 1.4

JA 0.0 0.0 0.0 0.0 0.5 1.7

SO 0.0 0.0 0.0 0.0 0.1 1.5

HP 0.0 0.0 0.0 0.0 0.0 0.0

HD 0.0 0.0 0.0 0.0 0.1 1.7

LHN 0.0 0.0 0.0 0.0 0.0 0.2

AA 0.0 0.1 0.1 0.4 2.4 8.6

RA 0.1 0.2 0.3 0.6 3.4 12.8

TC 0.3 0.7 0.8 1.1 6.0 21.6

Fig. 2.12 ROC curves forTwitter-based results

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate



Table 2.9 AUC ClassesTwitter-based results


C1 TC 0.97

C2 RA, AA 0.96

C3 CN 0.93

C4 HP 0.91

C5 SA 0.90

C6 JA, SO 0.85

C7 HD 0.83

C8 LHN 0.71

Table 2.10 HitRatio (%) for Twitter-based results

HitRatio HitRatio HitRatio HitRatio HitRatio HitRatio

Metric @10 @30 @50 @100 @1000 @5000

CN 2.6 6.1 9.2 13.1 44.1 72.1

SA 0.0 0.0 1.3 5.2 28.4 56.8

JA 1.7 3.1 3.1 7.9 34.1 56.8

SO 0.0 0.0 1.3 4.4 29.3 54.6

HP 0.0 0.0 0.0 0.0 0.4 40.2

HD 0.9 2.6 2.6 2.6 25.3 49.8

LHN 0.0 0.0 0.0 0.0 0.0 5.2

AA 2.6 6.1 9.6 14.4 48.9 78.6

RA 3.1 6.6 8.3 12.2 56.3 86.5

TC 3.9 8.7 11.8 14.8 55.5 86.4

2.7 Conclusions

The prediction of missing links and the prediction of future links is an important taskin the domain of social network analysis. The former helps to infer the “real” socialnetwork structure while the latter is used to give friendship as well as followingrecommendations to users. A wide range of local, global, and semi-local metricshave been proposed by previous work. A large body of existing literature, however,focuses on undirected networks only. This work closes this gap by focusing ondirected networks. Here we propose triad patterns to predict links between nodesin directed graphs. Our approach is called Triadic Closeness. We designed andimplemented a link prediction framework that is able to perform predictions inlarge-scale social networks. The framework’s architecture has been presented anddiscussed in detail. We performed experiments in three different social networks.First, we analyzed the effectiveness of our proposed approach in GitHub; a socialcoding community. Second, we obtained a subset of the GooglePlus network andthird we performed experiments in a subset of the Twitter follower network.

References 29

• The follower structure of GitHub and GooglePlus or Twitter is very different.Thus, the average degree of GitHub is significantly lower than the average degreein GooglePlus and Twitter.

• As expected, each follower network exhibits distinct triad frequencies. Thepresented approach helps to give higher weights to triads that are more frequentlyclosed (resulting in closed triangles).

• The pattern-based prediction approach delivers the best results among thecompared local methods. TC consistently outperformed other approaches. Thus,a pattern-based approach is better suited in directed social networks.

An important aspect will also be the extension of our approach towards globaland semi-local methods. As an example, the personalized PageRank method couldprovide the basis for pattern-aware link prediction. We are currently working onthe design of this method. In addition, we will be comparing machine learningtechniques for link prediction such as random forest and support vector machine(SVM) models.

References

1. L. A. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230,2001.

2. L. M. Aiello, A. Barrat, R. Schifanella, C. Cattuto, B. Markines, and F. Menczer. Friendshipprediction and homophily in social media. ACM Trans. Web, 6(2):9:1–9:33, June 2012.

3. E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochasticblockmodels. J. Mach. Learn. Res., 9:1981–2014, June 2008.

4. U. Alon. Network motifs: theory and experimental approaches. Nature Reviews Genetics,8(6):450–461, June 2007.

5. L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending linksin social networks. In Proceedings of the fourth ACM international conference on Web searchand data mining, WSDM ’11, pages 635–644, New York, NY, USA, 2011. ACM.

6. V. Batagelj and A. Mrvar. A.: A subquadratic triad census algorithm for large sparse networkswith small maximum degree. Social Networks, pages 237–243, 2001.

7. A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learningalgorithms. Pattern Recognition, 30:1145–1159, 1997.

8. M. J. Brzozowski and D. M. Romero. Who should i follow? recommending people in directedsocial networks. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. TheAAAI Press, 2011.

9. A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction ofmissing links in networks. Nature, 453(7191):98–101, May 2008.

10. I. Esslimani, A. Brun, and A. Boyer. Densifying a behavioral recommender system by socialnetworks link prediction methods. Social Netw. Analys. Mining, 1(3):159–172, 2011.

11. Facebook. Online: www.facebook.com (last access June 2015).12. GitHub. Online: www.github.com (last access June 2015).13. GitHub. Online: http://developer.github.com/ (last access June 2015).14. GooglePlus. Online: www.plus.google.com/ (last access June 2015).15. M. Granovetter. The strength of weak ties. The American Journal of Sociology, 78(6):

1360–1380, 1973.

www.facebook.com

www.github.com

http://developer.github.com/

www.plus.google.com/


16. P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. SocialNetworks, 5(2):109–137, June 1983.

17. P. W. Holland and S. Leinhardt. A method for detecting structure in sociometric data. AmericanJournal of Sociology, 76(3):492–513, 1970.

18. G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In Proceedings ofthe eighth ACM SIGKDD international conference on Knowledge discovery and data mining,KDD ’02, pages 538–543, New York, NY, USA, 2002. ACM.

19. G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th internationalconference on World Wide Web, WWW ’03, pages 271–279, New York, NY, USA, 2003. ACM.

20. L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43,Mar. 1953.

21. H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a newsmedia? In Proceedings of the 19th international conference on World wide web, WWW ’10,pages 591–600, New York, NY, USA, 2010. ACM.

22. E. A. Leicht, P. Holme, and M. E. J. Newman. Vertex similarity in networks. Phys. Rev. E,73:026120, Feb 2006.

23. J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in onlinesocial networks. In Proceedings of the 19th international conference on World wide web,WWW ’10, pages 641–650, New York, NY, USA, 2010. ACM.

24. D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Pro-ceedings of the twelfth international conference on Information and knowledge management,CIKM ’03, pages 556–559, New York, NY, USA, 2003. ACM.

25. W. Liu and L. Lu. Link prediction based on local random walk. EPL (Europhysics Letters),89(5):58007, 2010.

26. L. Lu and T. Zhou. Link prediction in complex networks: A survey. Physica A: StatisticalMechanics and its Applications, 390(6):1150–1170, 2011.

27. J. J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In P. L.Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS,pages 548–556, 2012.

28. B. Meng, H. Ke, and T. Yi. Link prediction based on a semi-local similarity index. ChinesePhysics B, 20(12):128902, 2011.

29. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs:Simple building blocks of complex networks. Science, 298(5594):824–827, Oct. 2002.

30. L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing orderto the web, 1999.

31. E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. Hierarchicalorganization of modularity in metabolic networks. Science, 297(5586):1551–1555, Aug. 2002.

32. A. Rettinger, H. Wermser, Y. Huang, and V. Tresp. Context-aware tensor decomposition forrelation prediction in social networks. Social Netw. Analys. Mining, 2(4):373–385, 2012.

33. D. M. Romero and J. M. Kleinberg. The directed closure process in hybrid social-informationnetworks, with an analysis of link formation on twitter. In W. W. Cohen and S. Gosling, editors,ICWSM. The AAAI Press, 2010.

34. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc.,New York, NY, USA, 1986.

35. G. Sautter and K. BâAŽÃaÃuâAŽÃaÃGhm. High-throughput crowdsourcing mechanisms forcomplex tasks. Social Network Analysis and Mining, pages 1–16, 2013.

36. D. Schall. Expertise ranking using activity and contextual link measures. Data Knowl. Eng.,71(1):92–113, 2012.

37. D. Schall. Service Oriented Crowdsourcing: Architecture, Protocols and Algorithms. SpringerBriefs in Computer Science. Springer New York, New York, NY, USA, 2012.

38. D. Schall and F. Skopik. Social network mining of requester communities in crowdsourcingmarkets. Social Netw. Analys. Mining, 2(4):329–344, 2012.

39. T. A. Snijders. Transitivity and Triads. University of Oxford, 2012. Online: http://www.stats.ox.ac.uk/~snijders/Trans_Triads_ha.pdf (last access 22-Feb-2013).

http://www.stats.ox.ac.uk/~snijders/Trans_Triads_ha.pdf

http://www.stats.ox.ac.uk/~snijders/Trans_Triads_ha.pdf

References 31

40. T. Sørensen. A method of establishing groups of equal amplitude in plant sociology basedon similarity of species and its application to analyses of the vegetation on danish commons.Biologiske Skrifter / Kongelige Danske Videnskabernes Selskab, 5(4):1–34, 1957.

41. Stanford. Online: http://snap.stanford.edu/data/index.html (last access January 2014).42. P. Symeonidis and N. Mantas. Spectral clustering for link prediction in social networks with

positive and negative links. Social Network Analysis and Mining, pages 1–15, 2013.43. Twitter. Online: www.twitter.com (last access June 2015).44. S. Wasserman, K. Faust, and D. Iacobucci. Social Network Analysis : Methods and Applica-

tions (Structural Analysis in the Social Sciences). Cambridge University Press, Nov. 1994.45. H. C. White, S. A. Boorman, and R. L. Breiger. Social structure from multiple networks. i.

blockmodels of roles and positions. The American Journal of Sociology, 81(4):730–780, 1976.46. T. Zhou, L. Lu, and Y.-C. Zhang. Predicting missing links via local information. The European

Physical Journal B - Condensed Matter and Complex Systems, 71(4):623–630, Oct. 2009.

http://snap.stanford.edu/data/index.html

www.twitter.com

Chapter 3Follow Recommendation in Communities

Abstract Follower networks provide means for informal information propagation.In this chapter we introduce an approach for recommending relevant users to follow.Our approach is based on the automatic analysis of user behavior and networkstructure. Link-analysis techniques such as PageRank and HITS provide the basisfor a novel recommendation model.

3.1 Social Collaboration Platforms

Social networks have become a central part for many people in their everydayactivities. The type of network used for different activities often varies dependingon the desired purpose. Professional networks such as LinkedIn are used to stayin touch with colleagues and coworkers. Personal social networks including thepopular Facebook platform enable people to engage with their friends and to followtheir news updates. News media and social network services such as Twitter allowpeople to follow short news updates (tweets) of celebrities and friends. Recently,another type of social network has become highly popular attracting millionsof users: online social collaboration networks. These networks enable people tocollectively work on projects. An example of such a social collaboration platform isGithub [1]. GitHub was launched in 2008 and enables people to work on public(open source) or private projects. Indeed, open source development has a longhistory (e.g., see [2]) and dates back to the 1950s and 1960s when IBM releasedsoftware sources of its operating systems and other programs [3].

A novel aspect of recent online social collaboration platforms on the WorldWide Web such as GitHub is that they provide improved support for socialnetworking features such as followers/followings or news feeds based on theusers’ social network. These online collaboration platforms enable users to discoverinteresting projects and repositories more quickly and let people collaborate almostinstantaneously. An intriguing hypothesis was postulated by [4] arguing that GitHubwill be the next big social network driven by what people do instead of who peopleknow. In professional networks such as LinkedIn people are mainly connectedbecause they know each other from, for example, past work experience.


33

34 3 Follow Recommendation in Communities

In networks such as LinkedIn or Facebook friendship is represented as recipro-cated links in an undirected graph. Services such as Twitter and recently GitHub arebased on a directed network approach. A directed network approach allows usersto follow other users based on their interest without requiring them to reciprocatethe relationship. In traditional social networks, some users may be followed bymany people without following many peers themselves (“stars” or “celebrities”).Is this also the case for online social collaboration networks such as GitHub?People in GitHub are mostly followed because they work on interesting projects.Thus, this difference between conventional social networks and online socialcollaboration networks requires a novel follow recommendation approach. Oneimportant aspect in knowledge-intensive disciplines such as software engineeringis to promote the effective dissemination of knowledge [5]. The authors in [6] foundthat formal routines should be supplemented by collaborative and social processesto promote awareness and learning. In our opinion, follower networks provideexcellent means to address the need for effective dissemination of knowledgethrough informal relations and information interest. Following the right personis essential to get information updates from the community leaders and “gurus”.Follow recommendations aim to solve the problem of selecting the right person tofollow.

Follower networks, information flows through re-tweets, and follow recom-mendations have been analyzed in great detail for platform such as Twitter [7–9]or in enterprise social media networks such as WaterCooler [10]. To our bestknowledge, there is no existing work that proposed context-sensitive followingrecommendations in online development networks.

In this chapter we present the following key contributions:

• Who to follow recommendation. Here we propose a method and algorithmfor follow recommendations. Recommendations can be based on behavior,network, or similarity. Our approach is based on network analysis techniques.User relevance with respect to following recommendations is based on a novelauthority metric. Authority in this work means being an expert or guru in aspecific area (e.g., expert/guru in “javascript” programming). The approach isspecifically targeted at online software development communities but may beapplied to other types of collaboration networks as well.

• Social network metrics and evaluation. We analyze our approach by usingsocial network (follower network) and activity data from GitHub. We introducethe used dataset and calculate various metrics such as reciprocity to supportour hypothesis that people in GitHub are mostly followed because they workon interesting projects. The proposed authority-based follow recommendationapproach is evaluated by using the GitHub dataset.

This chapter is structured as follows: Sect. 3.2 discusses relevant related work.Section 3.4 introduces our follow recommendation approach. The data collection

3.2 Background in Online Communities 35

is introduced in Sect. 3.5. Section 3.6 introduces social network metrics and ourevaluation. The chapter is concluded in Sect. 3.7 with an outlook on future research.

3.2 Background in Online Communities

We structure related work into relevant topics including analysis of online develop-ment communities, and social network analysis.

Online Development Communities An online social network is a communicationand collaboration medium that connects a large number of people. People withinthe social network stay together if their interaction dynamics leads to the emer-gent property that is called “community”. Here we focus on online developmentcommunities consisting of people developing collaboratively open source projects.A topological analysis of the SourceForge community was presented in [11]. Thefocus of the work was on role detection of users and cluster analysis. In [12],metrics with regards to open versus private software development were analyzedwith the focus on source code aspects. Measures to investigate the social-technicalcongruence in software development projects were established in [13]. The interplaybetween network metrics, software failures and software evolution was investigatedin, for example [14–16]. Collaboration and influence on GitHub was analyzed in[17] with the central focus on visualization techniques. Interesting directions withregards to the analysis of GitHub were presented in [18]. The authors [18] showedevidence for social collaboration on GitHub and proposed algorithms addressing theteam formation problem. Our authority-based recommendation approach can wellbe used to create expertise profiles that can be used to assist in the formation ofexpert teams [19, 20]. From a technical point of view, the basic structure of theGitHub API and the event schema was described in [21].

In this work, we analyze the GitHub online development community but focuson follower/following recommendations.

Social Network Analysis Social network analysis techniques offer a rich set oftheories (e.g., social network theory, small world phenomenon, power-laws, self-organization, and graph theory) and tools to analyze the topological features andhuman behavior in online communities [7, 22–27].

In many systems, including large-scale enterprises, mostly searchable directoriesor databases that include descriptions of the employees’ knowledge and experienceare used to locate experts. The problem with this approach is that social networksand also big companies are in a constant state of flux and change [28, 29]. Inlarge-scale online communities and dynamic organizations, it becomes infeasibleto constantly review and update the profile information of often rapidly changingexperience, skills, and rolls of experts. Specifically with respect to softwareengineering, the authors in [30] found that a person with the most modificationsto the code may be an expert within a community, and that expertise depends on


the area of the code that is being modified. Furthermore, the “degree of knowledgemodel” [31], showed how that expertise decays with subsequent changes by otherauthors.

We apply well-grounded theories and algorithms to the analysis of large scalesoftware development communities. Well known ranking algorithms to calculateimportance in linked environments include the HITS algorithm [32] and PageR-ank [33]. Personalization techniques such as topic-sensitive PageRank [34, 35]enable context-sensitive importance ranking. Link analysis algorithms have beensuccessfully applied to estimate actor importance in social and collaborativenetworks [27, 36–38] as well as to online mass production systems such as emergingcrowdsourcing environments [39]. The authors in [37] proposed link analysistechniques such as PageRank for expertise mining in online communities. However,no personalization with regards to expertise areas has been performed by prior work.In our previous work [27] we proposed context-sensitive link analysis algorithms forexpertise mining, but did not consider contributions of people to online communities(e.g., contributing source code to online repositories).

We propose a network-based metric to capture authority for follow recommen-dations. The proposed authority metric measures the relative community standing(i.e., reputation) of a community member. The metric is based on how much a personcontributes to the community (e.g., repository commits) and on how many existingfollower relations a person has.

3.3 Recommendation Types

Following/follower recommendations can be performed according to differentstrategies. The authors in [10] suggested three categories: behavioral, similarity,and network. We structure recommendation types for online software developmentcommunities in a similar manner but provide more strategies for network basedrecommendations. The following recommendation types can be distinguished:

• Behavioral. This recommendation type is based on already observed interactionbetween two people. For example, a person may have commented on codechecked in by some other person or a person may have replied to a question (e.g.,usage of library, report of bug, etc.) posted by someone in a discussion forum.Thus, based on the interaction between two people a follow recommendation canbe made.

• Similarity. As often observed in real life, people tend to have friends withsimilar characteristics and interests [40]. This phenomenon is called homophily.Homophily is the principle that a contact between similar people occurs at ahigher rate than among dissimilar people [41]. Shared characteristics includefollower degree and shared interests include watched repositories or interest inprogramming language.

3.4 Follow Recommendation 37

• Network. The network based recommendation type can be based on varioustechniques.

Collaborative filtering [42] is a technique for recommending content to usersbased on other users with similar interest. Collaborative filtering techniquesare commonly seen on e-commerce sites. Applied to our problem domain, ifA watches the same software repositories as B, than A might be interested infollowing B because both share similar interests.

Triadic closure [41] is another concept for network based recommendation.Suppose three community members A, B, and C and social ties between A-Cand B-C. As suggested by Granovetter, in most of these social structures, a triadicclosure occurs such that A and B are likely to become friends (or connected toeach other) the more they associate with C [41].

Network centrality of various types of vertices in a graph can be computedto determine the relative importance of vertices (e.g., methods such as PageR-ank [33] and HITS [32]). In social networks, network centrality techniques canbe used to estimate the importance of users [27]. The application is, for example,expert recommendation.

In the scope of this work, we focus on the network based recommendationtype. The first type (behavioral) and the second type (similarity) are not within thescope of this work. With regards to network based recommendation, we focus on acentrality based approach.

3.4 Follow Recommendation

3.4.1 Authority-Based Recommendation Model

The central concept in this work is user authority. Here we follow a social networkanalysis approach that is based on well established techniques including HITS [32]and PageRank [33]. In this work, somebody is considered to be an authority if s/hehas knowledge in a given area and is recognized by the community. These twofactors are combined in a novel authority metric. The presented techniques andalgorithms build upon our previous work [27, 39]. Our previous work introduceda social network mining framework and context-sensitive expertise ranking algo-rithms. Here we introduce new metrics suitable for online development communitiessuch as GitHub. Here, the basic follow recommendation model is established upona user-repository graph model (in contrast to our previous work in [27]).

To illustrate the basic idea of our follow approach, without going into details, weshow the most fundamental steps in Algorithm 3.

The input of the algorithm is a person P for whom follow recommendations shallbe computed. Thus, a personalization procedure to compute recommendations isneeded. The output of the algorithm is a top-k list of people to follow. The thresholdk can be dynamically adjusted. As a first step, a query Q is created from the person


Algorithm 3 High level follow recommendation algorithm1: input: Person P for whom recommendations should be computed.2: output: Top-k list of recommended people.3: Q ← createQueryFromProfile(P)4: D ← getPrincipleInterestDomains(P)5: for each User u ∈ U do6: if matches(u,D) then7: A(u)← computeAuthorityScore(u,Q)8: end if9: end for

10: return sortAndFilter(A)

P’s profile (function createQueryFromProfile() in line 3). As an example, if personP is interested in “javascript” and “closure”, then s/he may want to follow peoplewho work with such languages. The query would therefore consist of the keywordsQ = “javascript”, “closure”. The detailed mechanisms for extracting the query Qfrom P’s user profile is not detailed in this work.

The next step is to extract principle interest domains for person P. Principleinterest domains are, for example, “web engineering”, “automation”, “embeddedsystems development”, “physics simulations”. A person with background webengineering and interest in javascript-based web frameworks for e-commerce mayor may not be interested in following a person with background “automation” whois working in javascript-based industry monitoring frameworks. The applicationdomains can be very versatile and it may also strongly depend on the person’sinterest to follow somebody from another domain. Here, we highlight the fact thatsuch filtering (see matches(u,D) in line 6 which evaluates to true or false) may beneeded to improve recommendations. However, it would be beyond the scope of thiswork to elaborate on filtering by principle interest domains in detail.

The core focus of this work is depicted by the function computeAuthorityScore().Here the authority scores are computed that are used to perform a ranking ofrecommended users to follow. The function sortAndFilter() truncates the list (ifdesired) and returns a sorted list of people (sorted by highest to lowest authorityscore).

Models to compute authority such as the popular hubs and authority approach[32] are rooted in the link analysis domain to rank Web pages. Here we applylink analysis mechanisms to evaluate user authority with respect to communitycontributions. Detecting authoritative users is not only important for follow recom-mendations but also for identifying key players in the community. These users havestrong impact with regards to community cohesion and evolution. The followingsections details the context-sensitive authority ranking model.


Fig. 3.1 User-repositorygraph model

Users Repositories

Wae

Wbe

Wbf

Wbg

Wcg

a e

f

g

b

c

javascript

java

javascript

java

3.4.2 Personalized Authority Ranking

Here we define the follow recommendation model that is based on link analysistechniques. The basic idea is to compute user authority based on performedrepository actions (e.g., coding activities, bug fixing, etc.). The concepts are depictedby Fig. 3.1.

Suppose the community consists of the set of users U = a,b,c and the setof software repositories R = e, f ,g. The user-repository graph can be modeledas a directed bipartite graph GB(VU,VR,EB) where VU represents the set of usersand VR the set of repositories. An edge (u,r) ∈ EB is established from u to r iffu performs an action in r. An edge between u and r is weighted by wur based onperformed actions. Each repository is associated with a programming language. InFig. 3.1, e and f are associated with javascript and g is associated with java. Auser gains experience in a programming language if the user performs actions inlanguage-related repositories. The programming languages denote the context forour personalized authority ranking approach.

The set of users a,b has gained experience in javascript because they haveperformed actions in the repositories e and f whose language is javascript. The userb is experienced in javascript and java because b has performed actions in e, f ,g.The users a and c are only experience in javascript and java, respectively.

A quite natural and intuitive approach to rank users and repositories in GB is tomodel importance using the notion of hubs and authorities1 as introduced in [32]:

A(u) = ∑(u,r)∈EB

H(r) H(r) = ∑(v,r)∈EB

A(v) (3.1)

1The algorithm introduced by Kleinberg [32] to compute the scores is called Hyperlink InducedTopic Search (HITS).


A user u ∈ VU has high authority if u contributes to important repositories.The authority of u is depicted by the authority score A(u). A repository r ∈ VR

is important if authoritative users contribute to it. The repository importance isdepicted by the hub score H(r). Thus, an important “hub” attracts many authoritativeusers. This recursive definition of user and repository importance provides thebasis for our ranking approach. Other centrality metrics such as PageRank [33]cannot discriminate between two types of scores. However, when compared withPageRank, HITS is less stable to small perturbations [43]. Thus, a combined modelwould bring the advantage that one can compute two scores (property of HITS)and that the algorithm is rank stable (property of PageRank). Here we follow therandomized HITS approach as proposed in [43]:

A(u) = (1−λa)p(u)+λa ∑(u,r)∈EB

H(r) (3.2)

H(r) = (1−λh)p(r)+λh ∑(v,r)∈EB

A(v) (3.3)

Equations (3.2) and (3.3) show a natural way of designing a random-walk basedalgorithm following the HITS model. However, the randomized HITS approachis, like PageRank, stable to small perturbations [43]. The symbols p(u) and p(r)depict personalization vectors that may be assigned uniformly for each node suchthat p(u) = 1

|VU | and p(r) = 1|VR| .

Non-uniform personalization vectors result in personalized rankings (cf. person-alized PageRank [33]). The parameters λa and λh with 0≤ λ ≤ 1 allow for balancingbetween authority/hub weights and personalization weights. A typical value for λis 0.85 [33]. Assigning lower values to λ means that higher importance is given tothe personalization weights; thereby reducing the “network effect” of the rankingalgorithm.

The idea of our personalized authority ranking approach is to compute rankingscores with respect to certain interest areas. The demanded areas of interest arepassed via the keyword based query Q = q1,q2, . . . ,qn to the ranking algorithm.Each query keyword qn corresponds to a desired area of interest. An interest area isidentified via the name of a programming language (for example, Q= “javascript”,“java”). A query returns a ranked list of people based on the demanded set ofinterest areas.

Based on the discussion on personalization techniques and the depicted graphmodel in Fig. 3.1, we refine the hubs and authorities approach as follows:

A(u;Q) = (1−λa)p(u;Q)+λa ∑(u,r)∈EB

wurH(r;Q) (3.4)

H(r;Q) = (1−λh)p(r;Q)+λh ∑(v,r)∈EB

wvrA(v;Q) (3.5)


Using this model, the authority of user u is computed with respect to aspecific context that is given by query Q. A central role in this model plays thepersonalization vector p(u;Q). Notice, also the importance [“hubness” depicted byH(r;Q)] of a repository r is computed with respect to Q. However, since A(u;Q)is personalized by assigning non-uniform weights to p(u;Q) also H(r;Q) will beinfluenced by p(u;Q). Thus, with λh = 1 we define H(r;Q) as:

H(r;Q) = ∑(v,r)∈EB

wvrA(v;Q) (3.6)

We substitute H(r;Q) in Eq. (3.4) and have:

A(u;Q) = (1−λ )p(u;Q)+λ ∑(u,r)∈EB

∑(v,r)∈EB

wurwvrA(v;Q) (3.7)

The next step is to decompose the query Q as follows:

A(u;Q) =(1−λ ) ∑q∈Q

wqp(u;q)+λ

∑(u,r)∈EB

∑(v,r)∈EB

∑q∈Q

wurwvrwqA(v;q)(3.8)

The weight wq is associated with a particular keyword q with wq = 1|Q| for

uniform weights and ∑q wq = 1. In the next steps we apply some of the ideas ofthe PageRank linearity theorem to our proposed ranking model in Eq. (3.8). This ispossible because Eq. (3.8) has a PageRank-like structure. The PageRank linearitytheorem has been originally introduced by [34] to create topic-sensitive importancescores for Web-pages, but has not been applied in follow recommendations.

For constant λ we show that authority can be computed as A(u;Q) =

∑q∈Q wqA(u;q). We restructure Eq. (3.8) to first iterate over each q ∈ Q:

A(u;Q) = ∑q∈Q

wq(1−λ )p(u;q)+ ∑q∈Q

wqλ

∑(u,r)∈EB

∑(v,r)∈EB

wurwvrA(v;q)(3.9)

The final step is shown in Eq. (3.10):

A(u;Q) = ∑q∈Q

wq[(1−λ )p(u;q)+λ

∑(u,r)∈EB

∑(v,r)∈EB

wurwvrA(v;q)]

= ∑q∈Q

wq[A(u;q)

]

(3.10)


The model as depicted by Eq. (3.10) brings the following important benefits:

• Ranking scores for individual interest areas can be precomputed and saved in adatabase. This is typically done periodically in an offline manner.

• At query time, the precomputed ranking scores are retrieved and aggregated to acomposite score. This step is performed online at low computational cost.

The proposed model computes user authority with respect to a certain interestarea by using personalization techniques. This is a different mechanism than (1)performing matching of users based on interest areas and then (2) computingauthority without using personalization techniques. In our approach, authority ofusers is computed by considering all repositories the user has performed actionsand giving preference to those repositories by using interest specific personalizationweights. This mechanism better captures community-wide reputation by not onlycomputing authority based on a small portion of the user-repository graph butinstead using the entire user-repository graph. Thus, our approach follows aPageRank-like model.

In addition, our proposed model is important due to the computational com-plexity and the inability to compute personalized rankings in an online manneronly. For large social networks and collaborative platforms such as the GitHub,the computation of ranking scores may take a long time depending on availablehardware and resources.

3.4.3 Weights and Personalization Metrics

The edge weight wur, which is not personalized or “context” dependent (cf. alsoFig. 3.1 and related discussions), is calculated as follows:

wur =∑t∈T f (u,r, t)

∑z∈R(u)∑t∈T f (u,z, t)(3.11)

The set T denotes event types (i.e., the type of user action). The set R(u) depictsall repositories u is connected to in GB (u’s neighbors). The function f (u,r, t)retrieves the event count of user u in repository r of event type t.

As mentioned before, personalization is done for individual interest areas andranking scores are precomputed offline. Assume α is an interest area and p(u;α) isthe corresponding personalization vector for users. Generally, we assign to p(u;α)an interest area specific weight and non-interest area specific weight. For brevity, letR(u;α) = match(u,α) be the set of u’s repositories matching the interest area α .Equation (3.12) defines the personalization vector p(u;α):

p(u;α) = w1 ∑r∈R(u;α)

wur +w2kin(u;G)/ ∑v∈VU

kin(v;G) (3.12)


with w1 +w2 = 1. The first part ∑r∈R(u;α) wur assigns interest area specific weightsto p(u;α). If a given user performs many actions in interest area related repositories(i.e., R(u;α)) then also the area specific weight will be higher. The interest arearelated weight will be more important than the other part and thus w1 > w2. Thesecond part kin(u;G)/∑v∈VU

kin(v;G) is the non-interest area specific weight wherekin(u;G) depicts u’s indegree (follower count) in the follower graph G. This weightincreases the likelihood that users with high reputation in terms of follower count areranked higher than users that are not followed. However, because higher importanceis given to the interest area specific weight, we ensure that users have primarilyrelevant experience and not only many followers.

3.5 Data Collection

3.5.1 GitHub Community

At a high level, GitHub provides information regarding entities and events througha REST-based API.2 Information regarding entities are, for example, user details(followers, followings, personal user details) or details regarding repositories (e.g.,watchers). Thus, entity information captures the current state of users, repositories,and artifacts. GitHub also provides access to events that describe the dynamic viewand state changes. Events are generally generated by user actions. Technical detailsregarding the REST API interface as well as the event structure can be found in[21] and online at [44]. The GitHub archive [45] offers access to GitHub events.Events can be downloaded in JSON format and inserted into a database for queryprocessing.

We collected GitHub data using the following two methods:

• We retrieved the entire follower/following graph using the GitHub API. Thegraph was obtained in December 2012.

• We further downloaded all GitHub events between March 11, 2012 and Decem-ber 12, 2012 from the GitHub archive [45] amounting for roughly 10 months ofevent data. When performing this research, event data was available starting fromMarch 11, 2012.

Figure 3.2 gives an overview of the different event types3 captured in the10-months time frame. The total number of events captured is 36,094,501. Thenumber of distinct users associated with these events is 1,052,916. The total number

2http://developer.github.com/v3/.3Event Types: 1 PushEvent, 2 CreateEvent, 3 WatchEvent, 4 IssueCommentEvent, 5 IssuesEvent,6 ForkEvent, 7 PullRequestEvent, 8 GistEvent, 9 FollowEvent, 10 GollumEvent, 11 CommitCom-mentEvent, 12 PullRequestReviewCommentEvent, 13 MemberEvent, 14 DeleteEvent, 15 Down-loadEvent, 16 PublicEvent, 17 ForkApplyEvent.

http://developer.github.com/v3/


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 170%

20%

40%

60%

80%EventsDistinct UsersDistinct Repositories

Fig. 3.2 Basic event statistics by type

of distinct software repositories is 2,334,921. The most common event is thePushEvent amounting for 46 % of the overall number of events. A PushEvent is acommit on a repository. Second is the CreateEvent with 14 % that is triggered whenan object (“repository”, “branch”, or “tag”) was created. WatchEvent, IssueCom-mentEvent, IssuesEvent amount for 9 %, 8 %, and 5 % respectively. Detaileddescriptions regarding the semantics of event types can be found online [44]. Asexpected, most events result from actual contributions such as repository commitsand the creation of objects.

Table 3.1 provides a list of most popular programming languages based on thecount of users watching repositories on GitHub. Language popularity is certainlymore biased towards the Linux community where the base technology Git emergedfrom. However, this bias does not have impact on our follow recommendationapproach. On GitHub, javascript is the most popular language with most watchers.It has more than twice as many watchers then the second ranked language.

To understand repository popularity, Fig. 3.3 shows repository watch distribu-tions. Note, all scatter plots in this chapter have logarithmic scale on both thex-axis and y-axis. The actual distribution of the number of repositories shownagainst the number of users is depicted by Fig. 3.3a. The number of languageswatched by how many users is shown by Fig. 3.3b. The depicted distribution inFig. 3.3a follows a power law with N(k) ∼ k−1.97 with 132,446 users watching onerepository and one user watching 28,080 repositories. In total 406,376 repositoriesare watched by 327,222 users. Figure 3.3b the number of users over the number ofwatched languages. The basic characteristics are that 152,577 users watch exactlyone language type and the maximum number of languages is 73 watched by oneactor.


Table 3.1 Top-20 watchedlanguages

Rank Language Watchers ↓1 javascript 1,004,137

2 ruby 401,694

3 python 273,070

4 objective-c 243,615

5 php 212,604

6 java 170,359

7 c 128,394

8 c++ 92,470

9 shell 63,258

10 c# 49,934

11 coffeescript 44226

12 viml 42595

13 scala 26948

14 perl 26757

15 clojure 24525

16 go 20771

17 emacs lisp 13948

18 erlang 13834

19 actionscript 13339

20 haskell 11378

100 103 104 105

Number of Repositories

Num

ber

of U

sers

Watched Repositories:N(k) ~k−1.97

100 101 102

100

102

104

106

100

102

104

106

Number of Languages

Num

ber

of U

sers

101 102

a b

Fig. 3.3 Watched repository and number of languages. (a) Watched repositories. (b) Numberlanguages

3.5.2 User Activity by Location

The next step in our analysis is to map user activity to geographic location. Useractivity is observed through the number of events (regardless of event type). Weextract location information from events and group events by location and repository


USA

UK

Germany

France

Japan

China

Canada

Australia

Russia

Spain

Netherlands

Switzerland

Sweden

India

Brazil

38.3%

10.3%9.1%

7.9%

7.2%

5.5%

Fig. 3.4 Number of events by country

language. We obtained triples of location, programming language, and frequencyof event occurrence. Based on the resulting triples, we perform manual filteringto remove invalid information. Incomplete information (e.g., missing country) hasbeen corrected where possible. The resulting data has been overlaid on a world mapand color-coded by frequency. This will be later shown in Fig. 3.5. In total, we havecollected 758 location, programming language, and frequency of event occurrencetriples based on 5,615,375 events with valid location information.4

First of all, we give a summary of activity by country. Figure 3.4 shows thepercentage of activity by country based on the aforementioned number of events(about five million events). As shown in Fig. 3.4, the USA has the largest share with38.3 % followed by three European countries—the UK, Germany and France—andtwo Asian countries—Japan and China. India, an emerging market for softwaredevelopment, still ranks only at place 14. The next step is to provide detailedlocation information by showing a heatmap based on GitHub events. The heatmap isweighted by event count (best viewed in color online). The resulting map is shownby Fig. 3.5.

Most user activity clearly happens in the USA and in Europe (see Fig. 3.5).Within these regions, activities span also the largest geographic location with eventsoriginating from many different cities. In the USA, most events originate from cities

4Location information is valid if the location can be mapped correctly to a geographic place.GitHub users may provide false location information which we are unable to control or validate.


Fig. 3.5 Mapping user activities to location

located on either the east- or the west-coast. Events in Europe are scattered amongmany different cities. In Russia, for example, events almost exclusively originatefrom Moscow.

3.5.3 Programming Languages

In the following we compare the actor interest across programming languages.The question we attempt to answer is whether actors prefer to watch a pair ofprogramming languages (e.g., java and c#) and what are the similarities betweenlanguages. This is expressed as the overlap of watchers. Suppose the set of users Ul1watches language l1 and the set of users Ul2 watches language l2, than the overlapsimilarity of watchers by language is calculated as:

simwatchers(Ul1 ,Ul2) =|Ul1 ∩Ul2 |

|Ul1 |(3.13)

Figure 3.6 shows a matrix plot using a color grid to depict watcher similarities.We selected the top-20 languages to compare the joint interest of watchers acrossprogramming languages. Notice, the overlap similarity metric used in this context[cf. Eq. (3.13)] is asymmetric and thus the matrix in Fig. 3.6 is also asymmetric.Both axis show programming languages sorted by popularity: left-right (x-axis) andtop-down (y-axis).

Javascript-based repositories have the largest number of watchers and thus mostother languages share a high degree of similarity in terms of common watcherswith javascript (see first column). Scala and clojure have high similarities with java(see column 6), which is not surprising due to their close relationships to java at


2 4 6 8 10 12 14 16 18 201 javascript

2 ruby3 python

4 objective−c5 php6 java

7 c8 c++

9 shell10 c#

11 coffeescript12 viml

13 scala14 perl

15 clojure16 go

17 emacs lisp18 erlang

19 actionscript20 haskell

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Fig. 3.6 Overlap similarities of watchers per language

a technical level (both languages target the java virtual machine as an executionplatform). The languages go, erlang, and haskell have high similarities with the clanguage (see column 7). Other languages have surprisingly low similarities withthe language c# and vice versa (see column and row 10).

Figure 3.7 shows the hierarchical clustering plot which was created based onthe overlap similarities of watchers per language. The “correlation” metric wasused to calculate the pairwise distance between pairs of objects. The agglomera-tive hierarchical cluster tree was created through the average linkage method tomeasure the distance between clusters. Cluster quality was evaluated using thecophenetic correlation coefficient to compare quality against different metrics (e.g.,“correlation” metric versus “city block metric”). The value of the coefficient usingthe “correlation” metric and average linkage was 8.13, which denotes a suitableconfiguration when compared with other configurations (i.e. metrics).

The observation in Fig. 3.7 are consistent with the previous discussions andpresent a good clustering of languages. A low distance presents a high similaritybased on high watcher overlaps. The language with the lowest similarity (c#) isshown at the bottom of the figure.

3.5.4 Follower Graph and Reciprocity

Follower Graph Not only user-repository relations are used in our recommenda-tion model but also follower relations. Therefore, we analyze GitHub’s followergraph. The follower graph counts 1,105,150 users and 1,898,034 following relations


1 javascript2 ruby

11 coffeescript3 python

9 shell 12 viml

7 c8 c++

14 perl16 go

4 objective−c5 php

18 erlang6 java

13 scale15 clojure

17 emacs lisp20 haskell

19 actionscript10 c#

0.3 0.4 0.5 0.6 0.7Distance

Fig. 3.7 Hierarchical clustering of languages

100 105100

102

104

106

Degree: N(k) ~ k−Τ1.64

100 102 104100

102

104

106

100 105100

102

104

106

Followers:N(k) ~k− 1.49

Following:N(k) ~ k−1.77

a b

Fig. 3.8 Follower graph degree distributions. (a) Degree. (b) Followers/following

(edges). Figure 3.8 plots the degree, the indegree (how many followers a givenuser has) as well as the outdegree (how many users a given user is following)distributions of the graph G.

One can observe a typical power-law distribution with degree N(k) ∼ k−1.64,indegree (Followers) with N(k) ∼ k−1.49 and outdegree (Followings) with N(k) ∼k−1.77. Nearly 70 % of users (767,975) have no followers (zero indegree) and againabout 70 % of users (769,283) do not follow any other user (zero outdegree). Power-law degree distributions are an indicator that the GitHub follower graph exhibitssimilar structural characteristics as other social networks.


Reciprocity is the concept in directed social networks that two actors are mutuallyconnected to each other (see for example [22]). In a network such as the followernetwork of GitHub or also Twitter this means that the reciprocal relation between uand v is given if u follows v and v follows u. In the following discussions we will callthe reciprocal neighbors of a given user “reciprocal friends”. We analyze reciprocityin the context of our work because reciprocity information could be a valuablesource for the follow recommendation model. In addition, analyzing reciprocityhelps understanding the primary use of the following feature. The following featurecould be used to maintain personal relations or on the contrary to follow popularcommunity members (“stars” or “gurus” which are typically hubs in the socialnetwork). Depending on the primary use, the recommendation model could beadjusted to the needs of the platform.

The GitHub follower network has very low reciprocity. Only 13 % of userpairs have a reciprocal relationship and 87 % of user pairs have an uni-directionalrelationship. To compare GitHub’s reciprocity with other social network sites:Flickr [26] has a reciprocity of 68 % and a social networking site for personalcommunications operated by Yahoo! [28] has 84 %. These sites have much higherreciprocity. Twitter has also a low reciprocity of only 22 % as reported in [7]. Alsoon GitHub, 86 % of users are not followed by their followings.

These observations with regards to reciprocity yield the conclusion that theGitHub follower graph is to a large degree comparable to an information servicesuch as Twitter rather than a typical social (friend) network. A possible explanationfor this behavior is that on GitHub many users follow other popular users to getupdates regarding their coding activity instead of maintaining personal relations.Our recommendation model is well suited for such a social network becauseit is based on the concept of authority and reputation. If reciprocity would behigher, other recommendation techniques such as the triadic closure model utilizing(existing) personal relations may be considered as well.

Based on the presented GitHub data, we performed various ranking experiments,which are shown next.

3.6 Evaluation

We perform ranking experiments to analyze the quality of the follow recommenda-tion approach. We use the previously developed recommendation model as detailedSect. 3.4. The goal of our evaluation is to understand the quality and impact ofthe personalization techniques. This is done by comparing the top-k results ofdifferently personalized rankings and their results. Since all data and user profilesare public available, we perform ranking and check the GitHub homepages andactivities of the top-ranked users. The results can also be easily verified by thereader.

For λ we use a value of λ = 0.9 for all experiments. A usual value for λ is withinthe range 0.8 ≤ λ ≤ 0.9. The metric weights w1 and w2 are set to w1 = 0.9 and

3.6 Evaluation 51

Table 3.2 Top-10 followrecommendations withoutpersonalization

Rank User kout Followers

1 mojombo 9 8,570

2 torvalds 2 8,377

3 defunkt 21 8,373

4 schacon 19 5,748

5 paulirish 61 5,694

6 jeresig 3 4,925

7 pjhyett 1 4,509

8 ryanb 139 4,081

9 android 86 3,889

10 visionmedia 299 3,725

w2 = 0.1; thereby giving higher importance to expertise area specific personalizationweights. The weights and personalization are calculated based on the frequency ofthe various GitHub events.

3.6.1 Recommendations Without Personalization

The first results are calculated without personalization (i.e., the interest area-specificmetrics and personalization weights are all zero). The top-10 results are shown byTable 3.2. The actual GitHub user profile can be found online using the link: https://github.com/User.

The top-10 list in Table 3.2 gives a list of distinguished software engineers andcommunity contributors. Users at rank 1 (mojombo), 3 (defunkt), 4 (schacon), and7 (pjhyett) are GitHub staff members and, not surprisingly, rank among the top.The Linux father torvalds ranks at position 2. At positions 5 (paulirish) ranks afront-end developer and Google Chrome developer relations engineer. At 6 (jeresig)ranks the creator of the jQuery javascript library. At 8 (ryanb) ranks a producerof ruby on rails screencasts and at 9 (android) ranks the “android” user associatedwith the android framework, kernel, and system core. At 10 (visionmedia) ranks ajavascript contributor. The third column in Table 3.2 shows the outdegree kout in GB

(i.e. the number of repositories the user is involved in). There is a correspondencebetween rank and number of followers. It becomes already apparent that thetop-10 ranked users in Table 3.2 may have very diverse expertise areas withregards to programming languages. We observe a mixture of “javascript”, “c”, and“java” experts.

https://github.com/User

https://github.com/User


3.6.2 Personalized Recommendations

Thus, as the next step we take full advantage of personalization and perform rankingfor the interest area “javascript”. Table 3.3 shows the top-10 ranking results. Thesecond column shows users and repositories. For each user we provide the top-5repositories by number of user contributions.

The repository language (if available) is provided in parenthesis next to therepository name. The actual number of repository actions has been omitted forprivacy reasons. Furthermore, kout depicts the outdegree of a user in GB (the numberof repositories a user has been working on), kin depicts the repository indegree(the number of contributors), and the column Followers showing the numbers offollowers.

By looking at the results in Table 3.3, two aspects become immediately appar-ent:

1. The top-10 list is populated by users that contribute to popular javascriptrepositories.

2. The list is no longer correlated in a strict order by the number of followers.

Both observations demonstrate the intended behavior of our ranking algorithm.Not only high reputation in terms of number of followers is a predominant factor,but also the number of contributions to relevant repositories boosts user authority.By doing so, users gain expertise in a given interest area and are thus ranked at highpositions by our ranking algorithm.

Looking at the actual ranked list, the users ranked at 1 (fat) and 2 (caniszczyk)are employees at Twitter and provide contributions to the most popular javascriptrepository. Ranked at position 3 (mdo) is a designer who is employed at GitHub.What all users have in common is that they are actively engaged in javascriptdevelopment and are also mostly followed by many people. However, this is nota requirement to be ranked at a top position (for example, see user ranked at 2 whohas 91 followers).

To evaluate other interest areas and combinations of languages, we have per-formed ranking for the keywords “java”, “scale”, “closure” (see Table 3.4) andalso “objective-c”, “c” (see Table 3.5). Due to space constraints, we provide onlythe top-3 ranked users as follow recommendation.

As clearly visible in Table 3.4, the top-rankings change in favor of the demandedinterest areas (i.e., java and related languages). All top-3 ranked actively contributeto java-related repositories. The first ranked user is also popular in terms ofnumber of followers. Second ranked is the Apache Software Foundation, whichis clearly a top-authority in java software development. The GitHub user apachecan be regarded as a proxy for the real persons behind the account. The third usercontributes to similar or the same repositories (e.g., “storm”) as the first rankedusers. Storm is a distributed realtime computing system in some features similar toHadoop. Although present in the query, scala repositories are not listed among thetop-3 (top-listed) repositories. Finally, we show the results for ‘objective− c′, ‘c′in Table 3.5.

3.6 Evaluation 53

Table 3.3 Top-10 follow recommendations for “javascript”

Rank User and top-5 repositories kout / kin Followers

1 fat 30 1,597

/twitter/bootstrap (js) 4

/twitter/bower (js) 8

/maker/ratchet js 3

/twitter/hogan.js (js) 3

/twitter/recess (js) 2

2 caniszczyk 34 91



/twitter/ambrose (js) 4

/twitter/twitter.github.com (js) 1

/twitter/bootstrap-server (js) 3

3 mdo 6 879


/twitter/bootstrap-server (js) 3

/mdo/github-buttons 1

/mdo/code-guide 1

/mdo/sublime-snippets 1

4 paulirish 61 5,694

/h5bp/html5-boilerplate (js) 8


/yeoman/yeoman js) 7

/h5bp/mobile-boilerplate (js) 4

/paulirish/infinite-scroll (js) 3

5 addyosmani 46 3,110

/addyosmani/todomvc (js) 3


/addyosmani/backbone-fundamentals (js) 1

/addyosmani/jquery-ui-bootstrap (js) 3

/yeoman/yeoman (js) 7

6 visionmedia 299 3,725

/visionmedia/express (js) 1

/visionmedia/jade (js) 1

/visionmedia/mocha (js) 2

/senchalabs/connect (js) 1

/component/component (js) 2

7 jzaefferer 53 302

/jquery/jquery (js) 10

/jquery/jquery-mobile (js) 14

/jquery/jquery-ui (js) 9

/jzaefferer/jquery-validation (js) 2

/jquery/qunit (js) 5

(continued)


Table 3.3 (continued)


8 wycats 43 3,259


/rails/rails (ruby) 21

/emberjs/ember.js (ruby) 7

/wycats/handlebars.js (js) 3

/tildeio/rsvp.js (js) 3

9 scottgonzalez 47 261


/jquery/jquery-mobile (js) 14

/jquery/jquery-ui (js) 9

/jquery/qunit (js) 5

/jquery/plugins.jquery.com (js) 3

10 mbostock 17 1,477

/mbostock/d3 (js) 1

/square/cubism (js) 1

/square/crossfilter (js) 1

/square/cube (js) 2

/d3/d3-plugins (js) 6

Table 3.4 Top-3 follow recommendations for “java”, “scala”, and “clojure”


1 nathanmarz 23 857

/nathanmarz/storm (java) 2

/nathanmarz/cascalog (clojure) 2

/nathanmarz/storm-starter (java) 1

/nathanmarz/storm-contrib (java) 9

/nathanmarz/storm-deploy (clojure) 2

2 apache 242 0

/apache/cassandra (java) 1

/apache/incubator-cordova-android (java) 1

/apache/mahout (java) 1

/apache/lucene-solr (java) 1

/apache/hadoop-common (java) 1

3 jasonjckn 9 21

/nathanmarz/storm (java) 2

/nathanmarz/storm-contrib (java) 9

/nathanmarz/storm-deploy (clojure) 2

/nathanmarz/storm-mesos (java) 2

/jasonjckn/storm 1

3.7 Conclusions 55

Table 3.5 Top-3 follow recommendations for “objective-c” and “c”

Rank User and top repositories kout / kin Followers

1 soffes 45 1,100

/nothingmagical/cheddar-ios (objective-c) 2

/soffes/sstoolkit (objective-c) 1

/soffes/sspulltorefresh (objective-c) 1

/soffes/sskeychain (objective-c) 1

/soffes/ssziparchive (c) 1

2 torvalds 2 8,377

/torvalds/linux (c) 1

/torvalds/subsurface (c) 2

3 php-pulls 23 2

/php/php-src (c) 2

/php/systems (c) 2

/php/pecl-networking-mqseries (c) 2

/php/web-php (php) 2

/php/php-gtk-src (c++) 1

Here we see also a similar behavior as previously. Personalization results inrankings of users who are experienced in the demanded interest areas. Top ranked isa popular objective-c developer (according to his online user profile) who has alsomany followers. Second ranked is torvalds who has undoubtedly high expertise in“c”. Third is also a proxy-account php-pulls. Similar to apache, the account is aproxy for the real people (experts) behind it.

To summarize the results of our follow recommendation model, users within thetop-k ranking results are very active in the context of the specified interest areas bycontributing to popular software repositories. This is the main goal of our authority-based recommendation approach. Thus, we conclude that our model is well suitedfor personalized, context-sensitive follow recommendations.

3.7 Conclusions

Online software development has taken a new path where social networking featurescan be used to discover users and repositories. These features help users to stayup-to-date regarding new development efforts and community activities. Here weproposed a novel follow recommendation approach that is based on the conceptof user authority. Instead of simply matching users by static skill profiles, weproposed a network-centric approach taking a user’s community engagement as wellas social metrics into account. We have systematically derived a mathematicallysound model to measure user authority based on activity (e.g., repository commits)and community reputation (follower degree).


The presented concepts have been evaluated using a real world dataset. Todate, GitHub is the most popular platform offering collaboration features, Wikis,development related tools (e.g., issue tracking) and social networking features suchas following. We have obtained a GitHub-based dataset including the followergraph and relevant user actions. Based on the dataset, we have performed a numberof experiments to test the proposed recommendation approach. Results show thatour personalized follow recommendation approach delivers better recommendationsthan non-personalized recommendations.

In our future work, we will study more fine grained repository actions forfollower recommendations. This includes analyzing the detailed location of changesin the source code as well as details regarding criticality of bugs fixes.

References

1. Github.com, Github website (last access June 2015). URL www.github.com2. G. Madey, V. Freeh, R. Tynan, The open source software development phenomenon: An analy-

sis based on social network theory, in: Americas conf. on Information Systems (AMCIS2002),2002, pp. 1806–1813.

3. F. M. Fisher, R. B. Mancke, J. W. McKie, IBM and the U.S. data processing industry : aneconomic history, Praeger, New York, N.Y., U.S.A. :, 1983.

4. A. W. Kosner, Github is the next big social network, powered by what you do, not who youknow (July 2012). URL http://onforb.es/PX02oJ

5. F. O. Bjørnson, T. Dingsøyr, Knowledge management in software engineering: A systematicreview of studied concepts, findings and research methods used, Inf. Softw. Technol. 50 (11)(2008) 1055–1068.

6. R. Conradi, T. Dybå, An empirical study on the utility of formal routines to transferknowledge and experience, SIGSOFT Softw. Eng. Notes 26 (5) (2001) 268–276. doi:10.1145/503271.503246. URL http://doi.acm.org/10.1145/503271.503246

7. H. Kwak, C. Lee, H. Park, S. Moon, What is twitter, a social network or a news media?, in:Proceedings of the 19th international conference on World wide web, WWW ’10, ACM, NewYork, NY, USA, 2010, pp. 591–600.

8. J. Weng, E.-P. Lim, J. Jiang, Q. He, Twitterrank: finding topic-sensitive influential twitterers,in: Proceedings of the third ACM international conference on Web search and data mining,WSDM ’10, ACM, New York, NY, USA, 2010, pp. 261–270.

9. P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, R. Zadeh, Wtf: the who to follow service attwitter, in: Proceedings of the 22nd international conference on World Wide Web, WWW ’13,2013, pp. 505–514.

10. M. J. Brzozowski, D. M. Romero, Who should i follow? recommending people in directedsocial networks, Tech. rep., HP Labs (2011).

11. J. Xu, Y. Gao, S. Christley, G. Madey, A topological analysis of the open source softwaredevelopment community, in: System Sciences, 2005. HICSS ’05. Proceedings of the 38thAnnual Hawaii International Conference on, 2005, p. 198a.

12. J. Paulson, G. Succi, A. Eberlein, An empirical study of open-source and closed-sourcesoftware products, Software Engineering, IEEE Transactions on 30 (4) (2004) 246–256.

13. G. Valetto, M. Helander, K. Ehrlich, S. Chulani, M. Wegman, C. Williams, Using softwarerepositories to investigate socio-technical congruence in development projects, MSR ’07, IEEEComputer Society, Washington, DC, USA, 2007, pp. 25–.

www.github.com

http://onforb.es/PX02oJ

http://dx.doi.org/10.1145/503271.503246

http://doi.acm.org/10.1145/503271.503246

References 57

14. M. Pinzger, N. Nagappan, B. Murphy, Can developer-module networks predict failures?, in:Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of softwareengineering, SIGSOFT ’08/FSE-16, ACM, New York, NY, USA, 2008, pp. 2–12.

15. T. Zimmermann, N. Nagappan, Predicting defects using network analysis on dependencygraphs, in: Proceedings of the 30th international conference on Software engineering, ICSE’08, ACM, New York, NY, USA, 2008, pp. 531–540.

16. P. Bhattacharya, M. Iliofotou, I. Neamtiu, M. Faloutsos, Graph-based analysis and predictionfor software evolution, in: M. Glinz, G. C. Murphy, M. Pezzè (Eds.), ICSE, IEEE, 2012,pp. 419–429.

17. B. Heller, E. Marschner, E. Rosenfeld, J. Heer, Visualizing collaboration and influence in theopen-source software community, in: Proceedings of the 8th Working Conference on MiningSoftware Repositories, MSR ’11, ACM, New York, NY, USA, 2011, pp. 223–226.

18. A. Majumder, S. Datta, K. Naidu, Capacitated team formation problem on social networks, in:Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery anddata mining, KDD ’12, ACM, New York, NY, USA, 2012, pp. 1005–1013.

19. T. Lappas, K. Liu, E. Terzi, Finding a team of experts in social networks, in: Proceedings ofthe 15th ACM SIGKDD international conference on Knowledge discovery and data mining,KDD ’09, ACM, New York, NY, USA, 2009, pp. 467–476.

20. A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, S. Leonardi, Online team formationin social networks, in: Proceedings of the 21st international conference on World Wide Web,WWW ’12, ACM, New York, NY, USA, 2012, pp. 839–848.

21. G. Gousios, D. Spinellis, Ghtorrent: Github’s data from a firehose, in: M. Lanza, M. D. Penta,T. Xi (Eds.), MSR, IEEE, 2012, pp. 12–21.

22. S. Wasserman, K. Faust, Social Network Analysis: Methods and Applications, CambridgeUniversity Press, Cambridge, 1994.

23. D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (6684)(1998) 440–442.

24. M. E. J. Newman, J. Park, Why social networks are different from other types of networks,Phys. Rev. E 68 (2003) 036122.

25. J. Leskovec, E. Horvitz, Planetary-scale views on a large instant-messaging network, in:Proceedings of the 17th international conference on World Wide Web, WWW ’08, ACM, NewYork, NY, USA, 2008, pp. 915–924.

26. M. Cha, A. Mislove, K. P. Gummadi, A measurement-driven analysis of informationpropagation in the flickr social network, in: Proceedings of the 18th international confer-ence on World wide web, WWW ’09, ACM, New York, NY, USA, 2009, pp. 721–730.doi:10.1145/1526709.1526806.

27. D. Schall, Expertise ranking using activity and contextual link measures, Data Knowl. Eng.71 (1) (2012) 92–113.

28. R. Kumar, J. Novak, A. Tomkins, Structure and evolution of online social networks,in: Proceedings of the 12th ACM SIGKDD international conference on Knowledge dis-covery and data mining, KDD ’06, ACM, New York, NY, USA, 2006, pp. 611–617.doi:10.1145/1150402.1150476.

29. D. Nevo, I. Benbasat, Y. Wand, Who knows what? (Oct. 2009). URL http://sloanreview.mit.edu/executive-adviser/2009-4/5147/who-knows-what/

30. A. Mockus, J. D. Herbsleb, Expertise browser: a quantitative approach to identifying expertise,in: Proceedings of the 24th International Conference on Software Engineering, ICSE ’02,ACM, New York, NY, USA, 2002, pp. 503–512. doi:10.1145/581339.581401. URL http://doi.acm.org/10.1145/581339.581401

31. T. Fritz, J. Ou, G. C. Murphy, E. Murphy-Hill, A degree-of-knowledge model to capturesource code familiarity, in: Proceedings of the 32nd ACM/IEEE International Conference onSoftware Engineering - Volume 1, ICSE ’10, ACM, New York, NY, USA, 2010, pp. 385–394.doi:10.1145/1806799.1806856. URL http://doi.acm.org/10.1145/1806799.1806856

32. J. M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM 46 (5) (1999)604–632.

http://dx.doi.org/10.1145/1526709.1526806

http://dx.doi.org/10.1145/1150402.1150476

http://sloanreview.mit.edu/executive-adviser/2009-4/5147/who-knows-what/

http://sloanreview.mit.edu/executive-adviser/2009-4/5147/who-knows-what/

http://dx.doi.org/10.1145/581339.581401

http://doi.acm.org/10.1145/581339.581401

http://doi.acm.org/10.1145/581339.581401

http://dx.doi.org/10.1145/1806799.1806856

http://doi.acm.org/10.1145/1806799.1806856


33. L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: Bringing order tothe web, Tech. rep., Stanford Digital Library Technologies Project (1998).

34. T. H. Haveliwala, Topic-sensitive pagerank, in: WWW ’02, ACM, New York, NY, USA, 2002,pp. 517–526.

35. G. Jeh, J. Widom, Scaling personalized web search, in: WWW ’03, ACM, New York, NY,USA, 2003, pp. 271–279.

36. D. Schall, Measuring contextual partner importance in scientific collaboration networks,Journal of Informetrics 7 (3) (2013) 730–736. doi:http://dx.doi.org/10.1016/j.joi.2013.05.003.URL http://www.sciencedirect.com/science/article/pii/S1751157713000461

37. J. Zhang, M. S. Ackerman, L. Adamic, Expertise networks in online communities: structureand algorithms, in: Proceedings of the 16th international conference on World Wide Web,WWW ’07, ACM, New York, NY, USA, 2007, pp. 221–230.

38. L. A. Adamic, J. Zhang, E. Bakshy, M. S. Ackerman, Knowledge sharing and yahoo answers:everyone knows something, in: Proceedings of the 17th international conference on WorldWide Web, WWW ’08, ACM, New York, NY, USA, 2008, pp. 665–674.

39. D. Schall, Service Oriented Crowdsourcing: Architecture, Protocols and Algorithms, SpringerBriefs in Computer Science, Springer New York, New York, NY, USA, 2012.

40. M. McPherson, L. S. Lovin, J. M. Cook, Birds of a feather: Homophily in social networks,Annual Review of Sociology 27 (1) (2001) 415–444.

41. M. S. Granovetter, The strength of weak ties, The American Journal of Sociology 78 (6) (1973)1360–1380.

42. D. Goldberg, D. Nichols, B. M. Oki, D. Terry, Using collaborative filtering to weave aninformation tapestry, Commun. ACM 35 (12) (1992) 61–70. doi:10.1145/138859.138867.URL http://doi.acm.org/10.1145/138859.138867

43. A. Y. Ng, A. X. Zheng, M. I. Jordan, Stable algorithms for link analysis, in: Proceedings of the24th annual international ACM SIGIR conference on Research and development in informationretrieval, SIGIR ’01, ACM, New York, NY, USA, 2001, pp. 258–266.

44. Github.com, Github event types. URL http://developer.github.com/v3/activity/events/types/45. I. Grigorik, Github archive (last access June 2015). URL www.githubarchive.org

http://dx.doi.org/http://dx.doi.org/10.1016/j.joi.2013.05.003

http://www.sciencedirect.com/science/article/pii/S1751157713000461

http://dx.doi.org/10.1145/138859.138867

http://doi.acm.org/10.1145/138859.138867

http://developer.github.com/v3/activity/events/types/

www.githubarchive.org

Chapter 4Partner Recommendation

Abstract In this chapter we present a novel approach for measuring and combingvarious criteria for partner importance evaluation in scientific collaboration net-works. The presented approach is cost sensitive, aware of temporal and context-based partner authority, and takes structural information with regards to structuralholes into account. The applicability of the proposed approach and the effectsof parameter selection are extensively studied using real data from the EuropeanUnion’s research program.

4.1 Social Network-Based Collaboration

Scientific collaboration in an international environment takes place among partnerssuch as organizations, universities or research institutes to jointly perform projects.The main motivation for organizations and individual research groups to collaborateis to enable knowledge and resource sharing to effectively perform research projects.Scientific collaboration can be defined as interaction taking place within a socialcontext among two or more scientists that facilitates the sharing of meaning andcompletion of tasks with respect to a mutually shared, superordinate goal [1].

However, the success of research and innovation is based on the right balancebetween cooperation and competition. Hence, formation of coalitions and consortiais influenced by partner reputation [2], institutional constraints, and mechanismof self-organization [3]. Scientific collaboration can be analyzed at the level ofresearchers through co-authorship and citation networks [4–6] or at the level oforganizations or research institutions [7]. The former has been widely studied byexisting research while the latter lacks a principled approach for selecting andaggregating ranking criteria that may be influenced by context. Generally, scien-tific collaboration and endorsement can be analyzed according to three differentmethods [8]: (1) qualitative methods such as using a questionnaire-based approach,(2) bibliometric methods including publication and citation counting or co-citationanalysis, and (3) complex network methods including network centrality metricssuch as PageRank [9] or Hyperlink Induced Topic Search (HITS) [10]. Here wefocus on the analysis of scientific collaboration at the organizational or institutionallevel. We apply complex network methods to automate the analysis of partnerimportance in scientific collaboration. In this work, importance is a concept that


59

60 4 Partner Recommendation

is governed by multiple factors including average cost of a partner, temporal trendand context of partner authority, and partner importance with regards to effectivesize of the partner’s social network. Effective size in the context of structural holesand social networks means low redundancy among social contacts thereby yieldingcontrol benefits of individuals. Here we apply a similar principle but focus on theorganizational level rather than individuals in social networks.

In our previous work [11] we introduced an approach for measuring contextualimportance in scientific collaboration networks. In this work, we build upon ourprevious work [11] but significantly expand the concepts. Here we provide thefollowing novel key contributions:

• We introduce a personalized partner authority model that is able to capturecontext-dependent and time-aware partner reputation.

• We introduce a model to measure structural importance of organizations embed-ded in scientific collaboration networks. The idea of our structural importancemetric is drawn from the notion of structural holes as established in a sociologicalresearch context.

• To support partner selection using multiple-criteria, the factors contributing toa partner importance are aggregated through a systematic approach to a singlepartner importance ranking score. Here we apply analytic hierarchy process(AHP) to derive the partner importance score.

• We present experimental results by providing a comprehensive study on theinfluence of different parameters using real data from the EU’s Seventh Frame-work Programme (FP7) for research in Information and Communication Tech-nology (ICT).

This chapter is structured as follows. Section 4.2 gives an overview of relatedwork and literature in the context of network formation and network analysis. InSect. 4.3 our personalized partner authority model is introduced. Section 4.4 intro-duces the structural importance model and Sect. 4.5 details the software frameworkand the analytic hierarchy process to compute the final partner importance scores. InSect. 4.6 the evaluation results are presented followed by the conclusion and outlookto future work in Sect. 4.7.

4.2 Background in Strategic Formation

We structure this background section into two basic areas: network formationin the context of collaborative environments and network analysis methods withparticular emphasis on authority ranking. From a technique point of view, manyapproaches found in both network formation and network analysis methods forauthority ranking are based on graph theory and algorithms. In this section, wereview literature in both areas as they will provide the foundation for our work.

4.2 Background in Strategic Formation 61

Network Formation The rapid advancement of ICT-enabled infrastructure hasfundamentally changed how businesses and companies operate. Global marketsand the requirement for rapid innovation demand for alliances between individualcompanies [12]. It is widely agreed that knowledge of the structure of interactionamong individuals or organizations is important for a proper understanding of anumber of important questions such as the spread of new ideas and technologiesand competitive strategies in dynamic markets [13]. Work by [14] investigatedthe evolutionary dynamics of network formation by analyzing how organizationalunits create new linkages for resource exchange. The potential gains from bridgingdifferent parts of a network were important in the early work of Granovetter [15]and are central to the notion of structural holes developed by Burt [16, 17]. Thetheory is based on the hypothesis that individuals can benefit from serving asintermediaries between others who are not directly connected. A formal approach tostrategic formation based on advanced game-theoretic broker incentive techniqueswas presented in [18]. In [19] group formation in social networks is studied.

Network Analysis We propose a model for importance that is based on well-established techniques such as the notion of hubs and authorities [10] and PageR-ank [9]. PageRank can be personalized [9] to estimate node importance with regardsto certain topics [20–22]. After the seminal work of [9] and the far-reachingwork of [21], related research (see also [23]) addressed, for example, efficientcomputation of personalized PageRank [24, 25] and a generalization of personalizedPageRank towards bipartite graphs [26]. In [27], the authors proposed time-aware authority ranking by considering temporal properties of scientific publicationactivity. Our previous work addressed PageRank personalization techniques forexpertise ranking in a social network context [28, 29].

In this work, we propose a new framework which utilizes both information fromstructural holes and authority importance scores to discover valuable collaborationpartners. Here we propose a unified HITS/PageRank model that is able to measurenetwork importance at the individual as well as the organizational or institutionallevel with respect to a certain context. In contrast to existing rankings such asthe Shanghai academic ranking,1 our approach is able to capture importance ata fine grained contextual level. Our approach is able to utilize various additionalranking parameters including desirable partner properties (e.g., high topic-sensitiveauthority) and low undesirable partner properties (e.g., partner costs). At the coreof this framework are link-based algorithms such as HITS and extensions towardspersonalized, time-aware PageRank, structural metrics to measure the brokeragepotential of a given network node, and an analytic hierarchy process (AHP)algorithm [30] to aggregate these metrics into a single ranking score.

1www.shanghairanking.com.

www.shanghairanking.com


The proposed model is tested with data from the ICT research projects havingreceived grants under the EU’s FP7 program. The data as described in [31] andcovers a period from 2007 to 2011.

4.3 Reputation Model

Here we formalize the notion of organization authority as it will be used inour ranking model. Authority is automatically calculated using network analysistechniques. The novelty of the approach is that authority is put into context byconsidering topic information. Well-established models provide the foundationalconcepts and basis. Specifically, we base our approach upon the model of hubs andauthorities as developed by [10].

4.3.1 Basic Definitions

We start with a definition of basic concepts that are used throughout this work. Let usconsider a simple collaboration scenario in a scientific community where individualpartners (e.g., organizations, research institutes, and universities) collaborate in thecontext of research projects.

Figure 4.1 shows a set of organizations o1,o2,o3 and a set of research projectsp1,p2,p3. Each project is associated with a certain topic that determines thecontext of the performed collaboration (for example, “services” or “internet”).Organizations are involved in projects by having certain roles. Roles includeproject coordinator and project partner. In addition to the involvement relation, aweighted edge is created from the project to the organization to depict the degreeof involvement. For example, o1 is involved in projects p1 and p2 with weights

O1

Services internet

O2

O3

p1 p2p3

weightrelation

collaborationrelation

involvementrelation

communitytopic

project

organization

Symbols:

W11

W12W21

W22 W23

W32

W33

a b

Fig. 4.1 Scientific collaboration environment and definitions. (a) Scientific collaboration environ-ment. (b) Legend

4.3 Reputation Model 63

w11 and w21 respectively. In our work, the weight will be based on the funding anorganization receives in the context of a project. More funding typically means thatan organization is able to allocate more (human) resources to the project and therebyperforming more work. Finally, based on joint projects performed by organizationswe model collaboration relations among them. Since o1 and o2 have been involved inthe joint projects p1 and p2, a collaboration relation between o1 and o2 is establishedas a dashed line. Similarly, o2 and o3 have been involved in the joint projects p2 andp3 and therefore a collaboration relation between o2 and o3 is established. Also, acollaboration relation between o1 and o3 exists because they jointly worked on p2.A collaboration relation is a mutual (undirected) edge.

4.3.2 Hubs and Authorities

Let us apply the notion of hubs and authorities to a collaboration environment asdepicted by Fig. 4.1. A project is regarded to be important if the organizationscontributing to it are also regarded to be important (e.g., knowledgeable andreputable). In turn, the importance of an organization is based on its involvement inimportant projects. This is a recursive definition of importance and can be modeledby using the intuitive notion of hubs and authorities as proposed by [10].

A(o) = ∑(p,o)∈EP

H(p) H(p) = ∑(p,u)∈EP

A(u) (4.1)

In the model, an organization o obtains an authority score depicted by A(o) anda project p obtains a hub score denoted by H(p). The drawback of this model isthe “stability” of rankings. A ranking algorithm is stable if the algorithm returnssimilar results upon small disturbances. We follow the randomized HITS approachas proposed in [32] and expand the equations in Eq. (4.1) as follows:

A(o) = (1−λa)δO(o)+λa ∑(p,o)∈EP

H(p) (4.2)

H(p) = (1−λh)δP(p)+λh ∑(p,u)∈EP

A(u) (4.3)

This adjusted model is a natural way of designing a random-walk based algorithmfollowing the HITS model. The randomized HITS approach is, like PageRank, sta-ble to small perturbations [32]. The symbols δO(o) and δP(p) depict personalizationvectors that may be assigned uniformly for each node such that δO(o) = 1

|VO| and

δP(p) = 1|VP| . Non-uniform personalization vectors result in personalized rankings.

The parameters λa and λh with 0≤ λ ≤ 1 allow for balancing between authority/hubweights and personalization weights. A typical value for λ is 0.85 [9]. Assigninglower values to λ means that higher importance is given to the personalizationweights; thereby reducing the “network effect” of the ranking algorithm.


4.3.3 Query-Sensitive Personalization

Let us define the query-sensitive authority score:

A(o;Q) = (1−λa)δO(o;Q)+λa ∑(p,o)∈EP

wpoH(p;Q) (4.4)

Similarly, let us define the query-sensitive hub score:

H(p;Q) = (1−λh)δP(p;Q)+λh ∑(p,u)∈EP

wpuA(u;Q) (4.5)

The edge weights wpo and wpu are based on the organizations’ degree of projectinvolvement. Particularly, the weight wpo is based on the funding received byorganization o in project p and is calculated as

wpo =funding(p,o)

∑v∈adj(p) funding(p,v)(4.6)

where adj(p) depicts the set of nodes adjacent to p (i.e., the set of organizationsinvolved in project p). To compute authority scores using a single equation, whichis the desired goal of our approach, we substitute H(p;Q) in Eq. (4.4) by Eq. (4.5)and have:

A(o;Q) = (1−λa)δO(o;Q)+λa(1−λh) ∑(p,o)∈EP

wpoδP(p;Q)

+λaλh ∑(p,o)∈EP

∑(p,u)∈E

wpowpuA(u;Q)(4.7)

Based on Eq. (4.7), let us define the personalization vector δ ′O(o;Q) as follows:

δ ′O(o;Q) =

1−λa

1−λhδO(o;Q)+λa ∑

(p,o)∈EP

wpoδP(p;Q) (4.8)

If we use the same parameter values for λa and λh [due to symmetry of Eqs. (4.4)and (4.5)] such that λa = λh, Eq. (4.8) simplifies to:

δ ′O(o;Q) = δO(o;Q)+λ ∑

(p,o)∈EP

wpoδP(p;Q) (4.9)

In the following step we rewrite Eq. (4.7) by using the personalization vectorp′(u;Q) as defined in Eq. (4.9).

A(o;Q) = (1−λ )δ ′O(o;Q)+λ 2 ∑

(p,o)∈EP

∑(p,u)∈EP

wpowpuA(u;Q) (4.10)

4.3 Reputation Model 65

As one can see, Eq. (4.10) has a PageRank-like structure. An important conceptfor personalization based on the PageRank model is the linearity theorem asintroduced in [21]. The theorem states that for any personalization vectors δ1,δ2

and weights w1,w2 with w1 +w2 = 1, the following equality holds:

PPV(w1δ1 +w2δ2) = w1PPV(δ1)+w2PPV(δ2) (4.11)

The linearity theorem states that personalized PageRank vectors PPV can becomposed as the weighted sum of PageRank vectors. Equation (4.12) shows how toderive the weighted sum of personalized authority ranking scores using Eq. (4.10).The goal is to obtain a structure as depicted by the right part of Eq. (4.11). Theweight wq is associated with a particular keyword q with wq = 1

|Q| for uniformweights and ∑q wq = 1.

A(o;Q) =(1−λ )δ ′P(o;Q)+λ 2 ∑

(p,o)∈EP

∑(p,u)∈EP

wpowpuA(u;Q)

=(1−λ ) ∑q∈Q

wqδ ′P(o;q)+λ 2 ∑

(p,o)∈EP

∑(p,u)∈EP

∑q∈Q

wqwpowpuA(u;q)

= ∑q∈Q

wq(1−λ )δ ′P(o;q)+ ∑

q∈Q

wqλ 2 ∑(p,o)∈EP

∑(p,u)∈EP

wpowpuA(u;q)

= ∑q∈Q

wq[(1−λ )δ ′

P(o;q)+λ 2 ∑(p,o)∈EP

∑(p,u)∈EP

wpowpuA(u;q)]

= ∑q∈Q

wq[A(o;q)

]

(4.12)

As stated before, the benefit of the model is the ability to precompute authorityscores for particular topics, save them in a database, and aggregate the precomputedauthority scores later at query time. Suppose the set of topics, as extractedby the Topic Analyzer, is given as T = T1,T2, . . . ,Tn. For each topicauthority scores are calculated A(o;T1),A(o;T2), . . . ,A(o;Tn) and utilized by theAuthority Aggregator to compute

A(o;Q) = ∑q∈Q

wqA(o;Tq) (4.13)

where Tq is the topic matching query keyword q. Next, we describe the time-awareauthority model.

4.3.4 Time-Aware Authority

We have extensively discussed the notion of authority and the idea of computingauthority scores for individual topics that can be aggregated at query time. Nowwe turn to the definition of the personalization vectors δP and δO. Recall, δP


holds personalization weights for projects and δO holds personalization weights fororganizations. For δP we use a straightforward model to calculate personalizationweights

δP(p) =funding(p)

∑proj∈VPfunding(proj)

(4.14)

where funding(p) depicts the monetary funding received by project p. For simplicity,we do not consider the query context Q for the project-based personalization vector.

The next discussion is related to the concept of time-aware and topic-basedauthority ranking. Thus, we establish metrics to calculate the personalizationweights of δO. Here topic-based personalization and time-aware weighting isapplied. Recall that a topic is identified by a single keyword. Organizations typicallyperform numerous projects that are related to one or more topic(s). Thus, eachorganization has a set of topics including topic frequency associated with it.Furthermore, frequencies of topics are counted by year. An example for such datawould be (“OrgA”, 2011, “services”, 5) where “OrgA” is the organization name,2011 the specific year, “services” the given topic and the number 5 an exampleof a frequency count. As a first step let us define the weight function WT (o,y;Tx)that obtains the frequency count of organization o in year y for some topic Tx.The frequency count is based on how many projects related to the given topic theorganization has started in the year (i.e., the year when signing the project contract).To establish the notion of positive or negative change in topic specific weights, wedefine the weight deviation function WT

Δ (o,y;Tx) as follows:

WTΔ (o,y;Tx) = WT(o,y;Tx)− 1

|Y| ∑y′∈Y

WT (o,y′;Tx) (4.15)

Deviation in this context means the weight WT(o,y;Tx) in year y minus theaverage weight with regards to topic Tx. Straightforwardly, a positive sign meansincreasing topic-based weight, a negative sign means decreasing topic-based weightas a result of being below the average, and 0 means no change in topic-basedweights (i.e., through constant rate of projects related to topic Tx). This definition isquite simple and captures already a notion of “trend” by analyzing the temporalproject history of an organization. The positive/negative sign shows increasingor decreasing trend. However, WT

Δ (o,y;Tx) just analyzes the trend with respectto organization o without considering the weights and thus performance of otherorganizations. Personalization for authority ranking in collaboration networks mustbe performed by considering weights in relation to all other organizations. Forbrevity, let us define the set α = WT(o1,y;Tx),WT (o2,y;Tx), . . . ,WT(on,y;Tx)with o1,o2, . . . ,on ∈ VO. Let us define the trend Tr(o;Tx) of organization o withrespect to topic Tx as:

Tr(o;Tx) = ∑y∈Y

wy

[WT(o,y;Tx)

max(α)×WT

Δ (o,y;Tx)

]

(4.16)

4.4 Structural Importance Model 67

Tr(o;Tx) is based on the trend for topic Tx over the years Y = y1,y2, . . . ,ynwhere yn is the most recent year, yn−1 the previous year and so forth (ordered byrecency). The first term within the square brackets measures the topic based weightin relation to the community performance in Tx by dividing WT(o,y;Tx) by max(α).For the top-performing organizations having the most numbers of projects related toTx in year y the term becomes 1. The term is multiplied by the organization specificweight deviation function WT

Δ (o,y;Tx). The weight wy puts more emphasis on recentyears (recency factor) by being calculated as wy ∈ 1

|Y| ,1

|Y|−1 ,1

|Y|−2 , . . . ,1.Finally, the personalization vector δO needs to be assigned by matching orga-

nizations having performed projects related to Tx and trend values Tr need to bemapped to a positive interval. This is done because δO represents a probabilitydistribution (for theoretical foundations related to personalized PageRank see, forexample, [20]). Let us define the set β = Tr(o1;Tx),Tr(o2;Tx), . . . ,Tr(on;Tx) witho1,o2, . . . ,on ∈ VO.

δO(o;Tx) =

1− max(β )−Tr(o;Tx)

max(β )−min(β ) , if matches(o;Tx)

0 , otherwise(4.17)

The function matches(o;Tx) checks if o has performed projects related to Tx andevaluates to true or false. To evaluate a query Q = T1,T2 a simple aggregation isperformed

A(o;T1,T2) = w1A(o;T1)+w2A(o;T2) (4.18)

where A(o;T1) is personalized for T1 and A(o;T2) is personalized for T2. In otherwords, both A(o;T1) and A(o;T2) hold topic-based and time-aware authority scoresfor all o ∈ VO.

4.4 Structural Importance Model

The previous section explained in detail the authority model and ranking approach.Here we turn to the second criteria used in our overall ranking model. We define thenotion of structural importance and detail a metric to calculate the importance. Theobtained ranking scores for structural importance are used as a second parameterin the AHP-based aggregation [i.e., the AHP parameter SI(o;Q)]. By followingthe notion of structural holes as coined by Burt [16, 17], structural holes are anopportunity to broker the flow of information between people in an organizationalor social network. As an example, managers often act as information brokers as theytalk to many people in the project.

Structural importance captures the ability of a network node to broker infor-mation between its neighbors (in our context organizations). A node can do soif potential “information gaps” (or buffers) arise in the network. A broker can


also be seen as a mediator that helps establishing communication between othernodes. A project partner with “brokerage” capabilities is often important in projectconsortia to help establish and facilitate communication among other partners. Asan example, a project consortium may be lead by an academic partner who is incharge of coordinating the project from an administrative and scientific point ofview. Typically, exploitation and further use of project results is an important issue inresearch projects. However, the consortium leader may not be the optimal partner fortransferring (or “translating”) scientific results to business. Thus, there may be a gapbetween technical/scientific results and exploitation of results within an industrialcontext (e.g., implementing novel solutions within an industrial environment).With regards to this example, an organization may act as a broker by mediatingcommunication and transferring the knowledge to an industrial partner within theproject.

Thus, structural importance essentially focuses on mediation capabilities ofan organization as opposed to expertise/authority. Such mediators help runningprojects more effectively and efficiently by (a) establishing communication betweenpotentially disconnected network segments that have not communicated before and(b) help making communication more fluid and efficient. To be able to act as abroker, gaps must exist in the network because otherwise a node looses its abilityto establish communication. The notion of redundancy provides means to expressthe existence of such gaps. If there is high redundancy in terms of network edgesand communication paths in a network, the need to fill structural gaps may be verylimited. On the contrary, if a network is highly segmented and only few nodesconnect individual segments, the need for brokers and mediation opportunities maybe very high.

Let us consider a graph as depicted by Fig. 4.2. Here the graph model GOC isused that consists of organizations and collaboration relations as undirected edges.Each node depicts an organization with a,b,c,o,r,u,v,z⊂ VO. A circle surroundsnodes that belong to a particular expertise area or community identified through Aand B. A query may be formulated to match the nodes and edges in either QA or

Fig. 4.2 Network structureto illustrate structuralimportance metric

a

c

o

u

z v

r

QA QB

b

4.4 Structural Importance Model 69

QB or both Q = QA ∪QB = TA,TB. An edge (v,u) ∈ GOC has a weight whichis based on the number of performed projects between v and u. The weight isdynamically assigned depending on the query context Q. For example, the weightof the edge (o,z) ∈ GOC may be different in QA and QB depending on the jointprojects performed by o and z (i.e., if the projects match QA or QB or both). SupposeQ=QA∪QB, the node o has the highest number of non-redundant edges in the graphbecause it connects the node sets a,b,c and u,v,z which are only reachablevia o. Thus, o has a unique position within the network because o is able to controlthe information flow between both node sets. Furthermore, only o and z belong toboth expertise areas A and B but only o is connected to a,b,c in A. Let us defineSI(o;Q) as

SI(o;Q) = ∑u∈N(o)

[

1− ∑v∈N(u)

WQN (o,v;Q)WQ

M(u,v;Q)

]

(4.19)

where v ∈ u,o and N(o) the set of o’s neighbors. For SI(o;Q), we followBurt’s measure of the effective size of a node’s network [17]. Here the notion ofstructural holes is not applied to people-based social networks, but to organizationcollaboration networks (i.e., GOC). Conceptually, the effective size is the number ofnodes o is connected to, minus the redundancy in the network.

In contrast to Burt’s definition of effective size, we compute structural impor-tance with respect to the query Q. As an example, while o in Fig. 4.2 is structurallyimportant in Q = QA ∪QB to establish a flow between a,b,c and u,v,z, o isless significant if only QB is considered. Actually, within QB u has a unique positionbecause r is only reachable via u.

The weight WQN (o,v;Q) in Eq. (4.19) shows the query-sensitive normalized edge

weight between o and v and is calculated as

WQN (o,v;Q) = ∑

q∈Q

wqov

∑u∈N(o)wqou

(4.20)

where wqov is the weight associated with (o,v) ∈ EO and calculated as the number

of joint projects between o and v matching the query keyword q. Furthermore, theweight WQ

M(u,v;Q) in Eq. (4.19) depicts the query-sensitive marginal edge weightbetween u and v and is calculated as follows:

WQM(u,v;Q) = ∑

q∈Q

wquv

max(wqun|∀n ∈ N(u)) (4.21)

The marginal weight of u with neighbor v is the weight wquv (also based on the

number of matching joint projects between them) divided by u’s strongest weightwith anyone of its neighbors N(u). If none of the projects match q, the weight isassigned to wq

uv = 0.


4.5 Framework and Ranking Algorithm

4.5.1 Software Framework

As already outlined before, our solution approach to support multi-criteria partnerselection in scientific communities utilizes heavily graph-based models. Graph-based models are widely used in complex- and social-network analysis. Figure 4.3shows the solution framework as a layered view.

Data Management The layer underneath the top-layer shows the data managementthat is responsible for retrieval of project relevant data, managing the needed graphstructures to perform analysis and ranking, and persistence management of analysisand ranking results. From the top-layer (Offline analysis) point of view, the datamanagement can be access via the Data Manipulation Handler in a CRUD(Create-Read-Update-Delete) manner. The Data Provider offers read access tograph structures and offline mining and ranking results. The Project Databasecontains information such as organizations, projects, project involvements, roles,funding, project descriptions, date of project contracts, and project duration. Letus define some basic graph structures that are obtained from information in theProject Database and then managed in the Graph Database.

The Corporate Policies database can only be queried but not modified bythe framework. Essentially, it contains white and black list information with respectto preferred or denied partners. The lists are influenced by mid to long-term businessstrategy.

Based on projects, organizations, and involvement relations we define twotypes of graphs. First, let us define the directed project-organization graphGPO(VP,VO,EP) that is composed of the set projects VP and the set of organizationsVO (VP and VO depicting the vertices in the graph) and the project involvementrelations denoted by the edge set EP where an edge (p,o) ∈ EP points from theproject p to organization o. Each edge (p,o)∈ EP has a weight wpo associated with itdepending on the funding the organization o receives in project p divided by the totalproject funding. This type of graph is being used for organization authority ranking.Let us define the second type of graph as the undirected organization-collaborationgraph GOC(VO,EO) consisting of the set of organizations VO depicting the verticesin the graph and the set of collaboration relations EO. Whereas the edges EP in GPO

are based on project involvement relations pointing from projects to organizations,the edges EO in GOC are undirected and connect two organizations. This type ofgraph is being used for structural importance ranking. Details regarding these twotypes of graph structures will be provided in the following.

Offline Analysis This layer deals with components that are invoked in an offlinemanner (e.g., triggered by changes in the Project Database). The TopicAnalyzer extracts relevant topics from project descriptions by filtering stop wordsand combining synonyms to single topics. Essentially, each topic is identified bya single keyword that has a frequency associated with it to identify popularity of

4.5 Framework and Ranking Algorithm 71

Dat

a m

anag

emen

tO

fflin

e an

alys

isO

nlin

e an

alys

isQ

uer

y p

roce

ssin

g

Topic Analyzer Authority Ranker

Data Provider

AHP Ranker

Structural RankerCost Ranker

Query Frontend

Graph Database

Project Database

Analysis & RankingDatabase

Authority Aggregator

Components for offline analysis

Data management and persistence

Legend:

Components for query processing and matching

Components for online analysis and aggregation

Data flow (simplified)

Trend Analyzer

Data Manipulation Handler Corporate Policies

Fig. 4.3 Software framework

topics. Typical topics in the context of ICT research are, for example, “services” and“internet”. Sophisticated topic models such as cross-topic relations or hierarchicalstructures are not within the focus of this work (e.g., see [29] for hierarchicaltopic models and topic clustering techniques). The Topic Analyzer saves topicinformation in the Analysis & Ranking Database. If new topics are added,other components such as the Authority Ranker are triggered (see later).


The Trend Analyzer calculates trends with regards to organizations’ activitiesin topics. The historical project information is used to calculate a trend in increasingor decreasing number of projects for given topics. In a steady state (neitherincreasing nor decreasing activity), the trend equals 0. Trend information is utilizedby the Authority Ranker to create topic and time-aware authority scores.The detailed mechanisms will be discussed in later sections. The AuthorityRanker calculates numeric values for authority scores. To explain the notion ofauthority as used in this work, participation of an organization in a research project(i.e., involvement relation) is understood as a carrier of authority. By being involvedin certain projects, we assume that organizations develop knowledge with regardsto the projects’ topic(s). An organization is considered to be an authority if it hasextensive or specialized knowledge about a topic. In other words, an organizationmust have collaborated in the context of a topic to be considered as an authority fora given topic.

Online Analysis Previously, the offline analysis components were responsiblefor preparing the information needed to perform online analysis and ranking.Our decision to divide functionality into online and offline analysis was due tocomputational complexity of link-based authority ranking algorithms. Computationof authority at query time without having performed offline computation wouldresult in unacceptable response times at magnitudes of hours or even longer.All information from the previous steps is made available in the Analysis& Ranking Database. The Cost Ranker is a simple ranker that providesa scoring function based on organizations’ average costs. The StructuralRanker calculates numeric values for structural importance scores. The idea ofour structural importance metric is drawn from the notion of structural holes asestablished in a sociological context. To detail the difference between structuralimportance and authority, the notion of authority captures the importance ofan organization with regards to knowledge drawn from past project experience.Structural importance captures a different notion of importance that is based onthe lack of information flow and connectedness of parts of the network. As statedby Burt [16], structural holes are an opportunity to broker the flow of informationbetween people and control the projects that bring together people from oppositesides of the hole. Here the notion of structural holes is not applied to people-basedsocial networks, but to organization collaboration networks (i.e., GOC). The goalof our ranking approach is to identify those organizations who have the ability tobridge structural holes and to allow for the emergence of novel innovative ideasthrough brokerage of information. The Authority Aggregator combines theauthority results of offline computed authority scores.

Query Processing Suppose a coordinator attempts to establish a new consortiumand thus wants to find collaboration partners who are able to join the consortium.Often, previous collaborators are known from first hand collaboration experiencebut in today’s vibrant and fast-paced research environment it is also useful to seethe current community standing of known collaboration partners and to discoverpotential new collaborators. The coordinator is able to specify a keyword-based

4.5 Framework and Ranking Algorithm 73

query Q = q1,q2, . . . ,qn (using the Query Frontend) with the goal of findingmatching organizations that are ranked according to a set of criteria (i.e., cost,structural importance and authority). The idea of our ranking approach is to computeranking scores with respect to certain areas of expertise. The demanded areas ofexpertise are specified via the query Q and matched with topics. Each query keywordqn corresponds to a desired area of expertise. A query returns a ranked list oforganizations based on the demanded set of expertise areas. The AHP Rankeris used to create a composite ranking score S(o;Q) of organization o. The scoreS(o;Q) is given as

S(o;Q) = AHP(A(o;Q),SI(o;Q),C(o)) (4.22)

where A(o;Q) is the organization’s authority score and SI(o;Q) the structuralimportance score with respect to the query Q, and C(o) the cost score. The followingsection shows the calculation steps.

4.5.2 Ranking Algorithm

Here we discuss the computation of the final ranking score. Recall that thecomposite ranking score S(o;Q) of organization o is obtained through the functionAHP(A(o;Q),SI(o;Q),C(o)). Previously we have defined the authority A(o;Q) andthe structural importance SI(o;Q). Cost C(o) is calculated as the average fundingorganization o receives:

C(o) =1

num_projects(o) ∑(p,o)∈EP

funding(p,o) (4.23)

The final aggregation and computation of a composite ranking score is doneusing the AHP algorithm. AHP is a technique for making complex decisionsin a structured way. AHP has been successfully applied in a number of fieldsincluding transportation [33], maintenance and configurations [34], and servicequality assessment [33]. The theoretical background will not be covered in thiswork since AHP is a well explored technique. We refer the reader to [30] for detailsregarding AHP as a decision making technique.

Algorithm 4 shows the main steps at a high level. The input of the algorithm isgiven as the query Q and the organization-collaboration graph GOC. The graph GOC

is used to compute the structural importance scores in an online manner. Next fouressential steps are performed: (1) create map with criteria input scores, (2) setupAHP, (3) perform AHP ranking, and (4) assign final AHP ranking scores to outputmap.

First, the ranking criteria scores are obtained as described in the previous sections(authority Sect. 4.3 and structural importance Sect. 4.4 respectively). These include


Algorithm 4 Multi-criteria ranking algorithmInput: The query Q and the undirected organization-collaboration graph GOC.Compute:

1. Create map for org with individual scores. For each organization o ∈ VO do:

• A(o)← au_score(o,Q)) // authority• SI(o)← si_score(o,GOC ,Q) // struct. imp.• C(o)← avg_cost(o) // cost• Add to map (o,A(o),SI(o),C(o))

2. Setup AHP attributes weights and desirability.

• Auth. attributes (“authority′′ ,wau,+1)• Struct. attributes (“structure′′ ,wsi,+1)• Cost attributes (“cost′′,wcost,−1)

3. Perform AHP ranking using output from previous steps.

• Compute the vector of criteria weights.• Compute the matrix of organizations scores.• Rank the organizations.

4. Assign final AHP ranking scores to map S. For each organization o ∈ VO do:

• S(o)← ahp_score(o)) // final score

Output: Ranked organizations based on query Q and according to composite AHP rankingscore.

authority, structural importance and cost. Using a map, each criteria score isassociated with an organization. The map generated in this step is passed as anargument to the AHP ranking in step 3.

Second, AHP attributes are setup by assigning the weights wau,wsi,wcost to eachcriteria with [∑w w] = 1. In addition, the desirability attribute is assigned to denote ifa certain criteria is desired or not. In particular, authority and structural importanceshould be high (desirability = +1) to obtain a better AHP ranking score whereascost should be low to obtain a better ranking score (desirability =−1).

Third, AHP ranking is performed by using the previously setup attributes and theoutput map of step 1. The step 3 of Algorithm 4 is decomposed into the followingsteps:

• Compute the vector of criteria weights: In this step rating of the relative priorityof the criteria is done by assigning a weight value to the more important criteria.The weight values are taken from the previous step of the algorithm (step 2). Theweight assignment is done through a pairwise comparison of the criteria. Afterthat, the resulting weights are normalized and the average is computed for eachcriteria.

• Compute the matrix of organizations scores: Here the score for each organizationis determined by computing how well organization o meets some criteria Y.Afterwards, the organizations’ scores are normalized and averaged.

4.6 Evaluation 75

• Rank the organizations: In a final step the organizations’ scores are combinedwith the criterion weights to produce an overall score for each organization. Theextent to which the organizations satisfy the criteria is weighted according tothe relative importance of the criteria. The final score is simply computed as aweighted sum.

Note, in our case criteria are contrasting by demanding that organizations shouldhave high authority but low cost. In general, the organization that is recommendedfor selection (top-ranked in final output S) is not necessarily the one which optimizeseach single criterion, but rather the organization which achieves the most suitabletrade-off among the different criteria. This behavior makes AHP a very flexible andpowerful tool for multi-criteria partner selection.

Forth, the AHP scores are saved in a final score map S. Organizations are rankedin descending order by ranking score.

4.6 Evaluation

Here the evaluation of the proposed concepts and model is presented. We haveselected a dataset of a scientific collaboration environment to test the concepts.

4.6.1 Description of Dataset

The data is based on ICT research projects having received grants under the EU’sSeventh Framework Programme (FP7). The data as described in detail in [31] andcovers a period from 2007 to 2011. Research projects have multiple partners and anorganization can be the partner of multiple projects. To date, the FP7 ICT programhas allocated funding to 1,469 projects for a total Union funding of 4,979,301,152Euro. This results in 14,781 participations by 4,718 distinct legal entities. TheFig. 4.4 (Source: European Commission2) shows a clear geographic concentration,across and within the member states.

Our evaluation is performed as follows. First we select the top-20 organizations(ranked by degree and given in Table 4.2) and compute metrics for those 20organizations with regards to popular topics. This evaluation is called top-k rankevaluation and is presented in Sect. 4.6.2. Second we compute cross topic rankingstatistics such as overlap similarity and Kendall’s τ rank difference. This evaluationis presented in Sect. 4.6.3 along with the definition of relevant ranking metrics.

Table 4.1 gives an overview of popular (project) topics extracted from projectinformation (see [31] for details). Frequency is measured by counting appearance

2http://observatory.euroris-net.eu/euroris/files/download/261-316.

http://observatory.euroris-net.eu/euroris/files/download/261-316


> EUR 50 mln.

> EUR 20 mln.

> EUR 10 mln.

Fig. 4.4 Location of organisations the top 50 for EC funding from FP7-ICT

Table 4.1 Popular projecttopics and frequencies

Topic Frequency

Systems 4126

Internet 2729

Networks 1771

Services 1247

Software 1224

Health 1115

Embedded 1054

Transport 890

Efficiency 849

Energy 849

of the topic string within project names and project short descriptions of each projectpartner involvement record (association of organization to project including receivedfunding). In total, we extracted 170 topics after performing some automatic andmanual processing of the data. Table 4.1 shows the top-10 topics with the highestfrequencies among the 170 topics.

4.6 Evaluation 77

Table 4.2 Top-20 organizations ranked by degree

PNr Name Cost Degree Struct. rank score

1 Fraunhofer-Gesellschaft zur Foerderung derAngewandten Forschung E.V

524,515 272 516

2 Centre National De La Recherche Scien-tifique

159,983 153 175

3 Commissariat A L Energie Atomique EtAux Energies Alternatives

235,911 137 201

4 Ecole Polytechnique Federale De Lausanne 290,827 97 146

5 Consiglio Nazionale Delle Ricerche 455,650 96 154

6 Valtion Teknillinen Tutkimuskeskus 293,240 95 190

7 Institut National De Recherche En Informa-tique Et En Automatique

799,995 94 127

8 Interuniversitair Micro-Electronica CentrumVzw

964,195 90 140

9 Eidgenoessische Technische HochschuleZurich

389,544 90 89

10 Telefonica Investigacion Y Desarrollo Sa 636,818 76 131

11 Katholieke Universiteit Leuven 711,085 69 98

12 SAP AG 1221,665 68 168

13 Universidad Politecnica De Madrid 331,315 65 136

14 Atos Origin Sociedad Anonima Espanola 628,296 62 215

15 Imperial College Of Science, TechnologyAnd Medicine

286,930 61 56

16 Politecnico Di Milano 531,975 61 98

17 Kungliga Tekniska Hoegskolan 415,684 59 85

18 Technische Universiteit Delft 622,286 58 88

19 Karlsruher Institut Fuer Technologie 250,599 56 76

20 Technische Universitaet Wien 287,363 55 87

4.6.2 Top-k Rank Evaluation

Table 4.2 shows the top-20 organizations ranked by their degree in GPO (project-organization graph). The first column (PNr column) is a unique key associated withan organization and used throughout this section to identify a top-20 organization.The second column (Name column) shows the organizations’ legal name. The thirdcolumn (Cost column) shows the average organization cost using Eq. (4.23). Theorganization indegree (Degree column) is analog to the project count as projectsp ∈ VP point to organizations o ∈ VO. The degree-based rank will be used as abaseline ranking. This baseline results will be compared with AHP-based rankings.We have selected the degree-based rank as a baseline algorithm to show the impactof personalization based on topic information and time-aware authority ranking.Notice, the degree-based rank has no topic bias. In addition, the degree-based rankwas selected because it already captures some notion of importance with regards to


organization reputation. The last column (Structural Rank Score column) showsthe structural rank score [using Eq. (4.19)] over all topics in Table 4.1. A higherscore is better.

It is noticeable that the organization 14 has a particular high structural rankscore in relation to its degree-based rank position. Organization 1 has an exceptionalhigh structural rank score but has also the most projects within the ICT frameworkprogram. Notice, however, the degree is calculated using GPO and the structural rankusing GOC.

To compare AHP results with the rankings in Tables 4.2, 4.3 and 4.4 list detailedmetrics for selected topics. We have selected eight out of the ten topics fromTable 4.1 due to space reasons. Each metric is computed for each topic in Tables 4.3and 4.4 respectively. Using GPO, the degree D is based on matching projects only.Projects are matched against the given topic as depicted in the headings of Tables 4.3and 4.4. The authority A is calculated for respective topics. The position P is therank position index as obtained by the AHP rank using Eq. (4.22).

AHP is setup with the weights 0.4 for authority, 0.2 for the structural importancerank and 0.4 for cost. Thus, authority and cost are given slightly higher weights thanstructural importance. We regard authority as highly desirable but at the same timecost should be kept at an acceptable level. After that structural importance is alsoa desirable property but not equally important as the other criteria. However, sinceour approach is flexible weights can be adjusted as demanded.

The position change Ch is computed between degree-based ranking positionsand authority based ranking positions in the following manner

Ch(o) = pos(A(o;Tx))− pos(degree_rank(o)) (4.24)

where pos() retrieves the position index by ranking score. This lets us show howrankings are influenced by authority. Finally, the trend Tr is computed by using theEq. (4.16). As state before, the sign has the following meaning:

Tr =

⎧⎨

⎩

positive sign , if trend is increasingnegative sign , if trend is decreasing0 , otherwise

To show the relationship between two metrics, at the bottom of Tables 4.3and 4.4 we show the correlation coefficient among various metrics. As usual,the correlation coefficient takes a value between [−1,1], with 1 or −1 indicatingperfect correlation. A positive correlation shows a positive association betweenthe variables. Thus, increasing values of one variable correspond to increasingvalues of the other variable. On the other hand, negative correlation indicates anegative association between the variables. Thus, increasing values of one variablecorrespond to decreasing values of the other variable. A correlation value close to 0indicates no association between the variables.

4.6 Evaluation 79

Tab

le4.

3To

p-20

list

ofor

gani

zati

ons:

topi

csin

clud

e“n

etw

orks

”,“s

yste

ms”

,“so

ftw

are”

,and

“ser

vice

s”

Net

wor

ksSy

stem

sSo

ftw

are

Serv

ices

PNr

DA

PC

hT

rD

AP

Ch

Tr

DA

PC

hT

rD

AP

Ch

Tr

129

0.11

12

1.08

700.

861

07.

977

0.15

23

0.23

110.

045

28−0

.03

210

0.03

3167

7−0

.13

390.

0016

196

−0.0

33

0.04

1227

0.00

30.

0415

350.

00

320

0.05

1225

0.19

440.

105

40.

791

0.04

720

0.00

10.

0410

280.

00

45

0.03

3066

5−0

.05

260.

0119

870.

011

0.04

1627

0.00

20.

0417

340.

00

54

0.05

1627

0.17

190.

153

−11.

446

0.06

106

0.04

70.

0411

130.

06

612

0.07

65

0.56

200.

058

150.

391

0.04

919

0.00

70.

0225

630

−0.1

8

716

0.09

5−2

0.85

250.

0810

50.

737

0.04

1844

−0.0

19

0.04

2261

8−0

.02

86

0.04

2761

0.01

270.

146

−31.

210

0.01

4235

70.

001

0.04

1925

0.00

97

0.04

3231

0.10

330.

0912

−10.

871

0.04

2918

0.00

10.

0434

270.

00

1038

0.00

681

671

−0.5

56

0.01

2667

0.06

110.

0137

353

−0.0

712

0.04

1816

0.03

113

0.03

5766

1−0

.05

230.

0518

80.

482

0.04

2525

0.00

20.

0432

320.

00

127

0.07

72

0.51

60.

0115

560.

0617

0.22

3−1

00.

4318

0.17

2−1

01.

25

136

0.06

136

0.38

120.

0122

670.

0610

0.03

2034

9−0

.03

110.

0063

662

5−0

.34

149

0.10

3−1

01.

065

0.02

938

0.12

70.

055

10.

0311

0.11

3−9

0.72

150

0.00

689

667

0.00

190.

0442

130.

381

0.04

5728

0.00

20.

0459

310.

00

163

0.04

4488

0.00

220.

0714

−30.

665

0.05

19−2

0.03

50.

0430

160.

02

179

0.04

4248

0.04

190.

0913

−80.

831

0.04

3421

0.00

10.

0437

250.

00

183

0.03

6064

2−0

.03

250.

157

−15

1.48

00.

0010

134

80.

002

0.04

3626

0.00

192

0.04

6584

0.00

170.

0249

260.

181

0.04

4533

0.00

10.

0447

330.

00

203

0.04

5296

0.00

170.

0240

300.

153

0.04

326

0.01

50.

0613

−90.

20

D1.

00.

20.

30.

00.

11.

00.

8−0

.4−0

.10.

81.

00.

6−0

.40.

10.

61.

00.

50.

20.

30.

5

A1.

0−0

.6−0

.71.

01.

0−0

.4−0

.31.

01.

0−0

.5−0

.51.

01.

0−0

.4−0

.41.

0

P1.

00.

6−0

.51.

00.

2−0

.41.

00.

5−0

.41.

00.

5−0

.3

Ch

1.0

−0.6

1.0

−0.3

−0.3

0.0

1.0

−0.4


Tab

le4.

4To

p-20

list

ofor

gani

zati

ons:

topi

csin

clud

e“h

ealt

h”,“

embe

dded

”,“i

nter

net”

,and

“ene

rgy”

Hea

lth

Em

bedd

edIn

tern

etE

nerg

y

PNr

DA

PC

hT

rD

AP

Ch

Tr

DA

PC

hT

rD

AP

Ch

Tr

112

0.05

322

0.07

110.

122

100.

5147

0.04

311

0.47

170.

431

00.

56

27

0.06

109

0.26

120.

234

11.

2214

0.03

2410

40−0

.21

00.

0025

609

0.00

33

0.06

1114

0.19

150.

273

−11.

3823

0.03

833

0.12

30.

0510

120.

01

45

0.07

95

0.30

80.

0423

410.

048

0.03

2111

4−0

.02

40.

0314

240.

00

58

0.06

1411

0.21

10.

0325

490.

0011

0.03

1536

0.13

30.

0412

140.

01

64

0.04

2037

0.00

30.

0319

460.

0115

0.05

65

0.67

140.

087

30.

07

76

0.03

3769

9−0

.05

120.

215

−21.

0727

0.05

70

0.83

20.

0319

240.

00

80

0.00

714

706

0.00

100.

0520

250.

116

0.03

2065

0.01

00.

0049

602

0.00

99

0.02

710

703

−0.1

611

0.02

5950

6−0

.05

90.

0333

570.

051

0.03

3220

0.00

101

0.04

2834

0.00

20.

0333

480.

0054

0.00

1044

1034

−1.4

12

0.03

1720

0.00

115

0.04

2515

0.07

20.

0341

480.

005

0.03

2952

0.07

20.

0326

230.

00

120

0.00

713

703

0.00

50.

0417

280.

0828

0.09

2−1

02.

606

0.21

2−1

00.

27

1312

0.06

130

0.25

50.

0714

70.

2618

0.03

2361

0.04

20.

0316

220.

00

144

0.04

1938

0.00

00.

0024

510

0.00

200.

121

−13

4.42

50.

049

60.

01

159

0.11

4−1

10.

675

0.06

297

0.23

10.

0377

950.

003

0.06

18−2

0.05

164

0.04

3432

0.00

90.

0432

230.

108

0.03

2733

0.12

30.

0324

160.

00

170

0.00

720

699

0.00

90.

0528

140.

1510

0.03

3952

0.05

00.

0016

359

70.

00

183

0.04

3112

0.04

100.

1510

−90.

724

0.03

4288

0.00

00.

0013

759

50.

00

193

0.04

4440

0.00

60.

1112

−50.

543

0.03

6012

00.

004

0.05

21−5

0.02

204

0.06

15−1

00.

2711

0.02

6049

4−0

.04

80.

0352

1004

−0.0

42

0.03

3635

0.00

D1.

00.

6−0

.4−0

.30.

41.

00.

7−0

.2−0

.10.

61.

00.

10.

60.

20.

11.

00.

8−0

.4−0

.40.

8

A1.

0−0

.8−0

.80.

91.

0−0

.7−0

.41.

01.

0−0

.4−0

.41.

00.

8−0

.4−0

.40.

8

P1.

00.

9−0

.41.

00.

7−0

.71.

00.

6−0

.41.

00.

7−0

.3

Ch

1.0

−0.5

1.0

−0.4

1.0

−0.4

1.0

−0.2

4.6 Evaluation 81

Table 4.3 shows the results for the topics “networks”, “systems”, “software”,and “services”. The organization PNr 1 has been ranked by AHP at position 1 in“networks”, position 1 in “systems”, position 2 in “software”, and position 5 in“services”. With regards to “services”, a negative trend is shown for PNr 1 andthus the position has dropped in this topic. In the other topics, positive trend can beobserved and thus the ranking position was mostly preserved. With regards to thetopic “systems”, a very good trend of 7.97 can be observed and highest authorityscore of 0.86 within the table. As one can see, by applying our approach, muchmore fine-grained ranking can be performed by considering topic information.

With regards to correlation, A always correlates perfectly with Tr because time-aware authority takes trend through personalization in account. D shows goodcorrelation with Tr in the topic “systems”. This is a result of the broad scope of“systems” and the high frequency of the topic within projects (see also Table 4.1).

Table 4.4 shows the results for the topics “health”, “embedded”, “internet”, and“energy”. The organization PNr 1 was only ranked in “energy” at position 1 butnot for the other topics. One exceptional high change in the ranking position canbe seen for organization PNr 10 in “internet” which ranks by AHP at 1044. PNr 10had some substantial amounts of projects with regards to “internet” in the past (54matching projects as indicated by D) but the trend is highly negative (Tr is −1.41,which is the lowest in the table) and time-aware A is 0.00. Thus, we believe thatnegative trend and limited recent activity in the context “internet” justifies a changein the rank position.

With regards to correlation, A only correlates perfectly in “embedded” and“internet” but not for the other topics (although a high correlation is still achieved).D shows good correlation with Tr in the topic “energy”. Like in Table 4.3, A showsgood correlation with P. Indeed, authority is part of AHP’s ranking criteria so acorrelation can be expected. Recall, higher authority yields better positions. Thus,negative correlation means increasing values of authority correspond to decreasingvalues of the rank position (lower position value is better).

Based on the data in Tables 4.3 and 4.4, average values of degree, position,and change are depicted in Fig. 4.5 and average values of authority and trend areshown in Fig. 4.6. Average values are based on the metric values of the top-20 list oforganizations. Again, the baseline algorithm for ranking is the degree-based rank.

The topic “networks” has the highest average value with regards to change. Thus,AHP rankings based on topic information have significant impact on the rankingposition of organizations and a lot of changes are observed within the top-20 list.Also the topics “health” and “internet” yield high changes on average. However,only “health” yields also high average values with regards to position. This meansthat organizations ranked within top-20 positions by the degree-based rank would beranked at much higher positions by AHP in the “health” topic. As mentioned before,since “systems” is a very broad topic also the positions by AHP are quite similarwhen compared with the degree-based rank (the lowest average value as depicted byFig. 4.5). The average degree does not significantly change across topics. Generally,topic based personalization has the effect that significant changes of rank positioncan be expected.


0.00

50.00

100.00

150.00

200.00

250.00

netw

orks

syst

ems

softw

are

serv

ices

heal

th

embe

dded

inte

rnet

ener

gy

Degree Position Change

Fig. 4.5 Average degree, position, and change

0.000.100.200.300.400.500.600.700.800.901.00

netw

orks

syst

ems

softw

are

serv

ices

heal

th

embe

dded

inte

rnet

ener

gy

Authority Trend

Fig. 4.6 Average authority score and trend

Next, Fig. 4.6 shows the average values for authority and trend. The topic“systems” shows the highest average authority and the highest average trend. Thisobservation is also consistent with the previous discussion. The topic “software”

4.6 Evaluation 83

shows the lowest trend and also a low average value for authority. In general,deviations in authority across topics is very high.

To summarize the main observations in this section:

• Our proposed model enables more fine-grained ranking by considering topicinformation.

• Authority correlates to a high degree with trend because time-aware authoritytakes trend through personalization into account.

• Generally, topic based personalization has the effect that significant changes ofrank position can be observed.

• Topics that play a role in many projects (having a broad scope) correlate betterwith degree-based ranking. Thus, no significant changes through personalizationcan be expected.

• As a consequence of the previous observation, by focusing on narrow and morespecialized topics organizations with fewer projects are able to build up authorityand are thereby ranked at better positions in those topics.

4.6.3 Statistical Comparison

Here a statistical comparison of ranking techniques is performed. In the previoussection, a top-20 list of organizations was selected (as ranked by the organizations’degree) and evaluated by using different metrics. In this section we use a set overlapand distance based ranking metric to compare the AHP based results with non-personalized rankings including the degree-based rank, a funding based rank, andthe structural rank.

The funding based rank uses the total amount of funding received by anorganization to perform ranking (the higher the total funding the better the rank).The structural importance rank is used in isolation of AHP and compared with theregular AHP using the criteria authority, structural importance, and cost. After thata cross topic comparison is performed by using AHP and authority based rankingsand AHP-based rankings personalized for different topics. AHP is setup with theweights 0.4 for authority, 0.2 for the structural importance rank and 0.4 for cost.

To systematically compare results of two ranking algorithms, let us define twostandard ranking metrics.

OSim@k To measure similarity of top-k sets, let us define overlap similarity asfollows:

OSim@k =Ok1 ∩Ok2

k(4.25)

OSim@k defines the overlap similarity of the top-k sets Ok ranked by algorithm1 and algorithm 2. Each set consists of organizations such that Ok ⊂ VO. The firstalgorithm is always AHP, which has been parameterized using the same weights asdefined previously.


Kendall’s τ The next ranking metric used in this work is the well-known Kendall’sτ metric (for example, see [28]):

Kendall’s τ =2(num_concordant− num_disconcordant)

|VO|(|VO|− 1)(4.26)

Consider the pair of nodes o,u. The pair is concordant if two rankings agreeon the order and disconcordant if both rankings disagree on the order. Denotethe number of these pairs by num_concordant and num_disconcordant respectively.The total number of pairs is given as |VO|(|VO|−1)

2 . Kendall’s τ is defined between theinterval τ ∈ [−1,1]. Kendall’s τ helps analyzing if two ranking algorithms are ranksimilar. If τ equals 1, there are no cases where the pair o,u is ranked in a differentorder.

Table 4.5 shows the comparison results of AHP-based rankings (for the top-10topics in Table 4.1) and the degree-based, funding-based, and structural importancerank. The highest values for OSim and Kendall’s τ are depicted as bold-facenumbers. The topic “systems” clearly shows the highest overlap with the other(non-topic based) rankings. OSim@10, OSim@20, and OSim@50 show the highestoverlap in each topic. This observation is again in line with the previous discussion.Previously “systems” showed the highest average authority and the highest averagetrend within the top-20 list of organizations. The structural importance rank showsthe highest overlap of 0.70 in the top-10 segment (depicted as OSim@10). However,an higher agreement in the rank order as measured through Kendall’s τ is given inthe topic “software”. Kendall’s τ is calculated by using the whole list of rankedorganizations. Whereas the highest overlap of AHP-based rankings with the degree-based, funding-based, and structural importance rank is given in “systems”, higheragreement in terms of Kendall’s τ is given in “software”.

Figure 4.7 shows the comparison results of AHP-based rankings (again for thetop-10 topics in Table 4.1) and the authority-based rankings. Here, for each topicranking is performed using AHP as defined in Eq. (4.22) and authority as definedin Eq. (4.18). The results are then compared using OSim and Kendall’s τ . Furtherdetails are provided in Table 4.6. The first set of rows (1–10) shows OSim@10, thesecond set of rows (11–20) depicts OSim@20, the third set of rows (21–30) depictsOSim@50, and the forth set of rows (31–40) shows Kendall’s τ .

The values below the matrix diagonal (from top-left to bottom right corner) are allset to 0 because of symmetry. For example, overlap similarity OSim for the topics“networks” and “systems” yields the same results as “systems” and “networks”.At the diagonal values comparison of AHP and authority rankings for the sametopic was performed. Thus, high overlap and agreement with regards to OSim andKendall’s τ , respectively, can be observed.

Figure 4.8 shows the average values of OSim@10, OSim@20, OSim@50, andKendall’s τ for each topic. With regards to OSim@10, “health” yields the lowestaverage overlap similarity. The topics “efficiency” and “energy” have the highestoverlap similarities in the top-10 segment.

4.6 Evaluation 85

Tab

le4.

5O

Sim

and

Ken

dall

’sτ

for

com

pari

son

ofA

HP

wit

hde

gree

,fun

ding

,and

stru

ctur

alra

nk

Net

wor

ksSy

stem

sSo

ftw

are

Serv

ices

Tra

nspo

rtE

ffici

ency

Hea

lth

Em

bedd

edIn

tern

etE

nerg

y

Deg

ree

OSi

m@

100.

300.

600.

400.

200.

300.

300.

300.

400.

400.

30

OSi

m@

200.

400.

750.

550.

500.

400.

550.

500.

500.

400.

55

OSi

m@

500.

560.

860.

720.

740.

700.

780.

660.

680.

700.

78

Ken

dall

’sτ

0.44

0.42

0.46

0.45

0.44

0.41

0.41

0.45

0.43

0.41

Fund

ing

OSi

m@

100.

300.

500.

500.

300.

300.

400.

300.

300.

400.

40

OSi

m@

200.

400.

700.

500.

550.

400.

550.

400.

400.

600.

55

OSi

m@

500.

560.

840.

760.

720.

700.

760.

740.

660.

700.

76

Ken

dall

’sτ

0.63

0.54

0.65

0.64

0.60

0.58

0.61

0.61

0.62

0.58

Stru

ctur

alO

Sim

@10

0.40

0.70

0.60

0.40

0.40

0.60

0.30

0.30

0.50

0.60

OSi

m@

200.

400.

700.

650.

550.

450.

600.

400.

450.

600.

60

OSi

m@

500.

560.

860.

800.

820.

800.

840.

660.

720.

680.

84

Ken

dall

’sτ

0.48

0.47

0.51

0.49

0.49

0.46

0.46

0.50

0.46

0.46


Fig. 4.7 OSim and Kendall’sτ for comparison of AHPwith authority-based rankings(detailed numbers areavailable in Table 4.6)

240

35

30

25

20

15

10

5

1

4 6 8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Figure 4.9 shows the comparison results of AHP-based rankings for the top-10topics in Table 4.1 across topics. This comparison shows how ranking results changeby considering different topics. Further details are provided in Table 4.7.

Rows are segmented in the same manner as already described previously. Valuesat the matrix diagonal (from top-left to bottom right corner) are all 1. The valuesbelow the matrix diagonal are all set to 0 for the before mentioned reason.

Figure 4.10 shows the average values of OSim@10, OSim@20, OSim@50, andKendall’s τ for each topic. In OSim@10 the topic “health” results in the lowestaverage overlap similarity followed by the topic “embedded”, which has also lowoverlap similarity. In general, higher average values of OSim@10, OSim@20,OSim@50 as well as Kendall’s τ can be observed when compared with the previousdiscussion. Higher values are the result of the same ranking technique being used(AHP-based rankings) and results being compared across topics. Before the AHP-based rankings were compared with authority, which is only one of the rankingcriteria being used in AHP.

Overall, the overlap in the top-10 segment is on average 42 %, in the top-20segment 49 %, and in the top-50 segment 69 %. This means that around six outof ten organizations in the top-10 would be ranked differently across topics. Thus,personalization using topic information has a strong impact on ranking results. Forthe topic “health”, for example, it has the largest impact with average OSim@10being 23 %. We observe changes of more than 49 % in some topics by looking atOSim@10.

4.7 Conclusions

This work introduced various metrics for importance ranking in scientific collabo-ration environments. We proposed a novel topic-sensitive authority model that isbased on well-establish ranking techniques. We systematically derived a unified

4.7 Conclusions 87

Tab

le4.

6O

Sim

and

Ken

dall

’sτ

for

com

pari

son

ofA

HP

wit

hau

thor

ity-

base

dra

nkin

gs

Net

wor

ksSy

stem

sSo

ftw

are

Serv

ices

Tra

nspo

rtE

ffici

ency

Hea

lth

Em

bedd

edIn

tern

etE

nerg

y

OSi

m@

10N

etw

orks

0.80

0.10

0.30

0.30

0.10

0.30

0.10

0.10

0.80

0.30

Syst

ems

0.00

0.70

0.10

0.10

0.20

0.30

0.00

0.30

0.20

0.30

Soft

war

e0.

000.

000.

600.

400.

100.

300.

000.

100.

300.

30

Serv

ices

0.00

0.00

0.00

0.80

0.00

0.20

0.00

0.10

0.40

0.20

Tra

nspo

rt0.

000.

000.

000.

000.

800.

300.

000.

100.

000.

30

Effi

cien

cy0.

000.

000.

000.

000.

000.

800.

000.

100.

200.

80

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

0.80

0.20

0.10

0.10

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

000.

900.

100.

10

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.70

0.30

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.80

OSi

m@

20N

etw

orks

0.85

0.25

0.35

0.40

0.15

0.30

0.20

0.20

0.70

0.30

Syst

ems

0.00

0.75

0.25

0.20

0.20

0.35

0.25

0.40

0.30

0.35

Soft

war

e0.

000.

000.

650.

350.

200.

350.

250.

300.

400.

35

Serv

ices

0.00

0.00

0.00

0.65

0.10

0.30

0.30

0.20

0.40

0.30

Tra

nspo

rt0.

000.

000.

000.

000.

750.

400.

150.

200.

200.

40

Effi

cien

cy0.

000.

000.

000.

000.

000.

750.

250.

200.

250.

75

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

0.85

0.30

0.20

0.30

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

000.

850.

250.

25

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.75

0.35

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.75

(con

tinu

ed)


Tab

le4.

6(c

onti

nued

)

Net

wor

ksSy

stem

sSo

ftw

are

Serv

ices

Tra

nspo

rtE

ffici

ency

Hea

lth

Em

bedd

edIn

tern

etE

nerg

y

OSi

m@

50N

etw

orks

0.74

0.40

0.48

0.46

0.26

0.42

0.24

0.26

0.70

0.42

Syst

ems

0.00

0.78

0.56

0.56

0.38

0.64

0.48

0.52

0.38

0.64

Soft

war

e0.

000.

000.

720.

640.

320.

540.

420.

400.

480.

54

Serv

ices

0.00

0.00

0.00

0.70

0.36

0.52

0.42

0.38

0.50

0.52

Tra

nspo

rt0.

000.

000.

000.

000.

580.

540.

400.

380.

340.

54

Effi

cien

cy0.

000.

000.

000.

000.

000.

760.

460.

420.

340.

76

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

0.70

0.38

0.28

0.54

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

000.

740.

280.

54

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.66

0.52

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.76

Ken

dall

’sτ

Net

wor

ks0.

880.

430.

610.

580.

520.

510.

510.

560.

770.

51

Syst

ems

0.00

0.82

0.52

0.47

0.47

0.48

0.50

0.61

0.43

0.48

Soft

war

e0.

000.

000.

840.

710.

550.

570.

550.

610.

660.

57

Serv

ices

0.00

0.00

0.00

0.87

0.59

0.52

0.54

0.55

0.64

0.52

Tra

nspo

rt0.

000.

000.

000.

000.

830.

520.

470.

530.

500.

52

Effi

cien

cy0.

000.

000.

000.

000.

000.

850.

460.

520.

490.

85

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

0.87

0.52

0.48

0.44

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

000.

850.

530.

51

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.89

0.47

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.85

4.7 Conclusions 89

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

netw

orks

syst

ems

softw

are

serv

ices

tran

spor

t

effic

ienc

y

heal

th

embe

dded

inte

rnet

ener

gy

OSim@10 OSim@20 OSim@50 Kendall's Tau

Fig. 4.8 Average values of OSim@10, OSim@20, OSim@50, and Kendall’s τ based on Table 4.6

Fig. 4.9 OSim and Kendall’sτ for comparison ofAHP-based rankings acrosstopics (detailed numbers areavailable in Table 4.7)

240

35

30

25

20

15

10

5

1

4 6 8 10

0.1

0

0.2

0.3

0.4

0.5

0.6

0.8

0.7

0.9

1

HITS/PageRank-based model that can be fully personalized. The second metricmeasures organizations’ structural importance based on the notion of structuralholes. In our approach structural importance is computed with respect to certaintopics of interest. Thus, structural importance helps identifying organizations thatmay be valuable partners for strategic alliances. Combined with authority, thisprovides a powerful approach for ranking and discovering new partners. Finally,authority and structural importance are systematically combined with cost. For thatpurpose we utilize AHP to achieve a trade-off among various ranking criteria. Theproposed approach delivers very good results and provides more accurate, topic-sensitive results when compared with other ranking techniques.


Tab

le4.

7O

Sim

and

Ken

dall

’sτ

for

com

pari

son

ofA

HP-

base

dra

nkin

gsac

ross

topi

cs

Net

wor

ksSy

stem

sSo

ftw

are

Serv

ices

Tra

nspo

rtE

ffici

ency

Hea

lth

Em

bedd

edIn

tern

etE

nerg

y

OSi

m@

10N

etw

orks

1.00

0.40

0.40

0.40

0.20

0.40

0.20

0.20

0.80

0.40

Syst

ems

0.00

1.00

0.50

0.30

0.40

0.50

0.10

0.40

0.50

0.50

Soft

war

e0.

000.

001.

000.

600.

300.

500.

100.

200.

600.

50

Serv

ices

0.00

0.00

0.00

1.00

0.20

0.40

0.10

0.20

0.60

0.40

Tra

nspo

rt0.

000.

000.

000.

001.

000.

400.

100.

200.

300.

40

Effi

cien

cy0.

000.

000.

000.

000.

001.

000.

100.

200.

501.

00

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.30

0.20

0.10

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

001.

000.

300.

20

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.50

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

OSi

m@

20N

etw

orks

1.00

0.35

0.45

0.40

0.30

0.40

0.35

0.30

0.65

0.40

Syst

ems

0.00

1.00

0.55

0.50

0.45

0.45

0.40

0.55

0.50

0.45

Soft

war

e0.

000.

001.

000.

600.

400.

550.

400.

350.

550.

55

Serv

ices

0.00

0.00

0.00

1.00

0.35

0.40

0.35

0.30

0.55

0.40

Tra

nspo

rt0.

000.

000.

000.

001.

000.

500.

300.

350.

350.

50

Effi

cien

cy0.

000.

000.

000.

000.

001.

000.

400.

300.

451.

00

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.35

0.30

0.40

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

001.

000.

350.

30

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.45

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

4.7 Conclusions 91

OSi

m@

50N

etw

orks

1.00

0.58

0.62

0.58

0.50

0.52

0.40

0.46

0.76

0.52

Syst

ems

0.00

1.00

0.74

0.74

0.68

0.78

0.68

0.78

0.72

0.78

Soft

war

e0.

000.

001.

000.

820.

700.

720.

600.

640.

740.

72

Serv

ices

0.00

0.00

0.00

1.00

0.72

0.74

0.60

0.62

0.72

0.74

Tra

nspo

rt0.

000.

000.

000.

001.

000.

720.

540.

600.

620.

72

Effi

cien

cy0.

000.

000.

000.

000.

001.

000.

680.

660.

621.

00

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.58

0.54

0.68

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

001.

000.

560.

66

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.62

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

Ken

dall

’sτ

Net

wor

ks1.

000.

560.

720.

670.

630.

620.

600.

660.

860.

62

Syst

ems

0.00

1.00

0.62

0.54

0.57

0.57

0.57

0.70

0.50

0.57

Soft

war

e0.

000.

001.

000.

820.

670.

680.

650.

730.

740.

68

Serv

ices

0.00

0.00

0.00

1.00

0.70

0.62

0.63

0.65

0.72

0.62

Tra

nspo

rt0.

000.

000.

000.

001.

000.

640.

570.

640.

580.

64

Effi

cien

cy0.

000.

000.

000.

000.

001.

000.

550.

630.

561.

00

Hea

lth

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.62

0.56

0.55

Em

bedd

ed0.

000.

000.

000.

000.

000.

000.

001.

000.

600.

63

Inte

rnet

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.56

Ene

rgy

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00


0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

netw

orks

syst

ems

softw

are

serv

ices

tran

spor

t

effic

ienc

y

heal

th

embe

dded

inte

rnet

ener

gy

OSim@10 OSim@20 OSim@50 Kendall's Tau

Fig. 4.10 Average values of OSim@10, OSim@20, OSim@50, and Kendall’s τ based onTable 4.7

In our future work we will study the application of online formation algo-rithms [35] to scientific collaboration networks to suggest competitive alliances andconsortia. The metrics used in the formation algorithm to rank partners will be basedon the techniques as presented in this work.

References

1. D. H. Sonnenwald, B. Cronin, Anonymous, Scientific collaboration: A synthesis of challengesand strategies, in: Annual Review of Information Science and Technology, Vol. 4th, Informa-tion Today, 2007, pp. 2–37.

2. F. Fu, C. Hauert, M. A. Nowak, L. Wang, Reputation-based partner choice promotes coopera-tion in social networks, Phys. Rev. E 78 (2008) 026117. doi:10.1103/PhysRevE.78.026117.

3. C. S. Wagner, L. Leydesdorff, Network structure, self-organization, and the growthof international collaboration in science, Research Policy 34 (10) (2005) 1608–1618.doi:10.1016/j.respol.2005.08.002.

4. Y. Ding, Scientific collaboration and endorsement: Network analysis of coauthorship and cita-tion networks, Journal of Informetrics 5 (1) (2011) 187–203. doi:10.1016/j.joi.2010.10.008.

5. R. Guns, Y. Liu, D. Mahbuba, Q-measures and betweenness centrality in a collaborationnetwork: a case study of the field of informetrics, Scientometrics 87 (1) (2011) 133–147.doi:10.1007/s11192-010-0332-3. URL http://dx.doi.org/10.1007/s11192-010-0332-3

6. M. E. J. Newman, Coauthorship networks and patterns of scientific collaboration, Proceedingsof the National Academy of Sciences of the United States of America 101 (Suppl 1)(2004) 5200–5205. arXiv:http://www.pnas.org/content/101/suppl.1/5200.full.pdf+html, doi:10.1073/pnas.0307545100.

http://dx.doi.org/10.1103/PhysRevE.78.026117

http://dx.doi.org/10.1016/j.respol.2005.08.002

http://dx.doi.org/10.1016/j.joi.2010.10.008

http://dx.doi.org/10.1007/s11192-010-0332-3

http://dx.doi.org/10.1007/s11192-010-0332-3

http://arxiv.org/abs/http://www.pnas.org/content/101/suppl.1/5200.full.pdf+html

http://dx.doi.org/10.1073/pnas.0307545100

References 93

7. N. Lavrac, P. Ljubic, T. Urbancic, G. Papa, M. Jermol, S. Bollhalter, Trust modeling fornetworked organizations using reputation and collaboration estimates, Systems, Man, andCybernetics, Part C: Applications and Reviews, IEEE Transactions on 37 (3) (2007) 429–439.doi:10.1109/TSMCC.2006.889531.

8. S. Milojevil, Modes of collaboration in modern science: Beyond power laws and preferentialattachment, Journal of the American Society for Information Science and Technology 61 (7)(2010) 1410–1423. doi:10.1002/asi.v61:7.

9. L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: Bringing order tothe web, Tech. rep., Stanford University (1998).

10. J. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46(1999) 668–677.

11. D. Schall, Measuring contextual partner importance in scientific collaboration networks,Journal of Informetrics.

12. L. M. Camarinha-Matos, H. Afsarmanesh, Collaborative networks, in: PROLAMAT, 2006,pp. 26–40.

13. S. Goyala, F. Vega-Redondo, Structural holes in social networks, Journal of Economic Theory137 (1) (2007) 460–492.

14. W. Tsai, Social capital, strategic relatedness, and the formation of intra-organizational strategiclinkages, Strategic Management Journal 21 (9) (2000) 925–939.

15. M. S. Granovetter, The strength of weak ties, The American Journal of Sociology 78 (6) (1973)1360–1380.

16. R. S. Burt, Structural holes: The social structure of competition., Harvard University Press,1992.

17. R. S. Burt, Structural holes and good ideas, American Journal of Sociology 110 (2) (2004)349–399.

18. J. Kleinberg, S. Suri, E. Tardos, T. Wexler, Strategic network formation with structural holes,SIGecom Exchanges 7 (3).

19. L. Backstrom, D. Huttenlocher, J. Kleinberg, X. Lan, Group formation in large social networks:membership, growth, and evolution, in: Proceedings of the 12th ACM SIGKDD internationalconference on Knowledge discovery and data mining, KDD ’06, ACM, New York, NY, USA,2006, pp. 44–54.

20. T. Haveliwala, S. Kamvar, G. Jeh, An analytical comparison of approaches to personalizingpagerank, Tech. rep., Stanford University (2003).

21. T. H. Haveliwala, Topic-sensitive pagerank, in: Proceedings of the 11th international con-ference on World Wide Web, WWW ’02, ACM, New York, NY, USA, 2002, pp. 517–526.doi:10.1145/511446.511513. URL http://doi.acm.org/10.1145/511446.511513

22. G. Jeh, J. Widom, Scaling personalized web search, in: Proceedings of the 12th internationalconference on World Wide Web, WWW ’03, ACM, New York, NY, USA, 2003, pp. 271–279.doi:10.1145/775152.775191. URL http://doi.acm.org/10.1145/775152.775191

23. P. Berkhin, Survey: A survey on pagerank computing., Internet Mathematics 2 (1) (2005)73–120.

24. S. Chakrabarti, Dynamic personalized pagerank in entity-relation graphs, in: Proceedings ofthe 16th international conference on World Wide Web, WWW ’07, ACM, New York, NY,USA, 2007, pp. 571–580. doi:10.1145/1242572.1242650. URL http://doi.acm.org/10.1145/1242572.1242650

25. D. Fogaras, K. Csalogany, B. Racz, T. Sarlos, Towards scaling fully personalized pagerank:Algorithms, lower bounds, and experiments, Internet Mathematics 2 (3) (2005) 333–358.

26. H. Deng, M. R. Lyu, I. King, A generalized co-hits algorithm and its application to bipartitegraphs, in: Proceedings of the 15th ACM SIGKDD international conference on Knowledgediscovery and data mining, KDD ’09, ACM, New York, NY, USA, 2009, pp. 239–248.doi:10.1145/1557019.1557051. URL http://doi.acm.org/10.1145/1557019.1557051

27. K. Berberich, M. Vazirgiannis, G. Weikum, T-rank: Time-aware authority ranking, in:S. Leonardi (Ed.), Algorithms and Models for the Web-Graph, Vol. 3243 of Lecture Notesin Computer Science, Springer Berlin Heidelberg, 2004, pp. 131–142.

http://dx.doi.org/10.1109/TSMCC.2006.889531

http://dx.doi.org/10.1002/asi.v61:7

http://dx.doi.org/10.1145/511446.511513

http://doi.acm.org/10.1145/511446.511513

http://dx.doi.org/10.1145/775152.775191

http://doi.acm.org/10.1145/775152.775191

http://dx.doi.org/10.1145/1242572.1242650

http://doi.acm.org/10.1145/1242572.1242650

http://doi.acm.org/10.1145/1242572.1242650

http://dx.doi.org/10.1145/1557019.1557051

http://doi.acm.org/10.1145/1557019.1557051


28. D. Schall, Expertise ranking using activity and contextual link measures, Data Knowl. Eng.71 (1) (2012) 92–113.

29. D. Schall, Service Oriented Crowdsourcing: Architecture, Protocols and Algorithms, SpringerBriefs in Computer Science, Springer New York, New York, NY, USA, 2012.

30. T. L. Saaty, Decision making with the analytic hierarchy process, International Journal ofServices Sciences 1 (2008) 83–98.

31. F. Munisteri, ICT statistical report for annual monitoring 2011, http://ec.europa.eu/digital-agenda/sites/digital-agenda/files/stream_2012_0.pdf (Feb. 2012).

32. A. Y. Ng, A. X. Zheng, M. I. Jordan, Stable algorithms for link analysis, in: Proceedings of the24th annual international ACM SIGIR conference on Research and development in informationretrieval, SIGIR ’01, ACM, New York, NY, USA, 2001, pp. 258–266.

33. P. Ferrari, A method for choosing from among alternative transportation projects, EuropeanJournal of Operational Research 150 (1) (2003) 194–203.

34. A. Certa, M. Enea, T. Lupo, Electre iii to dynamically support the decision maker aboutthe periodic replacements configurations for a multi-component system, Decis. Support Syst.55 (1) (2013) 126–134.

35. A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, S. Leonardi, Online team formationin social networks, in: Proceedings of the 21st international conference on World Wide Web,WWW ’12, ACM, New York, NY, USA, 2012, pp. 839–848. doi:10.1145/2187836.2187950.



http://dx.doi.org/10.1145/2187836.2187950

Chapter 5Social Broker Recommendation

Abstract In this chapter we propose novel socially-based models for the composi-tion of Professional Virtual Communities (PVC). We focus on the notion of brokerswho act as intermediaries between separated communities. We introduce a brokerdiscovery and ranking approach utilizing a link-based broker importance model. Weevaluate our approach through a service-oriented testbed and real community dataobtained from the European Union’s FP7 research program.

5.1 Virtual Organizations

The rapid advancement of ICT-enabled infrastructure has fundamentally changedhow businesses and companies operate. Global markets and the requirement forrapid innovation demand for alliances between individual companies [9]. VirtualOrganizations (VO) are an important concept in such dynamic environments. Basedon [18], a virtual organization can be defined as follows: an inter-organizationalvirtual organization is a temporary network organization, consisting of independententerprises (organizations, companies, institutions, or specialized individuals) thatcome together swiftly to exploit an apparent market opportunity. The enterprisesutilize their core competencies in an attempt to create a best-of-everything organi-zation in a value-adding partnership, facilitated by Information and CommunicationTechnology (ICT). As such, virtual organizations act in all appearances as a singleorganizational unit.

Web services and service-oriented computing offer well established standardsand techniques to model and implement interactions spanning multiple organi-zations. Collaborative service-based systems are typically knowledge intensivecovering complex interactions between people and software services. In suchecosystems, flexible interactions commonly take place in different organizationalunits. The challenge is that top-down composition models are difficult to apply inconstantly changing and evolving service-oriented collaboration system. There aretwo major obstacles hampering the establishment of seamless communications andcollaborations across organizational boundaries:

• the dynamic discovery and composition of resources, people and services, and• flexible interactions between people located in different departments or

companies.


95

96 5 Social Broker Recommendation

Principles found in social network theory are promising candidate techniquesto assist in the formation process and to support flexible and evolving interactionpatterns in cross-organizational environments. In social networks, relations andinteractions typically emerge freely and independently without restricted pathsand boundaries. Research in social sciences has shown that the resulting socialnetwork structures allow for relatively short paths of information propagation (thesmall-world phenomenon [46]). While this is true for autonomously forming socialnetworks, the boundaries of collaborative networks are typically restricted due toorganizational units and fragmented areas of expertise.

We propose social network principles to bridge segregated collaborative net-works. The theory of structural holes is based on the idea that individuals can benefitfrom serving as intermediaries between others who are not directly connected [7, 8].Thus, such intermediaries can potentially broker information and aggregate ideasarising in different parts of a network [21, 38, 40]. The novelty of the present workis the combination of social principles for the discovery of brokers and service-oriented collaboration infrastructure through Human-Provides Services (HPS) [34,39]. HPS supports a flexible interaction model suitable for cross-organizationalcollaboration.

In this chapter, the following key contributions are presented:

• We introduce concepts and techniques for the discovery of brokers in virtualcommunities. Here we present SBQL, a query language for discovering andranking brokers. SBQL evolved from our initial definition of social networkquery language (see [40]). SBQL provides a better integration into our miningand ranking framework and allows for fuzzy matches with ranked results, whichwas not possible in our previous query language.

• In this work we present a novel link-based broker importance model. Theimportance model is based on the idea of hubs and authorities in Web-basedenvironments as introduced by [20]. Our proposed model can be personalized toaccount for topic-sensitive rankings.

• We performed various experiments and tests including performance tests of ourBrokerQL implementation using an integrated Web services test environment. Inaddition, we obtained data from the EU FP7 ICT research program (see [27]) totest the broker ranking approach in a real virtual collaboration environment.

To address the challenges related to broker discovery and compositions in PVCs,we apply the following overall methodology:

• We model a PVC as a community consisting of experts who interact andcollaborate by the means of ICT to perform work. Service-oriented technologies(Web services, RESTful services, etc.) provide the technical infrastructure toperform collaborations.

5.2 Background in Distributed Organizations 97

• The emergence and evolution of social trust is modeled by considering variousmetrics. In contrast to a common security perspective on trust, we define socialtrust to rely on the interpretation of previous collaboration behavior. Consideringsocial trust is essential to effectively guide interactions.

• We define patterns for broker discovery in virtual communities. The notion ofbrokers is derived from the well-established theory of structural holes. We definethe persistent exogenous interaction pattern and the triadic exogenous interactionpattern.

• To actually discover brokers in VOs and PVCs, we define the BrokerQL languageand its syntax. A set of BrokerQL applications and examples are discussed toillustrate the features of BrokerQL.

• BrokerQL includes a novel broker importance ranking algorithm that is mathe-matically modeled and discussed in depth. Thus, not only matching of relevantbrokers is performed, but also ranking based on social/collaborative networkinformation.

• All presented concepts are implemented in software and tested through richexperiments. In addition, the quality of broker rankings is validated by usinga real dataset reflecting collaborations in virtual environments.

This chapter is structured as follows. Discussion regarding existing literature ispresented in Sect. 5.2. In Sect. 5.3, we introduce supporting concepts to realize flex-ible interactions and the selection of brokers. In Sect. 5.4, we present a motivatingscenario for the discovery of brokers. We introduce the social broker query languagein Sect. 5.5 followed by the definition of the broker ranking model in Sect. 5.6. Ourevaluation results are discussed in Sect. 5.7. Finally, the chapter is concluded inSect. 5.8.

5.2 Background in Distributed Organizations

Here we review background literature and cluster discussions into relevant topics.We start with discussions related to virtual organizations and communities.

• Virtual Organizations (VO) can be studied from various angles. From amanagement point of view, the concept of virtual organizations that are supportedby ICT has been widely studied (for example, see [9]). The central goal ifthis work is to study VOs and the formation thereof from a social networkpoint of view. One of the most interesting ideas in the social sciences is thenotion that entities are embedded in webs of social relations and interactions [6].The same holds true for today’s VO landscape where a globally distributedorganizations form temporary alliances to work on joint projects. Collaborativenon-hierarchical business networks enable, for example, the execution of com-plex product manufacturing processes [42]. Formation of globally distributedteams or VOs is an important and challenging problem [4]. It has been reportedthat, for example, team assembly mechanisms determine both the structure of the


collaboration network and performance for teams [14]. In addition, social modelfor socio-technical performance have been presented [47]. Thus, it is important toselect a formation strategy carefully according to constraints and objectives [29].An interesting application of socially-based formation principles is social productdevelopment in VOs [5].

• Structural Holes. Here we adopt the theory of structural holes as developedby [7]. The theory is based on the idea that individuals can benefit from servingas intermediaries between others who are not directly connected [8]. Suchintermediaries can potentially broker information and aggregate ideas arising indifferent parts of a network [21]. A number of studies have shown that structuralholes positively relate to a range of social success indicators [1, 7, 8, 31]. Lou andTang [23] define the problem of mining top structural hole spanners in large-scalesocial networks.

• Social Trust. In addition to the notion of brokers in collaborative networks, webuild upon research in the area of social trust. A wide range of computationaltrust models have been proposed by, for example, [2, 19, 26]. Here we focus onsocial trust [11, 44, 48] that relies on user interests and collaboration behavior. Asocial network of people connected by trust relations is a fundamental buildingblock in many of today’s most successful e-commerce and recommendationsystems [13].

• Human Provided Services. Based on social principles such as brokers and socialtrust, we provide technical concepts to support the discovery and ranking of bro-kers. We go beyond social concepts only and propose the (semi-)automatic for-mation of broker-based communities and service-oriented interactions. Service-oriented concepts help to rapidly setup and operate VOs. Here the concept ofHuman-Provided Services (HPS) [34, 39] is adopted which supports flexibleservice-oriented collaboration across multiple organizations and domains. HPSnot only enables flexible interactions but also provides the base infrastructurefor interaction mining and link-based ranking techniques [35, 36]. In contrast toour previous work, we propose a domain specific query language that is gearedtowards the requirements of broker discovery and ranking. Here we introduceBrokerQL, which is a query language specifically suited for broker discoveryin socially-based collaboration networks. BrokerQL is based on an SQL-stylesyntax targeted at social and collaborative networks.

• Querying Social Network Data. To date, to the best of our knowledge there is nogeneric solution to query, rank, and select brokers for assembling VOs or groupsin social networks. Some proposals for graph databases (e.g., GraphDB [15])have features to deal with social network data [41], but lack advanced featuresfor broker discovery. SPARQL [45] is a generic language to query semantic datasuch as RDF graphs, but lacks some important capabilities. SPARQL (1) does notsupport negation (“A does not know B”), which is fundamental in VO formation(competition or conflicting interests), (2) expressing network properties such aspath-length is not straightforward, (3) predicates cannot have properties (e.g. “Atrusts B with trust level high”) and finally (4) fuzzy matches with ranked resultis difficult. A query language for social networks has been proposed in [32].

5.3 Hybrid Compute Environment 99

The language in [32] has some similarities with SBQL (e.g., path functions),however, without supporting the discovery of complex sub communities basedon metrics and interaction mining techniques.

5.3 Hybrid Compute Environment

We adopt various concepts to realize the aforementioned flexible collaborationcommunities, and consider various mechanisms to enable brokering of requests,including flexible collaboration models, automatic management of social trustrelations and modeling of broker patterns and behavior.

5.3.1 Human-Provided Services

An activity model attempts to structure loosely coupled collaborations in service-oriented systems (see [34] for details). Examples of collaborative activities atvarious levels of granularity are “sending emails”, “reviewing a paper”, “organizinga workshop”, and “managing a multi-national research project”. A single activityprovides the basic collaborative data. It describes the work to be done and liststhe involved people. This data is often sufficient in static collaborative settingswhere all members are aware of the overall working environment. In dynamic anddistributed collaborative environments, one has to explicitly model the embeddingof a single activity in the overall collaboration context. The context containsthe structure of activities, dependencies between activities, the temporal flow ofactivities, and history of activity changes. This provides the core structure ofcollaborative work. In addition, the collaboration context describes the involvementof members, their roles, required and applied skills, work artifacts, and resources.An expressive activity model needs to support such relations. Existing activity-based tools provide limited means to connect activities, members, and resources.Web services play a fundamental role in supporting flexible, cross-organizationalcollaboration scenarios.

To support human interactions in a service-oriented manner, we have designedand implemented the HPS framework [34]. HPS enhances the traditional “SOA-triangle” approach by enabling people to provide services using the very sametechnology as implementations of software-based services (SBS) use. In contrastto technologies such as BPEL4People [3], HPS treats people in SOA as first-class citizens letting people define their own services and provide these servicesto the community. HPS closely resembles a crowdsourcing approach [36] where thecollective power of people is used to solve problems that services implemented insoftware cannot yet solve.

The three essential steps performed when using the HPS framework are illus-trated in the following. The conceptual model is also depicted by Fig. 5.1.


(3) unified interactions

Providers

HPSSBS

Requesters

SBS

Service and App Marketplace

(1) publish services (2) discover services

Fig. 5.1 HPS discovery and interaction model overview: (1) publish HPS or SBS to serviceregistry, (2) search and discover HPS and/or SBS, and (3) interact with service provider

1. Publish services. The user can create an HPS and publish the service in a Serviceand App Marketplace. Publishing a service is as simple as posting a blog entryon the Web. It is the association of the user’s profile with an activity depicted asa service.

2. Discover services. The requester can perform a search query to find Human-Provided Services or Apps that are based on Human-Provided Services. Recom-mendations are given to find the most relevant HPS based on, for example, theexpertise of the user providing the service [35].

3. Unified interactions. The framework supports both “machine interactions”between SBS and HPS (a human computation scenario) and human interactionsbetween a human requester and an HPS. Apps offer the interfaces to supportsuch kind of interactions.

5.3.2 Emergence and Evolution of Social Trust

In contrast to a common security perspective on trust, we define social trust to relyon the interpretation of previous collaboration behavior and additionally considerthe similarity of dynamically adapting interests [11, 44]. Especially in collaborativeenvironments, where users are exposed to higher risks than in common socialnetwork scenarios [10], and where business is at stake, considering social trust isessential to effectively guide interactions [24].

5.3 Hybrid Compute Environment 101

In this work, we define social trust as follows [11, 12, 26, 44]:

Trust reflects the expectation one actor has about another’s future behavior to perform givenactivities dependably, securely, and reliably based on experiences collected from previousinteractions.

Not only service interactions, but also human interactions may rely on standardprotocols such as SOAP (e.g., see HPS [34] and BPEL4People [3]). These protocolsare well supported by a wide variety of software frameworks. This fact enables theadoption of various available monitoring and logging tools to observe interactions inservice-oriented systems. Various metrics can be calculated by analyzing interactionlogs. Relation metrics describe the links between actors by accounting for (1)recent interaction behavior, (2) profile similarities (e.g., interest or skill similarities),(3) social and/or hierarchical structures (e.g., role models). However, we arguethat social trust relations largely depend on personal interactions. We model acommunity of actors with their social relations as a directed graph, where the nodesdenote network members, and edges reflect (social) relations between them. Sinceinteraction behavior is usually not symmetric, actor relations are represented bydirected links.

The fundamental approach to automatic interaction-based trust inference isdepicted in Fig. 5.2. As motivated in the introduced use case, people interact toperform their tasks. This work is modeled as activities, that describe the typeand goal of work, temporal constraints, and used resources. As interactions takeplace in the context of specific activities (Fig. 5.2a), they can be categorized andweighted. Interaction logs are used to infer metrics that describe the relationbetween individual actors (Fig. 5.2b), such as their behavior in terms of availabilityand reciprocity.

Our approach considers the diversity of trust by enabling the flexible aggregationof various interaction metrics that are determined by observing ongoing collabora-tions. Finally, available relation metrics are weighted, interpreted, and composedby a rule engine. The result describes trust between the actors with respect toscopes (Fig. 5.2c). For instance, trust relations in a scope “scientific dissemination”

Fig. 5.2 Trust emerging from interactions: (a) interaction patterns shape the behavior of actors incontext of activities; (b) rewarding of behavior and calculation of interaction metrics; (c) inferencein scopes by interpretation of metrics


could be interpreted from interaction behavior of actors in a set of “paper writing”activities. For detailed mechanisms on trust modeling and inference see [44].

5.4 Expert Communities

Here we discuss a PVC environment to introduce our concepts and to demonstratethe role of social relations and the emergence of social trust. Also, we motivethe need for brokers in PVC environments. A PVC is a virtual communityconsisting of experts who interact and collaborate by the means of informationand communication technologies to perform work [9]. In today’s collaborationenvironments, service-oriented technologies (Web services, RESTful services, etc.)are increasingly used to realize a collaboration infrastructure suitable for PVCs. Animportant aspect is the availability of a wide variety of tools and frameworks toimplement service-oriented systems.

5.4.1 Collaboration Scenario

The support of loose coupling, advanced discovery, dynamic binding and variouscomposition mechanisms makes SOA the ideal grounding for Web-enabled PVCs.Let us discuss an actual collaboration scenario in PVCs as depicted in Fig. 5.3.

Various member groups collaborate in the context of five different activitiesa1,a2,a3,a4 and a5 (see Fig. 5.3a). These groups intersect because group membersmay participate in different activities at the same time. The color of the activitydetermines the context and expertise areas an activity is related to. Such activitiesare, for example, the creation of new design specifications or the discussion ofemerging technology standards. Activities are a concept to structure information

Fig. 5.3 Collaboration model for service-oriented PVCs: (a) interactions between PVC membersare performed in the context of activities; (b) social relations and profile areas emerge based oninteractions; (c) concepts in the model

5.4 Expert Communities 103

in flexible collaboration environments [25], including the goal of the ongoing tasks,involved actors, and utilized resources such as documents or services [34]. Activitiesare either assigned from the outside of a community, e.g., belonging to a higher-levelbusiness process, or emerge by identifying collaboration opportunities.

To achieve the goals of activities, the PVC members use SOA technologyto interact in the context of the currently performed activities. In the depictedscenario, we the concept of Human-Provided Services (HPS) [39] and the HPSframework [34] is used is used to allow human participation in a service-orientedmanner. That is, humans can provide their capabilities and skills as services toenable human interactions through a standardized message exchange format (i.e.,SOAP). Instead of implementing services in software, services are provided byhuman actors. The novelty of the approach is that HPS enables a seamless service-oriented infrastructure consisting of human and software services. At the technicallevel, all messages including SOAP-based messages are logged for later analysis ofcommunities and broker discovery.

Relations emerge from interactions as illustrated in Fig. 5.3b and are boundto particular scopes. We model the interaction context with tags and keywordsand compose similar activities to build trust scopes. To infer trust, we aggregateinteractions that occurred in a pre-defined scope, calculate metrics (numeric valuesdescribing prior interaction behavior), and interpret them in terms of reliability,dependability and success. In the given scenario, a scope comprises trust relationsbetween PVC members regarding help and support in different expertise areas(reflected by tags of exchanged messages). Through analyzing the interactioncontext (i.e., using message tags), we determine a user’s centers of interests.Frequently used keywords are stored in the actors’ profiles (see symbol P) and laterused to determine their interests and expertise areas.

5.4.2 Brokerage and Composition

Consider a scenario in the given PVC in Fig. 5.3b where u wants to set up an activitythat requires at least one additional expert from the brown u,v,w and blue domainj,k, l,m. Since u personally knows v and w from previous collaborations (reflectedby social relations), u is well-connected to the brown expertise area; but u does notknow any member of the blue domain. However, in the scenario u collaboratedwith b in the green domain, who is connected to j. Hence, b could act as a brokerand forward requests or invitations to join u’s current activity to j. We argue thatestablishing personal contacts in socially-based environments is of high importancecompared to the traditional SOA domain, where services are mostly composedbased on their properties (i.e., features and QoS) only.

Interaction mining techniques support the discovery of emerging social relations.These relations have major impact on future collaborations such as:


• Supporting the Formation of Expert Groups. Successful previous formations ofactors should by “recorded” to actively facilitate future collaborations. Thus,tight trust relations based on interactions can be converted to social relations.

• Controlling Interactions and Delegations. Discovery and interactions betweenmembers can be based on social relations. People tend to favor help and supportrequests from well-known members instead of receiving requests from any third(personally unknown) parties.

• Establishment of new Social Relations. The emergence of new personal relationsis actively facilitated through brokers. The introduction of new partners throughbrokers (e.g., b introduces u and j to each other) leads to future trustworthycompositions.

5.4.3 Broker Patterns and Policies

Brokers as outlined in the beginning of Sect. 5.4 differ from the other actors inthe environment by their mediation capabilities. A broker acts as an intermediarybetween two previously separated collaboration teams. It is thus essential thatit gathers frequently demanded contacts and maintains its relations. However, ifdemand decreases the broker must find and establish new relations.

Broker Patterns The remainder of this section describes the broker patterns toestablish connections between requesters and third-party services. Brokers cantypically have different behavior when delegating requests. In particular, as shownin Fig. 5.4, we distinguish:

• Persistent Exogenous Interaction Pattern (Fig. 5.4a) where any requests andresponses are forwarded by the broker and the actually interacting nodes areshielded from each other.

• Triadic Exogenous Interaction Pattern (Fig. 5.4b) where the broker encouragesreceivers of messages to establish direct connections to the initially requestingnodes, and therefore, actively facilitates the emergence of trust relations.

b

u

wv

j k

lm

I. II.

III.IV.

b

u

wv

k

lm

I. II.

III.j

a b

Fig. 5.4 Exogenous broker behavior patterns. (a) Persistent interaction pattern. (b) Triadicinteraction pattern

5.5 SBQL Syntax 105

We argue that both types of interaction patterns are applied in today’s social andcollaborative environments. The broker may favor one over the other pattern dueto various reasons. On the one side, controlling the flow of interactions betweenpersonally unknown actors can strengthen a broker’s reputation. On the other side,establishing direct relations can significantly reduce a broker’s working load.

Behavior Policies Policies along with the queries provide general or explicit rule-sets to define expectations on path requests and responses. While the queries in thepresented scenario are limited to more static lookup properties such as, competencesand knowledge fields a policy-driven management of the interactions also allowsto consider the relation between a query and the current context. A request policylimits the request and result by an according rule-set. Rules include a condition partpossibly matching both current query and context and a decision part that filtersthe results. Thus, a request triggered policy could state that a requester wishes,e.g., a maximum/minimum path length to the knowledge source, a minimum trustrelation from her/his side to the broker or also from the broker to the expert,hard completion deadlines for the subsequently delegated work, etc. Thereby, rulesor entire policies can by mandatory or depending on the importance and contextdesirable for the issued query. Response policies at the broker comprise filter rulesfor the original query’s response paths. Depending on the situation the broker mightwant to adapt the response. The history of past interactions with the requester,contracts, or independent from those, current changes and events, e.g., interestshifts at or availability of the requested experts require dynamic adaptations ofthe response paths. These dynamics might also require that the policies’ rules areupdated accordingly. In order to solve the challenge of updating these situationdependent policies means of querying the network structure are necessary.

Our proposed Social Broker Query Language (SBQL) helps to solve the issuesby providing a syntax that can find paths to resources connected to query and policyconstrains. We introduce the technical details of SBQl in the following section.

5.5 SBQL Syntax

In this section we define the key elements of the SBQL. The language is inspired byan SQL-like syntax. It is important to note that SBQL operates on a graph structurecomposed of a set of nodes and edges.

• Select: A Select statement retrieves nodes and edges in graph G as well asaggregates of graph properties (for example, properties of a set of nodes).

• From: While traditional relational databases operate on tables, SBQL uses theFrom clause to perform queries on a graph G.

• Where: A Where clause specifies filters and policies upon nodes, edges, andpaths.


Table 5.1 Important SBQL language elements

Element Description

satisfy Requires that a given condition is fulfilled by a set of nodes or edges

as Creates an alias for groupings of nodes, edges, or paths

<all> Retain all nodes/edges/subgraphs satisfying a given condition

[ ] An expression to satisfy conditions for exactly one [1], one to m[1..m], or one to many [1..*] nodes or edges

1 Input: Graph G, var source = n1,n2, . . . ,ni,2 var target = n j ,n j+1, . . . ,n j+m3 Output: List of brokers45 Select node From (6 ( Select distinct(node) From G7 Where8 /* At least one in source ‘knows’ node */9 ( [1..*] n in source ) satisfy

10 Path (n to node) as P1 With P1.length = 1 ) as G1,11 ( target ) as G212 )13 Where14 /* Retain all nodes that satisfy path filter */15 ( <all> n in G1.nodes ) satisfy16 /* Path to any in G2.nodes */17 Path (n to [1..*] G2.nodes) as P2 With P2.length = 118 and19 /* Retain all edges that satisfy edge filter */20 ( <all> e in G1.edges ) satisfy21 (e.relation = EPredicates.BIDIRECTIONAL) and22 (e.trust >= MTrust.MEDIUM)2324 Order by node

i

h

g

... ...u

w

v j

m

k

l

d

f

e

b3

b1 ...

...

...

......

...

...

...b2

Fig. 5.5 SBQL query: find broker to connect two predefined communities

Note that SBQL is not a general purpose graph (social network) query languagebut rather a domain specific language to discover and rank brokers in PVCs aswell as social and collaborative networks. Table 5.1 lists important SBQL languageelements to query and filter graphs.

To give intuitive examples, we present a set of SBQL queries along with theirmeaning considering a graph G and a set of subgraphs G′ ⊆ G. We structurediscussions related to a query into four essential steps:

• R the basic requirements/goal of a query• A the approach that is taken• O the output of the query• D the detailed description of the query

5.5.1 Connecting Predefined Communities

Consider two initially disconnected communities (sets of nodes in Fig. 5.5)depicted as variables var source = n1,n2, . . . ,ni and var target =nj,nj+1, . . . ,nj+m residing in the graph G.

5.5 SBQL Syntax 107

R1: The goal is to find a broker connecting disjoint sets of nodes (i.e., not havingany direct links between each other).

A1: Two subgraphs G1 and G2 are created to determine brokers which connectthe source community u,v,w with the target community g,h, i (i.e., see Fromconstruct).

O1: The output of the query is (the example shown in Fig. 5.5) a list of brokersconnecting u,v,w and g,h, i. The lines 1–3 specify the input/output parametersof the query.

D1: As a first step, a (sub)select is performed using the statement as shown by thelines 6–10. The statement distinct(node) means that a set of unique brokersshall be selected based on the condition denoted as the Where clause with a filter(lines 9–10).

The term “[1..*] n in source”, where source is the set of nodespassed to the query as input argument, means that at least one node n ∈ G mustsatisfy the subsequent condition. Here the condition is that the node n has a link(i.e., through knows relations) to the source set of nodes. This is accomplishedby using the Path function that checks whether a link between two nodes exists(the argument “(n to node)”). The path alias is used to specify additionalconstraints such as the maximum path length between nodes (here “P1 WithP1.length = 1”). The second step is to create an alias G2 for the targetcommunity g,h, i.

By using the aliases G1 (line 10) and G2 (line 11) further filtering can beperformed using the Where clause in line 13. The same syntax is used as previouslyin the sub-select statement (lines 9–10). The construct <all> retains nodes “n inG1.nodes” (G1 holding the set of candidate brokers) that are connected to at leastone node in the target community G2 with direct links (“P2 with P2.length= 1”). Further filtering is performed by defining lines 18–20. Here, the targetcommunity g,h, i must have edges between each other that are bidirectional. Inour graph representation, this means that each relation has to be interpreted as, forexample, h knows g and g knows h. A set of different metrics is established in oursystem. A specific type of metric (e.g., trust) is denoted by the namespace MTrust.In the specified query, each actor in the target community must share a minimumlevel of trust depicted as “e.trust >= MTrust.MEDIUM”. Trust metrics areassociated to edges between actors. The term MTrust.MEDIUM is establishedbased on mining data to obtain linguistic representations by mapping discrete values(metrics) into meaningful intervals of trust levels.

The last statement “Order by node” in Fig. 5.5 implies a ranking procedureof brokers. This can be accomplished by using eigenvector methods in socialnetworks such as the PageRank algorithm [30] to establish authority scores (theimportance or social standing of a node in the network). The detailed mechanismsof this procedure will be discussed in the following section.


Fig. 5.6 SBQL query: find ranked communities based on search criteria and load metrics

5.5.2 Finding Communities

The broker discovery example in the previous section (depicted by Fig. 5.5) isstraightforward because the target community is already specified and passed tothe query as var target = nj,nj+1, . . . ,nj+m. However, in most cases (ashighlighted in the introduction example) the target community may not be knownbeforehand. The next example query eliminates this assumption by showing anapproach to find suitable communities based on search criteria (e.g., activity or skilltags).

R2: The goal of the query as specified in Fig. 5.6 is to find sub-communities (orsubgraphs) in G that match search criteria.

A2: Search is performed by using a set of distinct tags specified as inputparameter var search = t1, t2, . . . , tn.

O2: The output of the query is a list of communities.D2: The first step is to perform a (sub)select of distinct communities (see

distinct(nodes) as G’ in line 5) to obtain non-overlapping groups ofcommunity members specified by the lines 5–14. For example, Fig. 5.6 shows fourgroups of nodes [d,e, f ,g,h, i,l,m, j,k,u,v,w] each of them satisfying theconstraints specified in the query. Each node in a specific community must be linkedto at least one community member (“Path (n to [1..*] G’.nodes) asP1”). Furthermore, at least one path between nodes with “length = 1” satisfy-ing trust requirements (MTrust.HIGH) must exist in order to consider a node asa community member. Finally, a path must contain the tags specified by the searchquery (lines 11–12) to ensure that a member has interacted (collaborated) with othermembers in the context of certain activities.

The alias SG1 provides access to each community. The Where clause appliesfiltering of communities based on load conditions measured by graph metrics(GMLoad). For example, load conditions G’.load are measured by the numberof inbound requests and the number of pending tasks within the community.

5.6 Broker Ranking 109

1 Input: Graph G, var source = n1,n2, . . . ,ni,2 var search = t1, t2, . . . ,tn3 Output: List of brokers and communities45 Select node, nodes from (6 /* Select brokers */7 ( /* ... */ ) as G1,8 /* Select communities */9 ( /* ... */ ) as SG1

10 )11 Where12 ( <all> n in G1.nodes ) satisfy13 /* To one in SG1 */14 Path (n to [1] SG1) as P1 With P1.length = 11516 Order by node

i

h

g

... ...u

w

v j

m

k

l

d

f

e

b3

b1 ...

...

...

......

...

...

...b2

Fig. 5.7 SBQL query: find exclusive brokers to connect two communities

5.5.3 Finding Exclusive Brokers

The final SBQL example is depicted by Fig. 5.7 to combine previously introducedconcepts for broker discovery.

R3: The basic idea of this example is to find brokers that are connected to exactlyone candidate (target) community.

A3: Communities are retrieved along with brokers. Filtering is applied based onpaths to obtain exclusive brokers.

O3: The output of the query are brokers along with communities they areconnected to (e.g., b1, d,e, f ).

D3: First, a set of candidate brokers is retrieved and made available via the aliasG1 (line 6). This is the same procedure as introduced before (see Fig. 5.5). Second,communities are retrieved and stored in SG1 (line 8). Again, this is based on thesame principle as introduced previously in Fig. 5.6. We call brokers connectingexactly one community exclusive brokers. This is accomplished by the statements in11–13 demanding for “n to [1] SG1”. The broker b2 is a non-exclusive brokerbecause it connects multiple communities d,e, f and g,h, i, thereby makingg,h, i unreachable from the u,v,w community perspective.

5.6 Broker Ranking

5.6.1 Community Profiles

In contrast to common top-down approaches that apply taxonomies and ontologiesto define certain interest and expertise areas, we follow an interest mining approachthat addresses the inherent dynamics of flexible collaboration environments. Skillsand expertise as well as interests change over time, but are rarely updated ifthey are managed manually in a registry. Hence, we determine and update themautomatically through mining. As discussed before, interactions, e.g., delegation


t1 t1 t1 . . .Pc1 fc1(t1) fc1(t2) fc4(t3) . . .Pc2 fc2(t1) fc2(t2) fc4(t3) . . .Pc3 fc3(t1) fc3(t2) fc4(t3) . . .Pc4 fc4(t1) fc4(t2) fc4(t3) . . ....

......

.... . .

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11

0.75

0.5

0.25

0.1

0.0

1.0

tag

vect

or

sim

ilarit

y

T1 T2 T3 T4

T9

T5 T6

T7 T8

a b

Fig. 5.8 The concept of hierarchical tag clustering: (a) tag matrix to determine the co-occurrence(and thus potential similarity) of tags; (b) clustering of tag vectors with varying similaritythresholds creates a tag tree

of tasks or requests, are tagged with keywords. As delegation receivers processtasks, our system is able to learn how well people cope with certain tagged tasks;and therefore, able to determine their centers of interests. We calculate communityprofiles by aggregating individual (tag-based) interest profiles.

The community profile Pc in Eq. (5.1) describes the frequencies fc of the tagsT = t1, t2, t3 . . . that are applied in collaborations in a community c.

Pc = 〈fc(t1), fc(t2), fc(t3) . . . 〉 (5.1)

Combining multiple community profiles leads to a tag matrix as shown inFig. 5.8. Tag vectors tx describe the usage of a certain tag tx from a globalperspective, i.e., spanning all communities ci. A common assumption is that co-occurrence of tags reflect their similarity and closeness respectively [22, 43]. Wecluster tags based on their similarity. Clustering tags has two major advantages:

• When someone is searching for a particular tag, all similar tags (e.g., synonymsin the same cluster) can be considered in the search process as well to increasethe number of results.

• Since our ranking approach uses personalization techniques when searching forbrokers, we can significantly reduce the time effort by pre-calculating topic-basedbroker importance (see later for details).

By comparing the similarity of tag vectors, e.g., using cosine similarity, withvarying thresholds, a tree structure is created as depicted in Fig. 5.8. That treereflects the closeness of single tags and created clusters. For instance tags t1 and t2are merged in cluster T1. Details are described in [44]. The benefit of this approachin the context of SBQL is that fuzzy matches of communities and/or brokers arepossible and topic-based broker importance scores can be calculated at various topiclevels (at different levels in the hierarchical topic/tag tree).


5.6.2 Trust Weights

Interactions such as delegations are aggregated to metrics that are interpreted byrules to infer trust. Trust scores are associated to edges (i.e., e.trust) and mappedto trust intervals (e.g., MTrust.MEDIUM). To calculate metrics, the edge weightwuv can be interpreted as how much u trusts v in processing tasks or help and supportrequests in a reliable manner [Eq. (5.2)]. Specifically, experts’ behavior in termsof reliability and task processing successes, are periodically updated with recentcaptured interaction data.

wuv ≡ succ. delegations from u to v

∑z∈N(u) succ. delegations from u to z(5.2)

Reliability and processing success (and thus social trust) of tasks/delegations arebased on a task rewarding schema. Let us assume a human task ht. The task hasstates such as accepted, inprogress, finished or aborted. Rewards are automaticallyassociated with ht to measure the degree of success. For example, fast and reliableprocessing of tasks yields higher rewards, thereby resulting in higher trust in acollaboration partner.

To model task rewards based on temporal task properties (processing time), weuse a mathematical function belonging to the family of sigmoid functions with thegeneral form f (t) = 1

1+e−t (see Fig. 5.9). Sigmoid functions are typically used tomodel systems that saturate at large values of t, for example, the processing time oftasks. Let us define the essential properties of the model in the following. The taskrewarding function RW based on the task processing time PT(ht) for a given task htis defined as follows:

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1100

80

60

40

20

0

Task Progression

a b

Rew

ard

M1M2M3

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1100

80

60

40

20

0

Task Progression

Rew

ard

M1M2M3

Fig. 5.9 Task rewarding model. (a) Initial rewarding model. (b) Refined rewarding model


Table 5.2 Rewarding model and related parameters

Symbol Description

RW(PT(ht)) Rewarding function based on the task processing time PT(ht). The output ofthe function is a percentage value between 0 % and 100 %. 100 % is given ifthe task is processed very fast. 0 % reward is given is given if the task expiresand thus fails

ψ Saturation of RW : [0,τ ]→ [0,ψ], ψ ∈ [0,1]

σ Parameter to define the horizontal displacement of RW

δ Parameter to define the “steepness” of RW’s slope. Steepness means that thereward changes more quickly depending on task processing time

M Depicts a rewarding model for different parameters

RW(PT(ht)) =ψ

1+EXP(−PT(ht)−σδ

)

(5.3)

A description of the model’s parameters is given in Table 5.2:Figure 5.9a visualizes the output of RW depending on different parameters

(Table 5.2). The horizontal axis shows the task progression in terms of processingtime. Progression 0 means that the task has not yet started. Progression 1 meansthat the task has expired (maximum processing time has been reached). Thus, areward is given between 0 < PT(ht)/MAX(ht) < 1. The function MAX(ht) returnsthe maximum processing time of ht.

The basic idea is to use different models (e.g., M1,M2,M3) to account forthe risk that a particular type of task will not be processed in a timely manner.Risk is automatically calculated based on finished versus aborted tasks within thecommunity (the parameter δ in Table 5.2). The task rewarding function RW shouldfall less steeply if a particular type of task tends to be aborted by the community.To model risk for the task progression spectrum that is based on the task processingtime, RW needs to be refined as a stepwise function RW ′:

RW ′(PT(ht)) =

⎧⎨

⎩RW(PT(ht)) , if

PT(ht)MAX(ht)

< 0.5

RW(PT(ht);M) , otherwise(5.4)

Figure 5.9b shows RW ′. Progression (based on processing time) towards aparticular point (see 0.5 on horizontal axis) results in equal rewards regardless of themodel (M1,M2,M3). Beyond this point, tasks are differently rewarded dependingon the risk modeled by a given model M. For example, given M3 that models taskswith higher risks, higher rewards are given because a successfully processed taskbecomes more valuable for the task creator. The detailed calculation of the riskfactor in the model (the parameter δ ) is not explained in detail in this work. Thedetailed mechanism is explained in [33, pp. 85].

The benefit of this approach is that the rewarding function RW undergoes aself-configuration process by selecting a particular model M automatically based


on monitored interactions. For example, if many tasks are aborted and the risk ofunsuccessful task processing increases, also the reward function RW can be updatedand a different model M is selected.

5.6.3 Broker Importance

Here we introduce our broker ranking approach. As mentioned before, the goal ofbroker importance ranking is to implement the SBQL statement Order by (cf.Fig. 5.5). The basic idea of the approach is derived from the concept of hubs andauthorities wherein the hub importance of a node in a network is influenced bythe authority of the nodes the hub is connected to. This method is also knownas the concept Hyperlink-Induced Topic Search (HITS) as introduced by [20]. Anode’s authority is influenced by the hub importance of the node’s neighbors. Letus denote the importance of broker b as B(b) and the importance of community c asC(c). Communities are sets of nodes that are matched and grouped using SBQL. Acommunity/broker graph can be depicted as a graph G(V,E) where V depicts the setof nodes that may be either brokers or communities and E the set of edges to depictlinks between brokers and communities. Utilizing the idea of hubs and authorities,the broker and community importance is defined as follows:

B(b) = ∑c∈N(b)

C(c) C(c) = ∑b∈N(c)

B(b) (5.5)

The set of neighbors (either community or brokers neighbors) is depicted by N(c)for the set of brokers that are connected to c and N(b) for the set of communitiesthat are connected to broker b. Hence, b’s importance is directly influenced by theimportance of the community it is connected to. Similarly, the importance of c isbased upon the importance of brokers that are attached to c.

We make some important extensions to the basic model:

• We expand the models for B(b) and C(c) towards a PageRank-like model that ismore robust with regards to rank stability to small perturbations (see also [28])and can be personalized to topics of interest.

• Personalization relates to the previously discussed hierarchical tag clusteringsince broker or community importance scores can be calculate for certain interestareas (clusters).

• We additionally use broker and community weights. A broker may not equallyengage in all communities and thus weights propagate importance scores basedon the strength of the broker’s community involvement.

The resulting model is depicted by the following two equations (see alsoTable 5.3):


Table 5.3 Description of symbols

Symbol Description

p(b;T) The topic-sensitive broker personalization vector for topic T. The parameterλb is used to balance between personalization and “network importance”

p(c;T) The topic-sensitive community personalization vector for topic T. Similarly,the parameter λc balances between personalization and network importance

wTbc The topic-based broker community weight that is based on how much b

engages in c (e.g., number of interactions between b and c)

B(b;T) = (1−λb)p(b;T)+λb ∑c∈N(b)

wTbcC(c;T) (5.6)

C(c;T) = (1−λc)p(c;T)+λc ∑b∈N(c)

wTbcB(b;T) (5.7)

The broker and community scores, B(b;T) and C(c;T) respectively, can becalculated at various levels in the hierarchical tree depending on the topic ofinterest. However, it is important to note that the topic of interest depends on theactual query and its associated keywords. Therefore, these scores would need to becalculated for every query that is used to discover and rank brokers. For large socialand collaborative networks such an approach is not feasible due to computationalcomplexity. For example, for a large network computation of broker scores couldtalk some days or even weeks (depending on hardware resources). Clearly, thesescores need to be computed in an offline manner.

The ultimate goal is that topic-sensitive importance scores are computed offlineand at query time aggregated into a composite ranking score. We propose the PageR-ank linearity theorem to solve the problem of topic-sensitive broker importanceranking. The linearity theorem [16] is defined as:

Theorem 5.1 (Linearity). For any personalization vectors p1,p2 and weightsw1,w2 with w1 +w2 = 1, the following equality holds:

PPV(w1p1 +w2p2) = w1PPV(p1)+w2PPV(p2) (5.8)

The above equality states that personalized PageRank vectors PPV can becomposed as the weighted sum of PageRank vectors. To utilize the theorem, we needto arrive at a different mathematical representation of the broker importance B(b;T).In its current form the linearity theorem cannot be applied. First, we substituteC(c;T) [see Eq. (5.7)] in B(b;T) [see Eq. (5.6)] so that we have:

B(b;T) =(1−λb)p(b;T)+λb(1−λc) ∑c∈N(b)

wTbcp(c;T) (5.9)

+λbλc ∑c∈N(b)

∑b′∈N(c)

wTbcwT

b′cB(b′;Q) (5.10)

5.7 Evaluation 115

Let us define the personalization vector p′(b;T) as follows:

p′(b;T) =(1−λb)

(1−λc)p(b;T)+λb ∑

c∈N(b)

wTbcp(c;T) (5.11)

The personalization vector p′(b;T) can be simplified and rewritten if the λparameters are set to λb = λc:

p′(b;T) = p(b;T)+λ ∑c∈N(b)

wTbcp(c;T) (5.12)

The personalization vector in Eq. (5.12) has two components: p(b;T) shows thetopic specific personalization for broker b and ∑c∈N(b)wT

bcp(c;T) the topic-basedpersonalization of each community the broker b is connected to. Equation (5.13)shows B(b;T) using p′(b;T).

B(b;T) = (1−λ )p′(b;T)+λ 2 ∑c∈N(b)

∑b′∈N(c)

wTbcwT

b′cB(b′;T) (5.13)

Equation (5.13) has a PageRank-like structure and thus the linear theorem canbe applied. The following Eq. (5.14) shows the query-based aggregation of brokerimportance scores B(b;Q).

B(b;Q) = w1B(b;T1)+w2B(b;T2) with Q = T1,T2 (5.14)

The symbol Q in B(b;Q) depicts the SBQL query that contains the set ofdemanded topics T1 and T2, which are used to match communities and brokers. Theresult of the computation in Eq. (5.14) is a composite broker importance score that isused to rank brokers. Computing composite scores requires only two database readoperations to obtain the topic-based scores and aggregation of the respective scores.This greatly reduces the time needed to rank brokers.

5.7 Evaluation

5.7.1 Overview

We performed two kinds of experiments to evaluate SBQL and its broker importanceranking approach.

• Performance tests to evaluate the suitability of SBQL and its implementationto query data obtained from a real service-based testbed environment. Thetestbed generates instances of services communicating over real WS-stacksand allows for simulation of service behavior (response time, generated faults,


etc.) Our evaluations were gathered using the logging features of the Genesis2framework [17].

• Evaluation of the broker importance ranking approach using real data from avirtual collaboration environment. We used data from the ICT research projectshaving received grants under the EU’s Seventh Framework Programme (FP7)[27]. The data covers a period from 2007 to 2011. Research projects havemultiple partners and an organization can be the partner of multiple projects.

These experiments help to evaluate both efficiency of SBQL and quality withrespect to broker ranking. First, the performance results are discussed and secondthe broker ranking results.

5.7.2 Performance Tests

Here we discuss results related to SBQL performance tests obtained by using theGenesis2 service testbed. The setup is described in the following.

• Testbed environment. Genesis2 has a management interface and a controllableruntime to deploy, simulate, and evaluate SOA designs and implementations. Acollection of extensible elements for these environments are available such asmodels of services, clients, registries, and other SOA components. Each elementcan be set up individually with its own behavior, and steered during execution ofa test case.

• Backend deployment. For the experiments in this work we deployed Genesis2Backends to the Amazon Elastic Compute Cloud. We launched depending onthe amount of involved services instances of two or three Community AMIs ofthe type High-Memory Extra Large Instance (17.1 GB of memory) running aLinux OS. In the following we provided each instance with the same Genesis2Backend snapshot via mountable volumes from the Elastic Block Store. Finally,we deployed the following environment setup from a local Genesis2 Frontend. Itincluded SOA-based PVCs established by Genesis2 Web services equipped withsimulated behavior and predefined relations to provide communication channelsand instantiate online communities.

Services act like HPSs when delegating each other new tasks, processingtasks, re-delegating existing tasks, or reporting tasks’ progress status. In otherwords, by following the HPS concept, services are provided by human actorsand thereby exhibit human behavior. Tasks are not delegated arbitrarily but mustmatch the receivers capabilities. Therefore, they are tagged by three keywordsone of which must match the picked receivers capabilities. As an intermediate,a broker combines capabilities of the two communities it connects. The brokeravoids task processing and only forwards tasks. The finally deployed environ-ments are variable in number of services, number of participants per group(2–5 services) and consequently also in number of communities and required

5.7 Evaluation 117

brokers that connect at least each community with another. Task processing anddelegation decisions happen individually and in random time intervals (1–8 s).

• Client setup. We simulated environments with different numbers of nodesand interactions to obtain insights in performance aspects. SBQL tools andrelated graph libraries have been implemented in state-of-the-art technologyusing .NET/C# and have been deployed on our lab-based server infrastructure.The blade servers are equipped with 3.2 GHz quad core CPUs and 10 GB RAM.Interaction logs are managed by MySQL databases. A client request pool iscreated on a separate machine (Intel Core2 Duo CPU 2.50 GHz, 4 GB RAM)to perform parallel invocations of the SBQL query Web service. Clients areconnected with the server via a local 100 MBit Ethernet.

We performed several experiments to test the performance of our SBQL imple-mentation under varying characteristics such as number of nodes and groups.The results are summarized in the following. We performed a set of concurrentqueries (50 concurrent queries) time by launching multiple threads. For each loadexperiment, the total number of requests was 100 requests to be processed. Byprocessing a larger amount of requests, say 200, the total processing time linearlyincreases with the number of requests.

We increased the number of nodes and interactions to understand the scalabilityof SBQL under different conditions. The first experiment comprises 198 nodes, 200edges, and a total number of ten distinct tags applied to interactions between nodes.In experiment 2, we simulated 579 nodes, experiment 3 comprising 774 nodes,and experiment 4 with 1029 nodes in the tested. HPSs in the testbed have beendeployed equally on multiple hosts, e.g., three cloud hosts in experiment 4 to achievescalability. The cloud deployments have been done on Amazon’s cloud. The SBQLprocessing time for this environment is shown in Fig. 5.10.

To compare the experiments 1–4, we query the graph data using the querykeywords “Robustness” and “Logging” to obtain a set of matching brokers that areable to broker tasks related to those keywords. Increasing the number of nodes by

1 2 3 4

MIN

AVG

MAX

Experiment ID

Pro

cess

ing

Tim

e

0

20.000

40.000

60.000

80.000

100.000

Fig. 5.10 SBQL processing time (in milliseconds)


Table 5.4 SBQL queries using the data from experiment 4, number of discovered brokersand MIN processing time

ID SBQL query keywords Num. brokers MIN time

Q1 “Robustness”, “Log” 105 3993

Q2 “Robustness”, “Log”, “DB”, “Testbed” 134 3666

Q3 “Robustness”, “Log”, “DB”, “Testbed”, “Similarity” 146 3478

a factor of ≈ 3 (see the experiments 1 and 2), the processing time of SBQL queriesincreases by 30 %. Comparing the experiments 2 and 3 (node addition of ≈ 30%),the processing time increases by a factor of 10 %. By comparing the experiments 3and 4 (node addition of ≈ 30%), the SBQL processing time increases by a factor of20 %. These experiments show that SBQL scales with larger testbed environmentslinearly.

To test the effect of using various query keywords, we used different keywordcombinations as shown in Table 5.4. The first column shows the query ID, secondthe SBQL query keywords are shown, third the number of brokers are depicted, andforth the minimum SBQL processing time in milliseconds is shown.

The number of discovered brokers increases if multiple keywords are specified.However, the average SBQL processing time is not significantly influenced by thenumber of used keywords.

5.7.3 Ranking Experiments

The second set of experiments is based on research projects and partners havingreceived grants under the EU’s Seventh Framework Programme (FP7). Detailedstatistical information regarding the ICT community is described in [27] and coversa period from 2007 to 2011. Research projects have multiple partners and anorganization can be the partner of multiple projects. In prior research, we have usedthis dataset already to rank the importance of organizations using novel link miningtechniques [37]. In contrast, here we focus on broker discovery issues.

A total of 4747 organizations participated in the program contributing to atotal number of 1451 projects. There are 107 distinct strategic objectives, whichwill be regarded as communities in our experiment. Objectives are, for exam-ple, “Embedded Systems Design” and “ICT for Environmental Management andEnergy Efficiency”. Let us suppose that a broker shall be found to connect thesetwo communities. Notice in this context, for all broker discovery and rankingexperiments we selected 2 to 3 communities randomly. An organization needsto have performed projects in both “Embedded Systems Design” and “ICT forEnvironmental Management and Energy Efficiency” to qualify as a broker. Wefirst match all relevant organizations that qualify as brokers to connect these

5.7 Evaluation 119

Table 5.5 Matched and top-10 ranked brokers for “ICT for Environmental Management andEnergy Efficiency” and “Embedded Systems Design”

Rank Organization (broker) PC Score

1 FRAUNHOFER-GESELLSCHAFT ZUR FOERDERUNG DERANGEWANDTEN FORSCHUNG E.V

4 0.1928

2 ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE 6 0.1395

3 POLITECNICO DI MILANO 5 0.1268

4 COMMISSARIAT A L ENERGIE ATOMIQUE ET AUX ENERGIESALTERNATIVES

6 0.1225

5 UNIVERSITY OF SOUTHAMPTON 2 0.1015

6 INSTITUTE OF COMMUNICATION AND COMPUTER SYSTEMS 2 0.0994

7 UNIVERSITEIT TWENTE 2 0.0889

8 IMPERIAL COLLEGE OF SCIENCE, TECHNOLOGY ANDMEDICINE

3 0.0883

9 TECHNISCHE UNIVERSITAET GRAZ 4 0.0771

10 DEUTSCHES ZENTRUM FUER LUFT - UND RAUMFAHRT EV 2 0.0661

communities and then perform ranking using the introduced link based brokerimportance model [cf. Eq. (5.13)].

The parameter λ is set to λ = 0.85 (this is the suggested value by thePageRank model [30]), the broker personalization vector is assigned uniformly top′(b;T) = 1

numOrgs where numOrgs depicts the number of organizations (in this

case numOrgs = 4747), and the weight wTbc = 1

|N(b)| is based on the number of

communities |N(b)| (i.e., the number of neighbors in the graph) the broker b isconnected to.

The top-10 ranking results are detailed in Table 5.5. The first column shows therank (1. . . 10), the second column shows the organizations’ names, the third columnshows the number of projects (project count PC) that the organization has performedin the two objectives (“Embedded Systems Design” or “ICT for EnvironmentalManagement and Energy Efficiency”), and the last column shows the numericalbroker ranking score. The brokers are sorted according to the ranking score fromhigh to low.

In addition to the description in Table 5.5, Fig. 5.11 visualizes all brokersconnecting the two communities. The node size of the top-10 ranked brokersis based on their rank. The organization Fraunhofer-Gesellschaft has in general(within the ICT framework) the largest number of projects and receives the highestamount of funding. Since it is involved in many project, Fraunhofer also has thehighest importance because it is connected to many communities. With regardsto the topic-sensitive results, it is specifically involved in four projects relevantto the demanded objectives “ICT for Environmental Management and EnergyEfficiency” and “Embedded Systems Design”. The combination of high reputationand involvement in relevant communities makes Fraunhofer the top-ranked broker.The second ranked broker is Ecole Polytechnique Federale de Lausanne (EPFL)


INSTITUTE OF COMMUNICATION

ANDCOMPUTERSYSTEMS

UNIVERSITY OF SOUTHAMPTON

AIT Austrian Institute of Technology

GmbHEmbedded

Systems Design

ELECTRICITEDE FRANCE

S.A.

ICT for EnvironmentalManagementand Energy Efficiency

UNIVERSITEITVAN

AMSTERDAMTECHNISCHE

UNIVERSITAETGRAZ

DEUTSCHESZENTRUM FUER

LUFT − UND RAUMFAHRT EV

IMPERIALCOLLEGE OF

SCIENCE,TECHNOLOGYAND MEDICINE

INFINEONTECHNOLOGIES

AG

THEUNIVERSITY OF MANCHESTER

UNIVERSITADEGLI STUDI DI

PADOVA

FUNDACIONTECNALIA

RESEARCH & INNOVATION

UNIVERSITEITTWENTE

COMMISSARIATA L ENERGIE ATOMIQUE ET AUX ENERGIES ALTERNATIVES

FRAUNHOFER−GESELLSCHAFTZUR

FOERDERUNGDER

ANGEWANDTENFORSCHUNG

E.V

EUTEMATECHNOLOGYMANAGEMENT

GMBH

LANTIQDeutschland

GmbH

POLITECNICODI MILANO

ECOLEPOLYTECHNIQUE

FEDERALE DE LAUSANNE

GMVAEROSPACE

AND DEFENCE SA

UNIPERSONAL

Fig. 5.11 Ranked brokers connecting “ICT for Environmental Management and Energy Effi-ciency” and “Embedded Systems Design”

Table 5.6 Matched and ranked brokers for “Flexible, Organic and Large Area Electronicsand Photonics” and “Embodied Intelligence”


1 CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE 2 0.1398

2 IMPERIAL COLLEGE OF SCIENCE, TECHNOLOGY ANDMEDICINE

2 0.0883

with many relevant projects with regards to the objectives but with less “global”importance (within the entire ICT framework) than Fraunhofer. Also, Politecnico diMilano is involved in many relevant projects and thus ranks at position 3.

The broker ranking approach delivers the expected results and combines success-fully topic-relevant importance with global network importance.

Another example is shown in Table 5.6 where communities “Flexible, Organicand Large Area Electronics and Photonics” and “Embodied Intelligence” can bereached via two brokers. Both have two projects (one in each objective) and thusqualify as brokers.

A final example is shown by Table 5.7 where brokers need to connect threecommunities. Indeed, the number of discovered brokers varies depending oncommunity (objective) popularity. To conclude the discussion on ranking results,the proposed broker importance model delivers expected results by combiningglobal network importance of organizations with topic-based relevance. Thus, ourapproach is able to (a) identify brokers that formally match the search criteria and(b) rank brokers according to importance.

5.8 Conclusions 121

Table 5.7 Matched and ranked brokers for “ICT for Environmental Management andEnergy Efficiency” and “Intelligent information management” and “Micro/nanosystems”


1 FRAUNHOFER-GESELLSCHAFT ZUR FOERDERUNG DERANGEWANDTEN FORSCHUNG E.V

14 0.1928

2 ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE 9 0.1395

3 POLITECNICO DI MILANO 3 0.1268

4 UNIVERSIDAD POLITECNICA DE MADRID 5 0.1186

5 UNIVERSITY OF SOUTHAMPTON 5 0.1015

6 STIFTELSEN SINTEF 3 0.0813

5.7.4 Lessons Learned

This section provides some additional discussions with regards to lessons learnedand recommendations for theory and practice.

• SBQL provides a rich set of language features to state complex queries forbroker discovery. Distributed queries need to be considered also in future SBQLactivities. Currently it is assumed that logs are collected in a central repositoryand that SBQL queries are executed on a single instance database. In this regard,distributed log repositories and databases should be considered in SBQL queries.

• The performance tests showed that our SBQL implementation offers sufficientperformance for mid- to large-scale environments. In truly open ultra large-scale environments with potentially millions of people and services performanceand scalability need to be revisited. Aspects include distribution of queries andfederation of results.

• The broker importance ranking approach delivers excellent results and allows forfull personalization. The mathematical models and theories, as presented in thiswork, provide mature concepts and the basis for further work on personalizationtechniques.

• Further personalization may include the long-term stability of social relations,for example, or the frequency of changes of collaboration partners. The weightassignment of social relations is an important issue and should be furtheranalyzed by considering various VO datasets.

5.8 Conclusions

This work introduced a number of principles to support service oriented collab-oration in professional virtual communities. The problem in today’s collaborativenetworks is the fragmented nature of expertise areas and boundaries imposed byorganizational structures. Principles found in social network theory are suitablefor assisting in the formation process of socially-based compositions of human


and software services. Specifically, we adopted the well-established theory ofstructural holes to support the formation of such compositions. In this work,brokers help connecting independent communities by forwarding tasks and help andsupport requests to individual community members. Here we proposed a socially-based approach to discover brokers based on interaction mining and monitoringtechniques. Technically, we proposed SBQL to discover suitable brokers based onquery constraints and ranking criteria. SBQL introduces a domain specific languageto construct and execute complex queries specifically for the discovery of brokersin socially-based virtual communities. Here we introduced a broker importancemodel not only to match brokers but also to rank them according to their topic-based and community-wide relevance. Periodically updated metrics are used toweight interaction links and paths between actors. The proposed techniques andtechnical concepts have been implemented and validated through a services testbedand through experiments using real world data.

In our future work we will further work on tool support for broker querymodeling and debugging. Furthermore, we plan to use further additional socialnetwork data sources to perform broker ranking experiments. Also, we plan tosupport the execution of distributed SBQL queries.

References

1. Gautam Ahuja. Collaboration networks, structural holes, and innovation: A longitudinal study.Administrative Science Quarterly, 45(3):425–455, September 2000. ISSN 00018392. doi:10.2307/2667105.

2. Donovan Artz and Yolanda Gil. A survey of trust in computer science and the semantic web.J. Web Sem., 5(2):58–71, 2007.

3. Ashish Agrawal et al. WS-BPEL Extension for People (BPEL4People)., 2007.4. Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in

large social networks: membership, growth, and evolution. In 12th ACM SIGKDD conferenceon knowledge discovery and data mining, KDD ’06, pages 44–54. ACM, 2006.

5. Marco Bertoni, Andreas Larsson, Åsa Ericson, Koteshwar Chirumalla, Tobias Larsson, OlaIsaksson, and Dave Randall. The rise of social product development. IJNVO, 11(2):188–207,2012.

6. Stephen P. Borgatti, Ajay Mehra, Daniel J. Brass, and Giuseppe Labianca. Network analysisin the social sciences. Science, 323(5916):892–895, 2009. doi: 10.1126/science.1165821.

7. Ronald S. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349–399, September 2004.

8. Ronald S. Burt. Brokerage and Closure: An Introduction to Social Capital. Oxford UniversityPress, USA, October 2005.

9. Luis M. Camarinha-Matos and Hamideh Afsarmanesh. Collaborative networks. In PROLA-MAT, pages 26–40, 2006.

10. Catherine Dwyer, Starr Roxanne Hiltz, and Katia Passerini. Trust and privacy concern withinsocial networking sites: A comparison of facebook and myspace. In AMCIS, 2007.

11. Jennifer Golbeck. Trust and nuanced profile similarity in online social networks. TWEB, 3(4),2009.

12. Tyrone Grandison and Morris Sloman. A survey of trust in internet applications. IEEECommunications Surveys and Tutorials, 3(4), 2000.

http://dx.doi.org/10.2307/2667105

http://dx.doi.org/10.1126/science.1165821

References 123

13. R. Guha, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. Propagation of trust anddistrust. In WWW, pages 403–412, 2004.

14. Roger Guimera, Brian Uzzi, Jarrett Spiro, and LuÃNs A. Nunes Amaral. Team assemblymechanisms determine collaboration network structure and team performance. Science, 308(5722):697–702, 2005. doi: 10.1126/science.1106340.

15. Ralf Hartmut Güting. Graphdb: Modeling and querying graphs in databases. In Proceedingsof the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 297–308,San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

16. Taher H. Haveliwala. Topic-sensitive pagerank. In WWW, pages 517–526, 2002.17. Lukasz Juszczyk and Schahram Dustdar. Script-based generation of dynamic testbeds for soa.

In International Conference on Web Services, 2010.18. Eva C. Kasper-Fuehrera and Neal M. Ashkanasy. Communicating trustworthiness and building

trust in interorganizational virtual organizations. Journal of Management, 27(3):235–254, June2001. doi: 10.1177/014920630102700302.

19. Florian Kerschbaum, Jochen Haller, Yücel Karabulut, and Philip Robinson. Pathtrust: A trust-based reputation service for virtual organization formation. In iTrust, pages 193–205, 2006.

20. Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:668–677, 1999.

21. Jon Kleinberg, Siddharth Suri, Éva Tardos, and Tom Wexler. Strategic network formation withstructural holes. ACM Conference on Electronic Commerce, 7(3):1–4, 2008.

22. Xin Li, Lei Guo, and Yihong Eric Zhao. Tag-based social interest discovery. In WWW’08: Proceeding of the 17th international conference on World Wide Web, pages 675–684,New York, NY, USA, 2008. ACM.

23. Tiancheng Lou and Jie Tang. Mining structural hole spanners through information diffusionin social networks. In Proceedings of the 22nd international conference on World Wide Web,WWW ’13, pages 825–836, 2013.

24. Miriam J. Metzger. Privacy, trust, and disclosure: Exploring barriers to electronic commerce.J. Computer-Mediated Communication, 9(4), 2004.

25. Thomas P. Moran, Alex Cozzi, and Stephen P. Farrell. Unified activity management:Supporting people in e-business. Com. of the ACM, 48(12):67–70, 2005.

26. Lik Mui, Mojdeh Mohtashemi, and Ari Halberstadt. A computational model of trust andreputation for e-businesses. In HICSS, page 188, 2002.

27. Filippo Munisteri. ICT statistical report for annual monitoring 2011. http://ec.europa.eu/digital-agenda/sites/digital-agenda/files/stream_2012_0.pdf, February 2012.

28. Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Stable algorithms for link analysis.In Proceedings of the 24th annual international ACM SIGIR conference on Research anddevelopment in information retrieval, SIGIR ’01, pages 258–266, New York, NY, USA, 2001.ACM.

29. Bart Nooteboom, Victor Gilsing, Wim Vanhaverbeke, Geert Duijsters, and Ad Oord. Networkembeddedness and the exploration of novel technologies: Technological distance, betweennesscentrality and density. Technical Report 2006-32, Tilburg University, Center for EconomicResearch, 2006.

30. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citationranking: Bringing order to the web. Technical report, Stanford University, 1998.

31. Joel M. Podolny et al. Resources and relationships: Social networks and mobility in theworkplace. American Sociological Review, 62:673–693, 1997.

32. Royi Ronen and Oded Shmueli. Soql: A language for querying and creating data in socialnetworks. In ICDE, pages 1595–1602, 2009.

33. Daniel Schall. Human Interactions in Mixed Systems - Architecture, Protocols, and Algorithms.PhD thesis, Vienna University of Technology, 2009.

34. Daniel Schall. A human-centric runtime framework for mixed service-oriented systems.Distributed and Parallel Databases, 29(5-6):333–360, 2011.

35. Daniel Schall. Expertise ranking using activity and contextual link measures. Data Knowl.Eng., 71(1):92–113, 2012a. doi: 10.1016/j.datak.2011.08.001.

http://10.1126/science.1106340

http://10.1177/014920630102700302



http://10.1016/j.datak.2011.08.001


36. Daniel Schall. Service Oriented Crowdsourcing: Architecture, Protocols and Algorithms.Springer Briefs in Computer Science. Springer New York, New York, NY, USA, 2012b. doi:10.1007/978-1-4614-5956-9.

37. Daniel Schall. Measuring contextual partner importance in scientific collaboration networks.Journal of Informetrics, 7(3):730–736, 2013a. doi: 10.1016/j.joi.2013.05.003.

38. Daniel Schall. Formation and interaction patterns in social crowdsourcing environments. Int.J. Communication Networks and Distributed Systems, 11(1):42–58, 2013b.

39. Daniel Schall, Hong-Linh Truong, and Schahram Dustdar. Unifying human and softwareservices in web-scale collaborations. IEEE Internet Computing, 12(3):62–68, 2008.

40. Daniel Schall, Florian Skopik, Harald Psaier, and Schahram Dustdar. Bridging socially-enhanced virtual communities. In ACM SAC, pages 792–799, 2011.

41. Diego Serrano, Eleni Stroulia, Denilson Barbosa, and Victor Guana. Sociql: A querylanguage for the socialweb. In Evangelos Kranakis, editor, Advances in Network Analysisand its Applications, volume 18 of Mathematics in Industry, pages 381–406. Springer BerlinHeidelberg, 2013.

42. Ahm Shamsuzzoha, Timo Kankaanpaa, Luis Maia Carneiro, and Ricardo Almeida. Method-ological support to establish a collaborative non-hierarchical business network for complexproduct manufacturing. IJNVO, 11(3/4):363–381, 2012.

43. Andriy Shepitsen, Jonathan Gemmell, Bamshad Mobasher, and Robin Burke. Personalizedrecommendation in social tagging systems using hierarchical clustering. In RecSys ’08: Pro-ceedings of the 2008 ACM conference on Recommender systems, pages 259–266, New York,NY, USA, 2008. ACM.

44. Florian Skopik, Daniel Schall, and Schahram Dustdar. Modeling and mining of dynamic trustin complex service-oriented systems. Information Systems, pages 735–757, 2010. doi: doi:10.1016/j.is.2010.03.001.

45. W3C. Sparql query language for rdf, 2008.46. Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world networks. Nature,

393(6684):409–10, 1998.47. Brian Whitworth and Cheickna Sylla. A social environmental model of socio-technical

performance. IJNVO, 11(1):1–29, 2012.48. Cai-Nicolas Ziegler and Jennifer Golbeck. Investigating interactions of trust and interest

similarity. Decision Support Systems, 43(2):460–475, 2007.

http://10.1007/978-1-4614-5956-9

http://10.1016/j.joi.2013.05.003

doi:10.1016/j.is.2010.03.001

doi:10.1016/j.is.2010.03.001

Chapter 6Conclusion

Online social networks have become an integral part of our daily personal and workrelated activities. People use social networks not only to maintain relationships withfriends but also to communicate, collaborate, and share information. One of the mostprofound properties of social networks is their dynamic nature due to people joiningand leaving networks. This book introduced novel techniques for link formation insocial network based systems.

First, this book introduced link prediction in directed social networks. Theprediction of missing links and the prediction of future links is an important taskin the domain of social network analysis. The former helps to infer the “real” socialnetwork structure while the latter is used to give friendship as well as followingrecommendations to users. A wide range of local, global, and semi-local metricshave been proposed by previous work. A large body of existing literature, however,focuses on undirected networks only. This work closes this gap by focusing ondirected networks. The approach is based on local network information whereinprediction is performed based on node similarity.

The following two chapters introduced link formation through recommendationbased on global network information. Recommendation is based on the concept ofauthority and also structural holes. We proposed a novel follow recommendationapproach that is based on the concept of user authority. Instead of simply matchingusers by static skill profiles, we proposed a network-centric approach taking auser’s community engagement as well as social metrics into account. We havesystematically derived a mathematically sound model to measure user authoritybased on activity (e.g., repository commits) and community reputation (followerdegree). Furthermore, this work introduced various metrics for importance rankingin scientific collaboration environments. We proposed a novel topic-sensitiveauthority model that is based on well-establish ranking techniques. We systemat-ically derived a unified HITS/PageRank-based model that can be fully personalized.The second metric measures organizations’ structural importance based on thenotion of structural holes. In our approach structural importance is computed with


125

126 6 Conclusion

respect to certain topics of interest. Thus, structural importance helps identifyingorganizations that may be valuable partners for strategic alliances. Combined withauthority, this provides a powerful approach for ranking and discovering newpartners. Finally, authority and structural importance are systematically combinedwith cost. For that purpose we utilize AHP to achieve a trade-off among variousranking criteria. The proposed approach delivers very good results and providesmore accurate, topic-sensitive results when compared with other ranking techniques.

The notion of brokers is a hybrid (semi-local) formation approach wherein localinformation is used to constraint which nodes may act as brokers and global infor-mation to rank brokers based on their reputation. Technically, we proposed SBQLto discover suitable brokers based on query constraints and ranking criteria. SBQLintroduces a domain specific language to construct and execute complex queriesspecifically for the discovery of brokers in socially-based virtual communities. Herewe introduced a broker importance model not only to match brokers but also to rankthem according to their topic-based and community-wide relevance. Periodicallyupdated metrics are used to weight interaction links and paths between actors. Theproposed techniques and technical concepts have been implemented and validatedthrough a services testbed and through experiments using real world data.

Future work will focus more on deep learning techniques including adaptiverandom forest models that can be incrementally improved.

social network-based recommender systemspzs.dstu.dp.ua/datamining/social/bibl/social_network...d....

Documents