Yuqiang Han 1 , Yao Wan 1,3 , Liang Chen 2 , Guandong Xu 3 , and Jian Wu 1 1 Zhejiang University, Hangzhou, China 2 Sun Yat-Sen University, Guangzhou, China 3 University of Technology, Sydney, Australia May 25, 2017 Exploiting Geographical Location for Team Formation in Social Coding Sites PAKDD2017 May 23-26, 2017, Jeju, South Korea


Exploiting Geographical Location for Team Formation in Social Coding Sites

Yuqiang Han¹, Yao Wan¹,³, Liang Chen², Guandong Xu³, and Jian Wu¹

¹ Zhejiang University, Hangzhou, China
² Sun Yat-Sen University, Guangzhou, China
³ University of Technology, Sydney, Australia

May 25, 2017

PAKDD 2017, May 23-26, 2017, Jeju, South Korea

Outline

❑ Introduction
    ▪ Background
    ▪ Motivation
    ▪ Challenge

❑ Model
    ▪ Communication Cost
    ▪ Geographical Proximity Cost
    ▪ Combined Cost
    ▪ GA-based Optimization

❑ Dataset

❑ Experiments

❑ Conclusion and Future Work

Background

❖ What is team formation?

Given a task and a set of experts (organized in a network), find the subset of experts that can effectively perform the task.

❖ Applications

✓ Collaboration networks (e.g., scientists, developers)
✓ Organizational structure of companies
✓ Team-based hiring

Background

T = {C++, Python, Graphics, Algorithms}

A: {Algorithms}
B: {C++}
C: {Python, Graphics}
D: {C++, Algorithms}
E: {Python, Graphics, C++}

[Figure: network connecting experts A, B, C, D, and E]

What's the best team we can recommend?
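The question on this slide can be explored with a short sketch. The code below only checks skill coverage for the five experts listed above; the function name `covering_teams` is hypothetical, and the network structure is ignored because the edge weights of this example are not given in the transcript.

```python
from itertools import combinations

# Skills required by the task and skills held by each expert (from the slide).
task = {"C++", "Python", "Graphics", "Algorithms"}
experts = {
    "A": {"Algorithms"},
    "B": {"C++"},
    "C": {"Python", "Graphics"},
    "D": {"C++", "Algorithms"},
    "E": {"Python", "Graphics", "C++"},
}


def covering_teams(task, experts, size):
    """Return all teams of the given size whose pooled skills cover the task."""
    teams = []
    for team in combinations(sorted(experts), size):
        pooled = set().union(*(experts[name] for name in team))
        if task <= pooled:
            teams.append(team)
    return teams


print(covering_teams(task, experts, 1))  # [] : no single expert covers the task
print(covering_teams(task, experts, 2))  # [('A', 'E'), ('C', 'D'), ('D', 'E')]
```

Coverage alone leaves three equally small candidate teams, so some cost defined over the network is needed to rank them.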

Background

❖ What is Social Coding?

It is an approach to software development focusing on effective collaboration.

Background

❖ Geographical proximity

Geographical proximity plays an increasingly important role in many domains, such as knowledge production and technological innovation, despite the rapid development of telecommunications technology.

Motivation

❖ Team formation in social coding sites

➢ Developers (defining the set $V$, with $|V| = n$)
    ➢ Every developer $i$ is associated with a set of skills $S_i$
    ➢ and a geographical location $g_i$

➢ Projects
    ➢ Every project $P$ is associated with a set of skills required for completing the project

➢ A social coding network of developers ($G = (V, E, w)$)
    ➢ Weight on the edge indicates communication cost
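One possible in-memory representation of these ingredients is sketched below. The class and method names (`Developer`, `SocialCodingNetwork`, `add_edge`, `weight`) and the sample values are hypothetical and only illustrate the structure: developers with skills and a location, plus an undirected weighted network.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Developer:
    name: str
    skills: frozenset   # the skill set S_i
    location: str       # the geographical location g_i, e.g. a country code


@dataclass
class SocialCodingNetwork:
    developers: dict = field(default_factory=dict)  # name -> Developer (the set V)
    weights: dict = field(default_factory=dict)     # frozenset({u, v}) -> w(u, v)

    def add_developer(self, dev):
        self.developers[dev.name] = dev

    def add_edge(self, u, v, w):
        # Edges are undirected, so the pair is stored as an unordered set.
        self.weights[frozenset((u, v))] = w

    def weight(self, u, v):
        # Returns None when u and v are not directly connected.
        return self.weights.get(frozenset((u, v)))


# Hypothetical sample data.
net = SocialCodingNetwork()
net.add_developer(Developer("a", frozenset({"python"}), "CN"))
net.add_developer(Developer("b", frozenset({"java", "python"}), "AU"))
net.add_edge("a", "b", 0.2)
print(net.weight("b", "a"))  # 0.2
```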

Motivation

[Figure: example social coding network of developers a, b, c, d, e with edge weights 0.2, 0.2, 0.3, 0.3, 0.1; legend: Location, Developer, Skill, Project]

❖ Given a project and a social coding network of developers, find the subset (team) of developers such that
    I. each skill in the project is covered by the specified number of developers
    II. each developer covers exactly one skill
    III. the communication cost and geographical proximity cost are as low as possible
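Conditions I and II can be checked mechanically for a candidate team. The sketch below is one reading of them, assuming that "the specified number" means exactly that many developers per skill; `is_feasible` and the sample data are hypothetical. Condition III is an objective rather than a constraint, so it is not checked here.

```python
from collections import Counter


def is_feasible(assignment, required, skills_of):
    """Check conditions I and II for a candidate team.

    assignment: developer -> the single skill they cover in the team
    required:   skill -> number of developers the project needs for it
    skills_of:  developer -> set of skills that developer actually has
    """
    # Condition II: a dict maps each developer to exactly one skill, but the
    # developer must actually possess the skill they are assigned.
    if any(skill not in skills_of[dev] for dev, skill in assignment.items()):
        return False
    # Condition I (assumed to mean "exactly"): per-skill head counts must match.
    return Counter(assignment.values()) == Counter(required)


# Hypothetical sample data.
skills_of = {"a": {"python"}, "b": {"python", "java"}, "c": {"java"}}
required = {"python": 2, "java": 1}

print(is_feasible({"a": "python", "b": "python", "c": "java"}, required, skills_of))  # True
print(is_feasible({"a": "python", "c": "java"}, required, skills_of))                 # False
```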

Challenge

How to define the communication cost?

How to define the geographical proximity cost?

How to combine the communication cost and geographical proximity cost?

Communication Cost

❖ Communication cost between two developers

The communication cost is the sum of weights on the shortest path between two developers in the social coding network.

The lower the communication cost, the more easily they can collaborate with each other.

In social coding networks such as GitHub, the weight of an edge is defined as

$$w(u, v) = 1 - \frac{|N_u \cap N_v|}{|N_u \cup N_v|}$$

where $N_u$ and $N_v$ are the sets of projects in which $u$ and $v$, respectively, are listed as contributors.
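The edge weight is one minus the Jaccard similarity of the two contributors' project sets. A minimal sketch follows; the function name `edge_weight`, the sample project names, and the handling of two empty project sets are assumptions, since the slide does not cover that corner case.

```python
def edge_weight(projects_u, projects_v):
    """w(u, v) = 1 - |N_u ∩ N_v| / |N_u ∪ N_v| for two sets of project names."""
    union = projects_u | projects_v
    if not union:
        # Assumption: developers with no projects at all get the maximal weight.
        return 1.0
    return 1.0 - len(projects_u & projects_v) / len(union)


# Hypothetical project sets: 2 shared projects out of 4 distinct ones.
n_u = {"p1", "p2", "p3"}
n_v = {"p2", "p3", "p4"}
print(edge_weight(n_u, n_v))         # 0.5
print(edge_weight(n_u, n_u))         # 0.0 (identical project sets)
print(edge_weight({"p1"}, {"p9"}))   # 1.0 (no shared project)
```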

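Communication Cost

❖ Communication cost of a team

Given a social coding network $G$ whose edges are weighted by the communication cost between two developers and a team $T$ of developers from $G$, the communication cost of $T$ is defined as

$$SCC(T) = \sum_{i=1}^{|T|} \sum_{j=i+1}^{|T|} cc(e_i, e_j)$$

where $cc(e_i, e_j)$ is the communication cost between developers $e_i$ and $e_j$, and the double sum runs over every unordered pair of developers in $T$.

Kargar, Mehdi, and Aijun An. "Discovering top-k teams of experts with/without a leader in social networks."

[Figure: example network with nodes A, B, C, E and edge weights 0.5, 0.3, 0.2, 0.4]

SCC = 0.2 + 0.6 + 0.9 + 0.4 + 0.7 + 0.3 = 3.1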
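The team cost can be computed by running a shortest-path search from each team member and summing over unordered pairs. The sketch below uses Dijkstra's algorithm; the function names are hypothetical, and so is the graph: a chain A-B-C-E with weights 0.2, 0.4, 0.3, chosen only because its pairwise shortest-path costs match the six terms in the sum above. The actual topology of the slide's figure cannot be recovered from the transcript.

```python
import heapq
from itertools import combinations


def shortest_costs(graph, source):
    """Dijkstra: cheapest path cost from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def scc(graph, team):
    """Sum of communication costs over all unordered pairs in the team.

    Assumes every pair of team members is connected in the graph.
    """
    dist = {u: shortest_costs(graph, u) for u in team}
    return sum(dist[u][v] for u, v in combinations(team, 2))


# Hypothetical undirected chain A - B - C - E (adjacency dict, both directions).
graph = {
    "A": {"B": 0.2},
    "B": {"A": 0.2, "C": 0.4},
    "C": {"B": 0.4, "E": 0.3},
    "E": {"C": 0.3},
}
print(round(scc(graph, ["A", "B", "C", "E"]), 2))  # 3.1
```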

Geographical Proximity Cost

❖ Geographical proximity cost of a team

Given a team $T$ of experts, each having a location code, the geographical proximity cost of team $T$ is defined as

$$SGP(T) = \sum_{i=1}^{|T|} \sum_{j=i+1}^{|T|} gp(e_i, e_j)$$

where $gp(e_i, e_j)$ is the geographical proximity cost between developers $e_i$ and $e_j$.

The geographical proximity cost of two developers is defined as

$$gp(u, v) = \begin{cases} 0, & \text{if } u \text{ and } v \text{ are in the same region} \\ 1, & \text{otherwise} \end{cases}$$

This cost reflects differences in culture, work habits, and so on.
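Because the pairwise cost is 0 or 1, the team cost simply counts the pairs of developers located in different regions. A minimal sketch, with hypothetical function names and hypothetical region codes:

```python
from itertools import combinations


def gp(region_u, region_v):
    """Pairwise geographical proximity cost: 0 for the same region, 1 otherwise."""
    return 0 if region_u == region_v else 1


def sgp(team, region_of):
    """Sum of pairwise geographical proximity costs over the team."""
    return sum(gp(region_of[u], region_of[v]) for u, v in combinations(team, 2))


# Hypothetical location codes: three developers in one region, one in another.
region_of = {"a": "CN", "b": "CN", "c": "AU", "d": "CN"}
print(sgp(["a", "b", "c", "d"], region_of))  # 3: the pairs (a,c), (b,c), (c,d)
print(sgp(["a", "b", "d"], region_of))       # 0: everyone is in the same region
```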

Combined Cost

❖ The combined cost function

Given a social coding network and a trade-off $\lambda$ between the communication cost and geographical proximity, we define the combined cost of the team $T$ as

$$ComCost(T) = (1 - \lambda) \times SCC(T) + \lambda \times SGP(T)$$

The parameter $\lambda$, varying from 0 to 1, indicates the trade-off between communication cost and geographical proximity cost.

Team formation by minimizing the combined cost is NP-hard.
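The effect of the trade-off parameter is easiest to see at its extremes. The sketch below evaluates the combined cost for hypothetical values SCC = 3.1 and SGP = 3; the function name `combined_cost` and the range check are assumptions added for illustration.

```python
def combined_cost(scc_value, sgp_value, lam=0.5):
    """ComCost = (1 - lam) * SCC + lam * SGP, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * scc_value + lam * sgp_value


# Hypothetical team costs.
print(combined_cost(3.1, 3, lam=0.0))  # 3.1: only communication cost counts
print(combined_cost(3.1, 3, lam=1.0))  # 3.0: only geographical proximity counts
print(combined_cost(3.1, 3, lam=0.5))  # about 3.05: both weighted equally
```

Note that the two terms are on different scales (a sum of path weights versus a count of cross-region pairs), so the choice of the trade-off value determines which term dominates.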

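GA-based Optimization

❖ Genetic algorithm based optimization

Encoding

[Figure: chromosome encoding of a team, with developers d1, d2, ..., d8 assigned to Skill1, Skill2, Skill3, Skill4]

[Flowchart: Encoding → Initial population generation → Evaluation → Termination criterion? If yes, output the solution set; if no, apply Selection → Crossover → Mutation and return to Evaluation]

Evaluation

Fitness function = combined cost function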
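The loop in the flowchart can be sketched as follows. The slide names only the stages, so the concrete operators here are assumptions: binary tournament selection, one-point crossover, single-gene mutation, and a large penalty for teams that use the same developer twice. The function `ga_team`, its parameters, and the sample data are hypothetical; the default parameter values follow the experimental setup reported in this deck.

```python
import random
from itertools import combinations


def ga_team(slots, candidates, cost, pop_size=200, generations=100,
            p_cross=0.2, p_mut=0.8, seed=0):
    """Search for a low-cost team with a simple genetic algorithm.

    slots:      one skill name per team position (repeat a skill that needs
                several developers)
    candidates: skill -> list of developers having that skill
    cost:       function mapping a list of distinct developers to a cost
    """
    rng = random.Random(seed)

    def random_team():
        return [rng.choice(candidates[s]) for s in slots]

    def fitness(team):
        # Lower is better. A developer placed in two slots is penalised so
        # that each developer ends up covering exactly one skill.
        members = sorted(set(team))
        return cost(members) + 1e6 * (len(team) - len(members))

    def tournament(pop):
        a, b = rng.sample(pop, 2)
        return a if fitness(a) <= fitness(b) else b

    pop = [random_team() for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            child = list(p1)
            if len(slots) > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, len(slots))   # one-point crossover
                child = p1[:cut] + p2[cut:]
            if rng.random() < p_mut:
                i = rng.randrange(len(slots))        # resample a single gene
                child[i] = rng.choice(candidates[slots[i]])
            children.append(child)
        pop = children
        best = min(pop + [best], key=fitness)        # remember the best team seen
    return best, fitness(best)


# Hypothetical pairwise combined costs between four developers.
pair_cost = {frozenset(p): c for p, c in [
    (("d1", "d2"), 0.9), (("d1", "d3"), 0.2), (("d1", "d4"), 0.8),
    (("d2", "d3"), 0.7), (("d2", "d4"), 0.6), (("d3", "d4"), 0.5)]}


def cost(team):
    return sum(pair_cost[frozenset(p)] for p in combinations(team, 2))


slots = ["python", "java"]
candidates = {"python": ["d1", "d2", "d3"], "java": ["d3", "d4"]}
best, value = ga_team(slots, candidates, cost, pop_size=50, generations=30)
print(best, value)
```

In this toy instance an exhaustive check shows that the cheapest valid team is d1 for the first slot and d3 for the second, with cost 0.2, which gives a reference point for judging what the search returns.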

Dataset

GitHub: 36,701 developers, 3,532,453 projects, 1,610,072 edges.

[Figure: (a) Top 10 countries with the largest number of developers. (b) Distribution of location diversity, considering the composition of teams.]

Observation: in most teams (nearly 55%), the developers come from only one or two countries.

Experiments

v Experiments Setup

Parameter Value

Population size 200

Number of generation 100

Crossover probability 0.2

Mutation probability 0.8

Number of skills 𝑘 10

Tradeoff 𝜆 0.5

Iterations for each experiments 10

Experiments

❖ Evaluation metrics

1. Communication cost: reveals the efficiency of communication between developers
2. Geographical proximity cost: reveals how close the developers of the team are in terms of geographical location
3. Combined cost: reveals the joint effect of communication cost and geographical proximity cost

❖ Performance comparison

1. Random Algorithm
2. Approximation Rare Algorithm
3. Minimum Cost Contribution Rare Algorithm

Experiments

❖ Experimental results

Analysis:
1. The proposed model achieves better performance because it considers the geographical proximity during the process of finding an optimal team.
2. The proposed model achieves better performance because it has a larger search space.
3. The proposed model achieves better performance because it considers both costs.

Experiments

❖ Impact of number of skills

To study the impact of the number of skills on performance, we vary $k$ over {2, 4, 6, 8, 10}. For each $k$, we generate 10 random projects and report the average result. The proposed model always achieves better performance.

Conclusion and Future Work

❖ Conclusion

1. Exploit the geographical location of developers to boost the performance of team formation in social coding sites.
2. Incorporate the communication cost and geographical proximity cost into a unified objective function and employ a genetic algorithm to optimize it.
3. Comprehensive experiments on a real-world dataset illustrate the effectiveness of the proposed approach.

❖ Future Work

1. Investigate the impact of social media on the performance of team formation.
2. Exploit the interaction patterns for the accurate interpretation of link strength between developers.

Q & A

Thank you!
[email protected]