understanding the evolution of software project communities
DESCRIPTION
Presented at SATTOSE (Seminar on Advanced Tools and Techniques on Software Evolution), Koblenz, Germany, 7 April 2011TRANSCRIPT
Understanding the evolution of software project communities
Tom Mens & Mathieu GoeminneService de Génie Logiciel, Institut d’Informatique
Faculté des Sciences, Université de Mons
Université de Mons Mathieu Goeminne & Tom Mens
• Study of open source software evolution
• Taking into account the community (social network) of persons surrounding a software project (developers, users, ...)
• By analysing and combining data from different types of repositories
Research topic
Thursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Goals
• Understand (through repository mining and analysis)
• How software quality is impacted by how the community is structured
• How do software product, project and community co-evolve
• How developers and users interact and influence one another
• Improve (through guidelines and tool support)
• The way in which communities should be structured
• The quality of the software process and product
• The interaction between developers and users
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Questions• Is there a core group (of developers and/or users) being significantly
more active than the others?
• How does a person contribute to different types of activities?
• Are the recurrent patterns of activity in the community?
• How do particular “events” impact the project?
• How does developer intake and turnover affect the project?
• Do we find evidence of Pareto principle (inequality of activity)
• Is there a “bus factor” effect?
• How does the activity distribution evolve over time?
• How does the activity distribution vary across different projects?
Université de Mons
Why open source?
• Free access to source code, defect data, developer and user communication
• Observable communities
• Observable activities
• Increasing popularity for personal and commercial use
• A huge range of community and software sizes
Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Methodology• Exploit available data from different repositories
• code repositories
• mail repositories (mailing lists)
• bug repositories (bug trackers)
• Select open source projects
• Use Herdsman framework
• Based on FLOSSMetrics data extraction
• Use identity merging tool
• Use of econometrics
• Use of statistical analysis and visualisation
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Selected Projects
• Criteria for selecting projects
• Availability of data from repositories
• Data processable by FLOSSMetrics tools
• CVSAnaly2, MLStats, Bicho
• Size of considered projects: persons involved, code size, activity in each repository
• Age of considered projects
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Selected ProjectsBrasero Evince Subversion Wine
versioning system git svn svn git
age (years) 8 11 11 11
size (KLOC) 107 580 422 2001
#commits 4100 4000 51529 74500
#mails 460 1800 24673 14000
#bugs 250 950 3300
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Herdsman framework
• Exploiting available data from different repositories
• code repositories: to detect committer activity
• mail repositories (mailing lists): to detect mailer activity
• bug repositories (bug trackers): to detect bug-related activities
Université de Mons
Herdsman Framework
Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Identity Merging
Comparing Merge Algorithms• Based on a reference merge model• manually created• iterative approach• relying on information contained in different files (COMMITTERS, MAINTAINERS, AUTHORS, NEWS, README)• Compute, for each algorithm, precision and recall w.r.t. reference model
Reference merge modelBrasero Evince
Before: one repository account = one personAfter: one person may have multiple accounts
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Comparing Merge Algorithms
• Simple
• Bird (code and mail repositories only)
• based on Levenshtein distance
• Bird extended for bug repositories
• Improved
• Combining ideas from Bird and Robles
Comparing Merge AlgorithmsBrasero Evince
Comparing Merge Algorithms(varying parameter values) - Evince
ImprovedSimple
Bird
Bird extended
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distribution
• How are developer activities (commits, mails, bug fixes) distributed?
• For a single release: do we observe an unequal distribution ?
• Over time: do we observe a change in this distribution ?
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distributionFor a single release
• Evidence of Pareto principle (20/80 rule)?
• Most activity is carried out by a small group of persons.
• Typically : 20% do 80% of the job.
• Doesn’t necessarily imply that the activity distribution follows a Pareto law
Université de Mons
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distributionBrasero
Evince
Wine
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
commits
mails
br changes
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distributionOver time
• Econometrics
• express inequality in a distribution
• aggregation metrics:Gini, Hoover, Theil (normalised)
• Values between 0 and 1
• 0 = perfect equality; 1 = perfect inequality
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distribution (aggregation indices for Evince)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Apr-99 Dec-99 Aug-00 Apr-01 Dec-01 Aug-02 Apr-03 Dec-03 Aug-04 Apr-05 Dec-05 Aug-06 Apr-07 Dec-07 Aug-08
Gini Hoover Theil (normalised)
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distribution (Gini index)
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
!#*"
!#+"
!#,"
$"
-./0!)" 1230!*" -./0!*" 1230!+" -./0!+" 1230!," -./0!," 1230$!" -./0$!"
.4556/7"
58697"
:;"0".<8=>?7"
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
!#*"
!#+"
!#,"
$"
-./0,," -./0!!" -./0!$" -./0!%" -./0!&" -./0!'" -./0!(" -./0!)" -./0!*" -./0!+" -./0!," -./0$!"
1233456"
37486"
9:;"/<.2/5"1=7>;<6"
Brasero
Evince
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
!#*"
!#+"
!#,"
$"
-./0,+"-./0,," 1230!!" 1230!$" 1230!%" 1230!&" 1230!'" 1230!(" 1230!)" 1230!*" 1230!+" 1230!," 1230$!"
2.44536"
47586"
9:;"<=>.<3"2?7@;=6"
Wine
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity DistributionConclusion
• Activity distributions seem to become more and more unequally distributed
• The Pareto principle is clearly present in studied projects
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Activity distribution Future work
• Identify the type of statistical distribution
• Use sliding windows for studying activity distribution over time
• useful to detect impact of personnel turnover
• ignore persons that have become inactive, and discover new active persons.
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Identifying core groups
• Display Venn diagrams of most active (top 20) persons, according to each definition of activity(committing, mailing, bug report changing)
• For each person, show the percentage of activity attributable to this person
• Take into account identity merges
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Identifying core groupsBrasero
Evince
Wine
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Identifying core groupsConclusion
• For Brasero and Evince, the activity is led by a limited number of persons involved in 2 or 3 of the defined activities.
• For Wine, it seems not to be the case.
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Identifying core groupsFuture work
• Automate this process for the entire project community
• Study the evolution of core groups over time
• Does a core group remain stable or does it change often? Why?
• Can we find evidence for a “bus factor”?
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
More future work
• Study correlation between community structure (social network) and source code quality (as computed using software metrics).
• Extend and refine types of activity:
• different types of commit activity (doc, source code, test, etc.); of mail activity (information, asking, answering, etc.); of bug repository activity (bug creation, modification and commenting)
Université de Mons Mathieu Goeminne & Tom MensThursday 7 april 2011, SATTOSE seminar, Koblenz, Germany
Thank you