mining development repositories to study the impact of collaboration on software systems
DESCRIPTION
Talk given at the 2011 ESEC/FSETRANSCRIPT
SOFTWARE ANALYSIS
& INTELLIGENCE LAB
Mining Development Repositories to Study the Impact of
Collaboration on Software Systems
Nicolas [email protected]
1Wednesday, 11 April, 12
Software Development is a Social Activity
Source Code stands in direct relation to organizational structure. [Conway:Datamation:1968]
Developers spent large part of work day communicating with fellow developers. [Begel:ICSE:2010]
2Wednesday, 11 April, 12
Communication is Critical for Success
Communication is the most referenced problem in distributed development.
[Bird:ACMComm:2009][Grinter:GROUP:1999]
3Wednesday, 11 April, 12
Research Hypothesis
“The collaboration between stakeholders impacts the code quality and the development
community of a software system.”
4Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
5Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
6Wednesday, 11 April, 12
Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
7Wednesday, 11 April, 12
Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Communication Data
7Wednesday, 11 April, 12
Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Communication Data• Source Code Comments• Change-Log Messages• Developer Emails & Discussions• Support Dialogues
7Wednesday, 11 April, 12
In this report, you have defined a parameter named blocksize,which is given a value of "7|D|1|D". In open script of data set, there are below lines code:
<script begin>token=Packages.java.util.StringTokenizer(params["blocksize"],"|");vec=new Packages.java.util.Vector();while(token.hasMoreTokens()){ vec.addElement(token.nextToken());}params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));</script end>
Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0) is "7", and then it can not be parsed to int value. In 1.0.1,the value of params["blocksize"] might be 7|D|1|D, so it can be parsed to int value of 7.
Eclipse #150222
Extraction and processing of unstructured data is challenging. [MUD:Workshop:2010]
Communication Data Exists Mainly as Unstructured Data
8Wednesday, 11 April, 12
Mining Collaboration Data
[Bettenburg:ICPC:2011]
A Lightweight Approach to Uncover Technical Information in Unstructured Data
Nicolas Bettenburg, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab
Queen’s University
Kingston, Ontario, Canada
Email: {nicbet,bram,ahmed}@cs.queensu.ca
Michel SmidtDept. of Computer Science
University of BremenBremen, Germany
Email: [email protected]
A
b
s
t
r
a
c
t
—Developer communication through email, chat, or
issue report comments consists mostly of largely unstructured
data, i.e., natural language text, mixed with technical informa-
tion such as project-specific jargon, abbreviations, source code
patches, stack traces and identifiers. These technical artifacts
represent a valuable source of knowledge on the technical
part of the system, with a wide range of applications from
establishing traceability links to creating project-specific vo-
cabularies. However, the free-style delimiters between natural
language and technical content make the mining of technical
artifacts challenging. As a first step towards a general-purpose
technique to extracting all kinds of technical information
from unstructured data, we present a lightweight approach
to untangle technical artifacts and natural language text. Our
approach is based on existing spell checking tools, which are
well-understood, fast, readily available across platforms and
impartial to different kinds of technical artifacts. Through a
handcrafted benchmark, we demonstrate that our approach
is able to successfully uncover a wide range of technical
information in unstructured data.
K
e
y
w
o
r
d
s
-text mining, language analysis, unstructured data,
technical information.
I. INTRODUCTION
Every software system has a unique history of design
decisions, software changes, as well as development and
maintenance effort. This history is captured throughout the
development process in the variety of repositories used to
store data during the collaborative development process. As
this data contains the knowledge and rationale behind the
evolution of a software system, it is valuable for many differ-
ent fields, in particular program comprehension, and hence
should be made available to practitioners and researchers
alike.However, much of the information surrounding the devel-
opment process comes in the form of unstructured data [1],
which is conceptually different from the sources of struc-
tured data that researchers have used in previous research.
Structured data (e.g., source code) is well-defined and can
be readily parsed and understood by computer machinery.
Unstructured data (e.g., developer communication, issue
reports, documentation, email or meeting notes [2]), consists
of a mixture of natural language text and technical informa-
tion, such as code fragments, abbreviations, references to
objects in the source code, file names, logging information
Build ID: M20070212-1330
Steps To Reproduce:
1. Create a plugin for eclipse that includes a key binding for "M1+S" (ie. Alt+S)
where S is any letter that is used as a mnemonic in one of the top level
menus. Since eclipse uses "S" as the mnemonic for Help > &Software Updates,
"S" is sufficient. 2. Launch the plugin as part of Eclipse IDE
3. Press Alt+H to bring down the Help menu (to go along with our example in #1)
BUG: Notice "Software Updates" is missing its mnemonic.
More information: The code after "if (callback.isAcceleratorInUse
(SWT.ALT | character))" inside
Eclipse's MenuManager.java removes the mnemonic, but it seems like Eclipse
should be checking "isAcceleratorInUse" only for top level menumanagers like
File,Edit,...,Help, etc. :
/* (non-Javadoc)
* @see org.eclipse.jface.action.IContributionItem#update(java.lan
g.String)
*/ public void update(String pro
perty) {
IContributionItem items[] = getItems();
for (int i = 0; i < items.len
gth; i++) {
items[i].update(property);
} [...] } Any status on this bug?
I'd consider any contributions for M6 (API) or M7 (non-API) [...]
A 3.5 fix would be to make that behaviour optional in MenuManager with API and
off by default early in 3.5, and to have the WorkbenchActionBuilder contributed
MenuManagers and actionSets/editorActions contributed MenuManagers turn it on
(if I can find MenuManagers in the correct place).
I'd like us to work with the SWT team to make sure we understand what the
correct platform behavior is, and make sure that we aren't getting in the way
of that. The current behavior (i.e. turning off mnemonics) seems odd to me, in
general. If we're going to fix this, we should fix it properly.
Figure 1. Examples of technical information uncovered by a prototype
implementation of the approach proposed in this paper. (Eclipse Platform
Bug #208626).
or project-specific terms. As such, mining unstructured data
is challenging: it is meant for the exchange of information
between humans, rather than automated processing using
computer machinery. Figure 1 presents an example of tech-
nical information commonly found in unstructured data.
Recent approaches for discovering technical information
in unstructured data [3]–[5] have focussed on recognizing
and extracting only particular types of technical information,
such as class names [3], stack traces, or patches [5]. In order
to resolve the inherent ambiguities between natural language
text and technical information, these approaches are highly
specialized and tailored towards their specific use cases, and
limited in their scope. Furthermore, many kinds of technical
information (e.g, project-specific jargon or abbreviations)
cannot be extracted by any of the existing techniques.
As a first step towards a lightweight, general-purpose ap-
proach to uncovering technical information in unstructured
data, this work presents an approach that makes use of state-
of-the-art tools for checking and correcting the spelling and
grammar of electronically written texts. Technical informa-
tion is conceptually different from natural language text: it
often consists of words that are not part of standard language
dictionaries, violate grammatical conventions, and do not
respect morphological language rules. These characteristics
render modern spellcheckers ideal candidates for lightweight
• Use Spellchecking• Empirical validation• Improved on state of the art
9Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
10Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
10Wednesday, 11 April, 12
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Proposed Approach
11Wednesday, 11 April, 12
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Proposed Approach
11Wednesday, 11 April, 12
Quantify Impact on Quality: Idea
Extracted Communication Data
12Wednesday, 11 April, 12
Quantify Impact on Quality: Idea
Extracted Communication Data
Social Metrics
compute
12Wednesday, 11 April, 12
Quantify Impact on Quality: Idea
Extracted Communication Data
Social Metrics
compute
Post-Release Defects
measure relationships
12Wednesday, 11 April, 12
DiscussionCONTENT
SocialSTRUCTURES
CommunicationDYNAMICS
Measures of WORKFLOW
4 Dimensionsof Measures
13Wednesday, 11 April, 12
Conceptual Approach
time
6 months
MeasureDiscussionMetrics
6 months
MeasurePost-Release
Bugs
LINK USING STATISTICAL MODELS14Wednesday, 11 April, 12
Findings of our work
(1) Social metrics explain post-release defects as good as code metrics.
15Wednesday, 11 April, 12
Findings of our work
(1) Social metrics explain post-release defects as good as code metrics.
(2) Combination of social metrics and code metrics is cumulative.
15Wednesday, 11 April, 12
Findings of our work
(1) Social metrics explain post-release defects as good as code metrics.
(2) Combination of social metrics and code metrics is cumulative.
(3) Identify factors that have positive and negative relationships with defects.
15Wednesday, 11 April, 12
Findings of our work
(1) Social metrics explain post-release defects as good as code metrics.
(2) Combination of social metrics and code metrics is cumulative.
(3) Identify factors that have positive and negative relationships with defects.
[ICPC‘2010] (Best Paper)[JEMSE?]
15Wednesday, 11 April, 12
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Proposed Approach
16Wednesday, 11 April, 12
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Proposed Approach
16Wednesday, 11 April, 12
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Proposed Approach
16Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
17Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
17Wednesday, 11 April, 12
Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
17Wednesday, 11 April, 12
Available Knowledge in Data
Code Review Systems Mailing Lists Issue Tracking Systems
Data on Management of Code Contributions
18Wednesday, 11 April, 12
Submission
Review VerificationIntegration
ProjectRepository
Patch
Feedback
FeedbackOK OK
Contribution Management
19Wednesday, 11 April, 12
Studying Impact on Community throughContribution Management
Study how contributors, reviewers, verifiers and the software are impacted by communication (anomalies)through statistical models.
Goal:
Example:Reviewers leaving community due to lack of feedback
20Wednesday, 11 April, 12
Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Workflow InformationSocial Networks
21Wednesday, 11 April, 12
mozilla
paulc
zhangchunlin
kbrosnan
sdwilsh
samuel.sidler+oldhasham8888
myles7897
deletesoftwareabillings
eddy_nigg
jmjeffery
sgautherie.bz
john.p.baker
l10n
adelfino
jo.hermans
jruderman
nightstalkerz
alice0775
hskupin
mmortal03
tchung
marcia
me.at.work
fittysix
steve.england
cbook
tonglebeak
ctalbert
VYV03354
ehsan
alex
nrthomas
aarobertxtr
smichaud shaver
johnjbartonmanujsabarwal
jdaggett
matt
bzbarsky
dtownsend
davemgarrett
info
stephen.donner
elmar.ludwig sdaugherty
mak77jdarmochwal
polidobj
vseerrortwalker
dietrich
mconnorbeltzner
steffen.wilberg
mano
highmind63
ria.klaassen
robert.bugzilla
edilee
kliu
faaborg
marco.zehesylvain.pasche bugzilla
rotisuliss
cl-bugs-new2
anselm.meyer
timwi
RainerStroebel
tomer
gavin.sharp
jbecerra
johnath
kev
martijn.martijn
cwwmozilla
longsonr
m-wada
zenikodveditz
matspal
philringnalda
zurtex
bomfog
cjcypoi02 corevette
masayukireed
phiw
timeless
matti
mh+mozilla
dao
klaas1988
sziadeh mark.finkle
UIJavaScriptEngine
XML Parser
Internet Explorer
Evolution of Code-Knowledge Communities
22Wednesday, 11 April, 12
Thesis Progress
Tools and techniquesfor mining communication repositories
Empirical Validationof presented tools and techniques
Empirical Validationof relationship between collaboration
and software quality.
Empirical Validationof relationship between collaboration
and development teams.
23Wednesday, 11 April, 12
Thesis Progress
Tools and techniquesfor mining communication repositories
Empirical Validationof presented tools and techniques
Empirical Validationof relationship between collaboration
and software quality.
Empirical Validationof relationship between collaboration
and development teams.
23Wednesday, 11 April, 12
Thesis Progress
Tools and techniquesfor mining communication repositories
Empirical Validationof presented tools and techniques
Empirical Validationof relationship between collaboration
and software quality.
Empirical Validationof relationship between collaboration
and development teams.
23Wednesday, 11 April, 12
Thesis Progress
Tools and techniquesfor mining communication repositories
Empirical Validationof presented tools and techniques
Empirical Validationof relationship between collaboration
and software quality.
Empirical Validationof relationship between collaboration
and development teams.
23Wednesday, 11 April, 12
Thesis Progress
Tools and techniquesfor mining communication repositories
Empirical Validationof presented tools and techniques
Empirical Validationof relationship between collaboration
and software quality.
Empirical Validationof relationship between collaboration
and development teams.
23Wednesday, 11 April, 12
Points for Discussion
• How to do evaluation of code-knowledge communities? (ground truth)?
• Applicability to industrial settings (almost no communication data records available)?
• Extend work to defect prediction?• Practical implications: management,
moderation, staffing, ... ?
24Wednesday, 11 April, 12