Systematic and Actionable Mining of Software Repositories
TRANSCRIPT
software evolution & architecture lab
Systematic and Actionable Mining of Software Repositories (Lectures 1 & 2)
University of Zurich, Switzerland http://seal.ifi.uzh.ch @ LASER summer school 2014
Harald Gall
Roadmap
‣ I. Some Mining Studies
‣ II. Software Quality Analysis and Mining
‣ III. Replication and Benchmarking
‣ IV. Tooling
What to mine
‣ Version Control Systems, Issue Trackers
‣ GitHub, StackOverflow
‣ Change locations, change types
‣ Questions asked by Web Developers
‣ Multiple Repositories
‣ Community Contributions
‣ Commit messages, security questions on GitHub
‣ For test suite improvement
‣ Static analysis alerts
‣ Code-to-comment mappings
What to predict
‣ Defects
‣ Issue assignees
‣ Bug fix times
‣ Bug life cycles
‣ Reassignment of bugs
‣ Blocking bugs in OSS
‣ Changes
Recent Trends
‣ Git, GitHub, StackOverflow
‣ Just-in-Time (cross-project) defect prediction [Fukushima et al., MSR ’14]
‣ Universal model for defect prediction
‣ Most Influential Paper, ICSE 2014: Zimmermann et al.
What do we want to know?
‣ Does distributed development affect software quality?
‣ Cross-project defect prediction: when does it work?
‣ Visual (Effort Estimation) Patterns in Issue Tracking Data
‣ Analyzing the co-evolution of Comments and Source Code
‣ Supporting Developers with Natural Language Queries
‣ Can Developer-Module Networks Predict Failures?
‣ Predicting the fix time of bugs
‣ Interactive Views for Analyzing Problem Reports
‣ Visual Understanding of Source Code Dependencies
Example: Most bug-prone files
‣ Which ones would you expect?
‣ Yes, the largest ones, or the packages [Koru et al., TSE ’09]
‣ How useful is this? (code, developers)
‣ Granularity matters!
Issues with Mining
‣ Traditional prediction models often make recommendations at a granularity that is too coarse to be applied in practice, e.g., packages [Kamei et al., TSE ’13; Shihab et al., ESEC/FSE ’12]
‣ “MSR’s potential is enormous. My chief concern is that we’ll be distracted by the pleasure of and opportunities for finding relationships in and across these repositories that aren’t causal.” David Notkin, IEEE Software ‘09
Overview
‣ Empirical studies on defect prediction
‣ (A) Distributed development & defects
‣ (B) Cross-project defect prediction
‣ (C) Fine-grained bug prediction
Actionable Studies: Example
‣ Goal: Bug Triaging
‣ Question: Who should fix this bug? [Anvik et al., ICSE ’06]
‣ Metrics: Similarity of bug fixes
‣ Actionable result: Similar (enough) bugs should be triaged to the same people for efficient bug resolution (e.g., avoiding re-assignment of bugs)
Defect prediction
‣ Learn a prediction model from historic data
‣ Predict defects for the same project
‣ Hundreds of prediction models exist
‣ Models work fairly well with precision and recall of up to 80%
Predictor         Precision  Recall
Pre-Release Bugs  73.80%     62.90%
Test Coverage     83.80%     54.40%
Dependencies      74.40%     69.90%
Code Complexity   79.30%     66.00%
Code Churn        78.60%     79.90%
Org. Structure    86.20%     84.00%

From: N. Nagappan, B. Murphy, and V. Basili. The influence of organizational structure on software quality. ICSE 2008.
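The precision and recall figures above can be read off a confusion matrix. A minimal sketch (not from the lecture; the labels are hypothetical) of how the two measures are computed for a binary "bug-prone" classifier:

```python
# Minimal sketch: precision and recall for the positive (bug-prone = 1) class.
def precision_recall(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels for 10 files (1 = bug-prone):
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p, r = precision_recall(actual, predicted)  # p = 3/4, r = 3/5
```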
Our research
Fine-Grained Changes vs. Code Churn
Method-Level Bug Prediction
Predicting the Type of Code Changes
From coarse- to fine-grained predictions
‣ Code churn can be imprecise
‣ Changes manifest in small areas such as condition changes, branch updates, or algorithmic tuning
‣ Source code changes would therefore be more helpful
Source code changes (SCC)
‣ cDecl = changes to class declarations
‣ oState = insertion and deletion of class attributes
‣ func = insertion and deletion of methods
‣ mDecl = changes to method declarations
‣ stmt = insertion or deletion of executable statements
‣ cond = changes to conditional expressions
‣ else = insertion and deletion of else-parts
Predicting bug-prone files
‣ Bug-prone vs. not bug-prone
‣ 10-fold cross validation
‣ whole data set split into 10 sub-samples of equal size
‣ train model with 9 sub-samples and predict bugs on the 10th
‣ repeat sub-sampling
‣ 10 runs later, performance measurements are averaged
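The procedure above can be sketched in a few lines. This is a generic illustration, not the authors' tooling; the sample list is hypothetical:

```python
# Minimal sketch of 10-fold cross validation splitting.
def k_fold_indices(n_samples, k=10):
    """Partition sample indices into k roughly equal folds."""
    indices = list(range(n_samples))
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, k=10):
    """Yield (train, test) splits; a model would be fit on `train`
    and evaluated on `test` in each of the k runs, then the k
    performance measurements would be averaged."""
    folds = k_fold_indices(len(samples), k)
    for i in range(k):
        train_idx = [j for t, f in enumerate(folds) if t != i for j in f]
        yield ([samples[j] for j in train_idx],
               [samples[j] for j in folds[i]])
```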
Prediction performance
‣ SCC performs significantly better than Code Churn (LM)
‣ Advanced learners are not always better
‣ Change types do not yield extra discriminatory power
‣ Even predicting the number of bugs is “possible”
‣ Non-linear regression model with a median R² = 0.79
‣ More information in: “Comparing Fine-Grained Source Code Changes And Code Churn For Bug Prediction”, Giger et al., MSR ’11
Christian Bird, Nachiappan Nagappan, Premkumar Devanbu, Harald Gall, Brendan Murphy
Distributed Software Development in Windows Vista wrt Code Quality
Distributed Team Development
■ 3,000+ Software Engineers
■ 3,300+ Binaries (dll’s, exe’s, sys’s)
■ 60+ MLOC
■ Developed in over 59 buildings, 21 campuses, 11 countries in Asia, Europe, and North America
Data Timeline
‣ We gather commits, code data, organizational data, etc. up to Vista RTM.
‣ We measure code quality as post-release failures in the first 6 months following Vista’s release
Examining Binaries
‣ We categorize the level of distribution by looking at the lowest-level entity from which at least 75% of the commits come
World 5.9%
Continent 0.2%
Locality 5.6%
Campus 17%
Cafeteria 2.3%
Building 68%
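The categorization rule above can be sketched as follows. The data model and entity names here are hypothetical, not the study's actual tooling:

```python
# Minimal sketch: find the lowest geographic level at which a single
# entity contributed >= 75% of a binary's commits.
LEVELS = ["building", "cafeteria", "campus", "locality", "continent", "world"]

def distribution_level(commits, threshold=0.75):
    """`commits` maps level -> {entity: commit_count} for one binary."""
    for level in LEVELS:
        counts = commits[level]
        total = sum(counts.values())
        if total and max(counts.values()) / total >= threshold:
            return level
    return "world"  # all commits trivially come from one world

# Hypothetical binary: 80% of commits from one building.
binary = {
    "building":  {"b26": 80, "b40": 20},
    "cafeteria": {"c1": 80, "c2": 20},
    "campus":    {"redmond": 100},
    "locality":  {"seattle": 100},
    "continent": {"na": 100},
    "world":     {"world": 100},
}
```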
Split the binaries into Collocated and Distributed:

Split  Collocated                     Distributed
1.     bldg                           cafe, cmps, lclty, cont, world
2.     bldg, cafe                     cmps, lclty, cont, world
3.     bldg, cafe, cmps               lclty, cont, world
4.     bldg, cafe, cmps, lclty        cont, world
5.     bldg, cafe, cmps, lclty, cont  world
No statistical difference!
We used linear regression to examine the effect of distribution on post-release failures
Split the binaries into Collocated and Distributed
‣ Despite differences in scale, the distributions are the same
Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, Brendan Murphy
Cross-project defect prediction: A large scale experiment on data vs. domain vs. process
Why cross-project prediction?
‣ Some projects do not have enough data to train prediction models or the data is of poor quality
‣ New projects do not have any data yet
‣ e.g., the version 1.0 cold-start problem for prediction
‣ Can such projects use models from other projects? (= cross-project prediction)
‣ When and to which extent does cross-project prediction work?
A first experiment: Firefox and IE
‣ Q1. To which extent is it possible to use cross-project data for defect prediction?
‣ Q2. Which kinds of systems are good predictors? What is the influence of data, domain, company, and process?
Firefox → IE: precision = 0.76; recall = 0.88
IE → Firefox: precision = 0.54; recall = 0.04

Firefox can predict defects in IE. But IE cannot predict Firefox. Why?
Experiment outline
‣ 12 systems with 28 datasets
‣ different versions
‣ different levels of analysis (components, files)
‣ Run all 622 cross-project combinations
‣ for example, Firefox and IE is one combination
‣ then train a model from Firefox data, test on IE
‣ ignore invalid combinations, e.g., do not train from Eclipse 3.0 and test on 2.1
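The enumeration of valid train/test combinations can be sketched like this. The dataset tuples are hypothetical; the validity rule shown is only the one named above (never train on a later version of the same project and test on an earlier one):

```python
# Minimal sketch: enumerate ordered cross-project train/test pairs,
# skipping invalid same-project "backward in time" combinations.
from itertools import permutations

def valid_combinations(datasets):
    """`datasets` is a list of (project, version) tuples."""
    for train, test in permutations(datasets, 2):
        if train[0] == test[0] and train[1] > test[1]:
            continue  # e.g., never train on Eclipse 3.0 and test on 2.1
        yield train, test

# Hypothetical dataset list:
data = [("Eclipse", 2.1), ("Eclipse", 3.0), ("Firefox", 3.2), ("IE", 7.0)]
pairs = list(valid_combinations(data))
```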
Systems studied
Project Total LOC Total Churn
Firefox 3.2 – 3.3 MLOC 0.64 – 0.95 MLOC
Internet Explorer 2.30 MLOC 2.20 MLOC
Direct-X 1.50 MLOC 1.00 MLOC
Internet Information Services (IIS) 2.00 MLOC 1.20 MLOC
Clustering 0.65 MLOC 0.84 MLOC
Printing 2.40 MLOC 2.20 MLOC
File system 2.00 MLOC 2.20 MLOC
Kernel 1.90 MLOC 3.20 MLOC
SQL Server 2005 4.6 MLOC 7.2 MLOC
Eclipse 0.79 – 1.3 MLOC 1.0 - 2.1 MLOC
Apache Derby 0.49 – 0.53 MLOC 4 – 23 KLOC
Apache Tomcat 0.25 – 0.26 MLOC 8 – 98 KLOC
Data used in prediction models
‣ Relative code measures on churn, complexity, and pre-release bugs
‣ Added LOC / Total LOC
‣ Deleted LOC / Total LOC
‣ Modified LOC / Total LOC
‣ (Added + deleted + modified LOC) / (Commits + 1)
‣ Cyclomatic Complexity / Total LOC
‣ Pre-release bugs / Total LOC
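The six normalized predictors above can be computed directly from raw counts. A minimal sketch with hypothetical field names and values (not the study's actual pipeline):

```python
# Minimal sketch: the relative code measures for one component.
def relative_measures(added, deleted, modified, total_loc,
                      commits, complexity, pre_release_bugs):
    """Return the six normalized predictors as a dict."""
    return {
        "added_ratio":        added / total_loc,
        "deleted_ratio":      deleted / total_loc,
        "modified_ratio":     modified / total_loc,
        "churn_per_commit":   (added + deleted + modified) / (commits + 1),
        "complexity_density": complexity / total_loc,
        "bug_density":        pre_release_bugs / total_loc,
    }

# Hypothetical component:
m = relative_measures(added=200, deleted=100, modified=100,
                      total_loc=1000, commits=7,
                      complexity=300, pre_release_bugs=5)
```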
Issues & Actions
‣ Issues:
‣ Privacy and Utility in Cross-Project Defect Prediction
‣ Actionable?
Method-level bug prediction
Emanuel Giger, Martin Pinzger, Harald Gall
Code Churn vs. Fine-Grained Changes
H1: Fine-grained code changes exhibit a stronger correlation with the number of bugs.
H2: Fine-grained code changes achieve better performance in classifying source files into bug-prone and not bug-prone files.
Data set
‣ 15 Eclipse plugins
‣ ~850’000 fine-grained source code changes (SCC)
‣ ~9’700’000 lines modified (= Code Churn)
‣ ~9 years of development history...
‣ and enough bugs ;-)
SCC correlation with bugs
• Entire dataset
• Spearman rank correlation
• Code Churn: 0.66
• Fine-grained changes: 0.77
• Non-parametric test, α = 0.05: significant
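For reference, Spearman rank correlation is just Pearson correlation on the rank vectors. A minimal self-contained sketch (not the study's tooling; the per-file counts are hypothetical):

```python
# Minimal sketch: Spearman rank correlation between two per-file metrics.
def ranks(values):
    """Average 1-based ranks; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

changes = [5, 1, 12, 7, 3]  # hypothetical per-file change counts
bugs    = [2, 0, 6, 3, 1]   # hypothetical per-file bug counts
rho = spearman(changes, bugs)  # perfectly monotonic -> 1.0
```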
Performance for Bug-Proneness
‣ Logistic Regression classifying files as bug-prone vs. not bug-prone
‣ Fine-grained changes: AUC 0.90
‣ Code Churn: AUC 0.85
‣ Non-parametric test, α = 0.05: significant
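An AUC value like the ones reported here can be computed without plotting an ROC curve: it equals the probability that a randomly chosen bug-prone file is scored above a randomly chosen clean one. A minimal sketch with hypothetical scores (not the study's tooling):

```python
# Minimal sketch: AUC via pairwise comparison (ties count 0.5).
def auc(scores, labels):
    """`scores`: model outputs; `labels`: 1 = bug-prone, 0 = not."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores for 6 files:
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1,   1,   1,   0,   0,   0]
```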
Method-Level Prediction
‣ 11 methods per class on average
‣ focus on the few that are bug-prone
‣ Retrieving bug-prone methods can help to save manual inspection and to improve testing effort allocation
H: Using fine-grained changes, we can predict bug localities at the method level.
Q: To which extent is it possible to build method-level prediction models using change and source code metrics?
Commit Message 1.3: “Bug #007: self-references fixed [...]”
Study: Eclipse, Azureus, Jena, OpenXava, and Apache projects; fine-grained change measures and code metrics feed a Random Forest classifier that labels methods as bug-prone or not bug-prone.
AUC: 0.90, Precision: 0.84, Recall: 0.82
Apache Derby
‣ Derby Release 10.2.2.0
‣ 12% bug-prone methods (post-release)
‣ AUC 0.90, P = 0.53, R = 0.70
‣ Class TernaryOperatorNode: 30 methods, 6 bug-prone methods with fixes in release 10.3.1.4
‣ Actionable?
What about Performance of Defect Prediction Models?
Jayalath Ekanayake, Jonas Tappolet, Abraham Bernstein, Harald Gall
Prediction Quality vs. Time
‣ Eclipse heat-map: Prediction quality on same target using different training periods with the point of highest AUC highlighted
Studies and Issues
‣ Bug predictions do work
‣ Cross-project predictions do not really work
‣ Data sets (systems) need to be “harmonized”
‣ Data preprocessing and learners need to be calibrated
‣ Studies need to be replicable (systematically)
‣ Periods of stability vs. drift
Future of MSR?
Vision statement                          Author              Status
Answer Commonly Asked Project Questions   Michael W. Godfrey  partly
Software Repositories: A Strategic Asset  Ahmed E. Hassan     yes
Create Centralized Data Repositories      James Herbsleb      yes
Embed Mining in Developer Tools           Gail C. Murphy      partly
Help Developers Search for Information    Martin Robillard    partly
Deploy Mining to Industry                 Audris Mockus       ongoing
Let Us Not Mine for Fool’s Gold           David Notkin        ongoing

based on a Software Roundtable published in IEEE Software 2009