Systematic and Actionable Mining of Software Repositories
TRANSCRIPT
software evolution & architecture lab
Systematic and Actionable Mining of Software Repositories (Lectures 1 & 2)
University of Zurich, Switzerland http://seal.ifi.uzh.ch @ LASER summer school 2014
Harald Gall
Roadmap
‣ I. Some Mining Studies
‣ II. Software Quality Analysis and Mining
‣ III. Replication and Benchmarking
‣ IV. Tooling
What to mine
‣ Version Control Systems, Issue Trackers
‣ GitHub, StackOverflow
‣ Change locations, change types
‣ Questions asked by Web Developers
‣ Multiple Repositories
‣ Community Contributions
‣ Commit messages, security questions on GitHub
‣ For test suite improvement
‣ Static analysis alerts
‣ Code-to-comment mappings
What to predict
‣ Defects
‣ Issue assignees
‣ Bug fix times
‣ Bug life cycles
‣ Reassignment of bugs
‣ Blocking bugs in OSS
‣ Changes
Recent Trends
‣ Git, GitHub, StackOverflow
‣ Just-in-Time (cross-project) defect prediction [Fukushima et al., MSR ’14]
‣ Universal model for defect prediction
‣ Most Influential Paper, ICSE 2014: Zimmermann et al.
What do we want to know?
‣ Does distributed development affect software quality?
‣ Cross-project defect prediction: when does it work?
‣ Visual (Effort Estimation) Patterns in Issue Tracking Data
‣ Analyzing the co-evolution of Comments and Source Code
‣ Supporting Developers with Natural Language Queries
‣ Can Developer-Module Networks Predict Failures?
‣ Predicting the fix time of bugs
‣ Interactive Views for Analyzing Problem Reports
‣ Visual Understanding of Source Code Dependencies
Example: Most bug-prone files
‣ Which ones would you expect?
‣ Yes, the largest ones, or the packages [Koru et al., TSE ’09]
‣ How useful is this? (code, developers)
‣ Granularity matters!
Issues with Mining
‣ Traditional prediction models often make recommendations at a granularity that is too coarse to be applied in practice, e.g., packages [Kamei et al., TSE ’13; Shihab et al., ESEC/FSE ’12]
‣ “MSR’s potential is enormous. My chief concern is that we’ll be distracted by the pleasure of and opportunities for finding relationships in and across these repositories that aren’t causal.” David Notkin, IEEE Software ‘09
Overview
‣ Empirical studies on defect prediction
‣ (A) Distributed development & defects
‣ (B) Cross-project defect prediction
‣ (C) Fine-grained bug prediction
Actionable Studies: Example
‣ Goal: Bug Triaging
‣ Question: Who should fix this bug? [Anvik et al., ICSE ’06]
‣ Metrics: Similarity of bug fixes
‣ Actionable result: Similar (enough) bugs should be triaged to the same people for efficient bug resolution (e.g., avoiding re-assignment of bugs)
Defect prediction
‣ Learn a prediction model from historic data
‣ Predict defects for the same project
‣ Hundreds of prediction models exist
‣ Models work fairly well with precision and recall of up to 80%
Predictor         Precision  Recall
Pre-Release Bugs  73.80%     62.90%
Test Coverage     83.80%     54.40%
Dependencies      74.40%     69.90%
Code Complexity   79.30%     66.00%
Code Churn        78.60%     79.90%
Org. Structure    86.20%     84.00%

From: N. Nagappan, B. Murphy, and V. Basili. The influence of organizational structure on software quality. ICSE 2008.
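The precision and recall figures above can be read off a confusion matrix. A minimal sketch (not from the lecture; the labels are hypothetical) of how the two measures are computed for a binary "bug-prone" classifier:

```python
# Minimal sketch: precision and recall for the positive (bug-prone = 1) class.
def precision_recall(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels for 10 files (1 = bug-prone):
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p, r = precision_recall(actual, predicted)  # p = 3/4, r = 3/5
```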
Our research
Fine-Grained Changes vs. Code Churn
Method-Level Bug Prediction
Predicting the Type of Code Changes
From coarse- to fine-grained predictions
‣ Code churn can be imprecise
‣ Changes manifest in small areas such as condition changes, branch updates, or algorithmic tuning
‣ Source code changes would therefore be more helpful
Source code changes (SCC)
‣ cDecl = changes to class declarations
‣ oState = insertion and deletion of class attributes
‣ func = insertion and deletion of methods
‣ mDecl = changes to method declarations
‣ stmt = insertion or deletion of executable statements
‣ cond = changes to conditional expressions
‣ else = insertion and deletion of else-parts
Predicting bug-prone files
‣ Bug-prone vs. not bug-prone
‣ 10-fold cross validation
‣ whole data set split into 10 sub-samples of equal size
‣ train model with 9 sub-samples and predict bugs on the 10th
‣ repeat sub-sampling
‣ 10 runs later, performance measurements are averaged
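The procedure above can be sketched in a few lines. This is a generic illustration, not the authors' tooling; the sample list is hypothetical:

```python
# Minimal sketch of 10-fold cross validation splitting.
def k_fold_indices(n_samples, k=10):
    """Partition sample indices into k roughly equal folds."""
    indices = list(range(n_samples))
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, k=10):
    """Yield (train, test) splits; a model would be fit on `train`
    and evaluated on `test` in each of the k runs, then the k
    performance measurements would be averaged."""
    folds = k_fold_indices(len(samples), k)
    for i in range(k):
        train_idx = [j for t, f in enumerate(folds) if t != i for j in f]
        yield ([samples[j] for j in train_idx],
               [samples[j] for j in folds[i]])
```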
Prediction performance
‣ SCC performs significantly better than Code Churn (LM)
‣ Advanced learners are not always better
‣ Change types do not yield extra discriminatory power
‣ Even predicting the number of bugs is “possible”
‣ Non-linear regression model with a median R² = 0.79
‣ More information in: “Comparing Fine-Grained Source Code Changes And Code Churn For Bug Prediction”, Giger et al., MSR ’11
Christian Bird, Nachiappan Nagappan, Premkumar Devanbu, Harald Gall, Brendan Murphy
Distributed Software Development in Windows Vista wrt Code Quality
Distributed Team Development
■ 3,000+ Software Engineers
■ 3,300+ Binaries (dll’s, exe’s, sys’s)
■ 60+ MLOC
■ Developed in over 59 buildings, 21 campuses, 11 countries in Asia, Europe, and North America
Data Timeline
‣ We gather commits, code data, organizational data, etc. up to Vista RTM.
‣ We measure code quality as post-release failures in the first 6 months following Vista’s release
Examining Binaries
‣ We categorize the level of distribution by looking at the lowest-level entity from which at least 75% of the commits come
World 5.9%
Continent 0.2%
Locality 5.6%
Campus 17%
Cafeteria 2.3%
Building 68%
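The categorization rule above can be sketched as follows. The data model and entity names here are hypothetical, not the study's actual tooling:

```python
# Minimal sketch: find the lowest geographic level at which a single
# entity contributed >= 75% of a binary's commits.
LEVELS = ["building", "cafeteria", "campus", "locality", "continent", "world"]

def distribution_level(commits, threshold=0.75):
    """`commits` maps level -> {entity: commit_count} for one binary."""
    for level in LEVELS:
        counts = commits[level]
        total = sum(counts.values())
        if total and max(counts.values()) / total >= threshold:
            return level
    return "world"  # all commits trivially come from one world

# Hypothetical binary: 80% of commits from one building.
binary = {
    "building":  {"b26": 80, "b40": 20},
    "cafeteria": {"c1": 80, "c2": 20},
    "campus":    {"redmond": 100},
    "locality":  {"seattle": 100},
    "continent": {"na": 100},
    "world":     {"world": 100},
}
```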
Split the binaries into Collocated and Distributed:

Split  Collocated                     Distributed
1.     bldg                           cafe, cmps, lclty, cont, world
2.     bldg, cafe                     cmps, lclty, cont, world
3.     bldg, cafe, cmps               lclty, cont, world
4.     bldg, cafe, cmps, lclty        cont, world
5.     bldg, cafe, cmps, lclty, cont  world
No statistical difference!
We used linear regression to examine the effect of distribution on post-release failures
Split the binaries into Collocated and Distributed
‣ Despite differences in scale, the distributions are the same
Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, Brendan Murphy
Cross-project defect prediction: A large scale experiment on data vs. domain vs. process
Why cross-project prediction?
‣ Some projects do not have enough data to train prediction models or the data is of poor quality
‣ New projects do not have any data yet
‣ e.g., the version 1.0 cold-start problem for prediction
‣ Can such projects use models from other projects? (= cross-project prediction)
‣ When and to which extent does cross-project prediction work?
A first experiment: Firefox and IE
‣ Q1. To which extent is it possible to use cross-project data for defect prediction?
‣ Q2. Which kinds of systems are good predictors? What is the influence of data, domain, company, and process?
Firefox → IE: precision = 0.76; recall = 0.88
IE → Firefox: precision = 0.54; recall = 0.04

Firefox can predict defects in IE. But IE cannot predict Firefox. Why?
Experiment outline
‣ 12 systems with 28 datasets
‣ different versions
‣ different levels of analysis (components, files)
‣ Run all 622 cross-project combinations
‣ for example, Firefox and IE is one combination
‣ then train a model from Firefox data, test on IE
‣ ignore invalid combinations, e.g., do not train from Eclipse 3.0 and test on 2.1
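The enumeration of valid train/test combinations can be sketched like this. The dataset tuples are hypothetical; the validity rule shown is only the one named above (never train on a later version of the same project and test on an earlier one):

```python
# Minimal sketch: enumerate ordered cross-project train/test pairs,
# skipping invalid same-project "backward in time" combinations.
from itertools import permutations

def valid_combinations(datasets):
    """`datasets` is a list of (project, version) tuples."""
    for train, test in permutations(datasets, 2):
        if train[0] == test[0] and train[1] > test[1]:
            continue  # e.g., never train on Eclipse 3.0 and test on 2.1
        yield train, test

# Hypothetical dataset list:
data = [("Eclipse", 2.1), ("Eclipse", 3.0), ("Firefox", 3.2), ("IE", 7.0)]
pairs = list(valid_combinations(data))
```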
Systems studied
Project Total LOC Total Churn
Firefox 3.2 – 3.3 MLOC 0.64 – 0.95 MLOC
Internet Explorer 2.30 MLOC 2.20 MLOC
Direct-X 1.50 MLOC 1.00 MLOC
Internet Information Services (IIS) 2.00 MLOC 1.20 MLOC
Clustering 0.65 MLOC 0.84 MLOC
Printing 2.40 MLOC 2.20 MLOC
File system 2.00 MLOC 2.20 MLOC
Kernel 1.90 MLOC 3.20 MLOC
SQL Server 2005 4.6 MLOC 7.2 MLOC
Eclipse 0.79 – 1.3 MLOC 1.0 - 2.1 MLOC
Apache Derby 0.49 – 0.53 MLOC 4 – 23 KLOC
Apache Tomcat 0.25 – 0.26 MLOC 8 – 98 KLOC
Data used in prediction models
‣ Relative code measures on churn, complexity, and pre-release bugs
‣ Added LOC / Total LOC
‣ Deleted LOC / Total LOC
‣ Modified LOC / Total LOC
‣ (Added + deleted + modified LOC) / (Commits + 1)
‣ Cyclomatic Complexity / Total LOC
‣ Pre-release bugs / Total LOC
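The six normalized predictors above can be computed directly from raw counts. A minimal sketch with hypothetical field names and values (not the study's actual pipeline):

```python
# Minimal sketch: the relative code measures for one component.
def relative_measures(added, deleted, modified, total_loc,
                      commits, complexity, pre_release_bugs):
    """Return the six normalized predictors as a dict."""
    return {
        "added_ratio":        added / total_loc,
        "deleted_ratio":      deleted / total_loc,
        "modified_ratio":     modified / total_loc,
        "churn_per_commit":   (added + deleted + modified) / (commits + 1),
        "complexity_density": complexity / total_loc,
        "bug_density":        pre_release_bugs / total_loc,
    }

# Hypothetical component:
m = relative_measures(added=200, deleted=100, modified=100,
                      total_loc=1000, commits=7,
                      complexity=300, pre_release_bugs=5)
```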
Issues & Actions
‣ Issues:
‣ Privacy and Utility in Cross-Project Defect Prediction
‣ Actionable?
Method-level bug prediction
Emanuel Giger, Martin Pinzger, Harald Gall
Code Churn vs. Fine-Grained Changes
H1: Fine-grained code changes exhibit a stronger correlation with the number of bugs.
H2: Fine-grained code changes achieve better performance in classifying source files into bug-prone and not bug-prone files.
Data set
‣ 15 Eclipse plugins
‣ ~850’000 fine-grained source code changes (SCC)
‣ ~9’700’000 lines modified (= Code Churn)
‣ ~9 years of development history...
‣ and enough bugs ;-)
SCC correlation with bugs
• Entire dataset
• Spearman rank correlation
• Code Churn: 0.66
• Fine-grained changes: 0.77
• Non-parametric test, α = 0.05: significant
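For reference, Spearman rank correlation is just Pearson correlation on the rank vectors. A minimal self-contained sketch (not the study's tooling; the per-file counts are hypothetical):

```python
# Minimal sketch: Spearman rank correlation between two per-file metrics.
def ranks(values):
    """Average 1-based ranks; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

changes = [5, 1, 12, 7, 3]  # hypothetical per-file change counts
bugs    = [2, 0, 6, 3, 1]   # hypothetical per-file bug counts
rho = spearman(changes, bugs)  # perfectly monotonic -> 1.0
```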
Performance for Bug-Proneness
‣ Logistic Regression classifying files as bug-prone vs. not bug-prone
‣ Fine-grained changes: AUC 0.90
‣ Code Churn: AUC 0.85
‣ Non-parametric test, α = 0.05: significant
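An AUC value like the ones reported here can be computed without plotting an ROC curve: it equals the probability that a randomly chosen bug-prone file is scored above a randomly chosen clean one. A minimal sketch with hypothetical scores (not the study's tooling):

```python
# Minimal sketch: AUC via pairwise comparison (ties count 0.5).
def auc(scores, labels):
    """`scores`: model outputs; `labels`: 1 = bug-prone, 0 = not."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores for 6 files:
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1,   1,   1,   0,   0,   0]
```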
Method-Level Prediction
‣ 11 methods per class on average
‣ focus on the few that are bug-prone
‣ Retrieving bug-prone methods can help to save manual inspection and to improve testing effort allocation
H: Using fine-grained changes, we can predict bug localities at the method level.
Q: To which extent is it possible to build method-level prediction models using change and source code metrics?
Commit Message 1.3: “Bug #007: self-references fixed [...]”
Study: Eclipse, Azureus, Jena, OpenXava, and Apache projects; fine-grained change measures and code metrics feed a Random Forest classifier that labels methods as bug-prone or not bug-prone.
AUC: 0.90, Precision: 0.84, Recall: 0.82
Apache Derby
‣ Derby Release 10.2.2.0
‣ 12% bug-prone methods (post-release)
‣ AUC 0.90, P = 0.53, R = 0.70
‣ Class TernaryOperatorNode: 30 methods, 6 bug-prone methods with fixes in release 10.3.1.4
‣ Actionable?
What about Performance of Defect Prediction Models?
Jayalath Ekanayake, Jonas Tappolet, Abraham Bernstein, Harald Gall
Prediction Quality vs. Time
‣ Eclipse heat-map: Prediction quality on same target using different training periods with the point of highest AUC highlighted
Studies and Issues
‣ Bug predictions do work
‣ Cross-project predictions do not really work
‣ Data sets (systems) need to be “harmonized”
‣ Data preprocessing and learners need to be calibrated
‣ Studies need to be replicable (systematically)
‣ Periods of stability vs. drift
Future of MSR?
Vision statement                          Author              Status
Answer Commonly Asked Project Questions   Michael W. Godfrey  partly
Software Repositories: A Strategic Asset  Ahmed E. Hassan     yes
Create Centralized Data Repositories      James Herbsleb      yes
Embed Mining in Developer Tools           Gail C. Murphy      partly
Help Developers Search for Information    Martin Robillard    partly
Deploy Mining to Industry                 Audris Mockus       ongoing
Let Us Not Mine for Fool’s Gold           David Notkin        ongoing

based on a Software Roundtable published in IEEE Software 2009