keynote hotswup 2012

Will my system run (correctly) after the upgrade? Martin Pinzger Assistant Professor Delft University of Technology

Upload: martin-pinzger

Post on 29-Jun-2015


DESCRIPTION

Keynote for the Fourth Workshop on Hot Topics in Software Upgrades, co-located with ICSE 2012, Zurich, Switzerland

TRANSCRIPT

Page 1: Keynote HotSWUp 2012

Will my system run (correctly) after the upgrade?

Martin Pinzger
Assistant Professor
Delft University of Technology

Page 2: Keynote HotSWUp 2012

2

Pfunds

Martin’s upgrades

PhD

Postdoc

Assistant Professor

Page 3: Keynote HotSWUp 2012

My Experience with Software Upgrades

3

Page 4: Keynote HotSWUp 2012

4

Page 5: Keynote HotSWUp 2012

5

Page 6: Keynote HotSWUp 2012

6

Bugs on upgrades get reported

Page 7: Keynote HotSWUp 2012

Hmm, wait a minute

7

Can’t we learn “something” from that data?

Page 8: Keynote HotSWUp 2012

Software repository mining for preventing upgrade failures

Martin Pinzger
Assistant Professor
Delft University of Technology

Page 9: Keynote HotSWUp 2012

Goal of software repository mining

Making the information stored in software repositories available to software developers

Quality analysis and defect prediction

Recommender systems

...

9

Page 10: Keynote HotSWUp 2012

Software repositories

10

Page 11: Keynote HotSWUp 2012

Examples from my mining research

Predicting failure-prone source files using changes (MSR 2011)

The relationship between developer contributions and failures (FSE 2008)

There are many more studies: MSR 2012, http://2012.msrconf.org/

A survey and taxonomy of approaches for mining software repositories in the context of software evolution, Kagdi et al. 2007

11

Page 12: Keynote HotSWUp 2012

Using Fine-Grained Source Code Changes for Bug Prediction

Joint work with Emanuel Giger, Harald Gall
University of Zurich

Page 13: Keynote HotSWUp 2012

Bug prediction

Goal: Train models to predict the bug-prone source files of the next release

How: Using product measures, process measures, and organizational measures with machine learning techniques

Many existing studies on building prediction models: Moser et al., Nagappan et al., Zimmermann et al., Hassan et al., etc.

Process measures performed particularly well

13

Page 14: Keynote HotSWUp 2012

Classical change measures

Number of file revisions

Code Churn aka lines added/deleted/changed

Research question of this study: Can we further improve these models?

14

Page 15: Keynote HotSWUp 2012

Revisions are coarse grained

What did change in a revision?

15

Page 16: Keynote HotSWUp 2012

Code Churn can be imprecise

16

Extra changes not relevant for locating bugs

Page 17: Keynote HotSWUp 2012

Fine-Grained Source Code Changes (SCC)

Account.java 1.5:

if (balance > 0)
    withDraw(amount);

Account.java 1.6:

if (balance > 0 && amount <= balance)
    withDraw(amount);
else
    notify();

3 SCC: 1x condition change, 1x else-part insert, 1x invocation statement insert

17
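The change-type classification above can be sketched in code. The real approach diffs Java ASTs; the following is a toy Python analogue (identifiers and snippets are hypothetical) that recovers the three SCC of the slide's example:

```python
import ast

# Python transliteration of the slide's Java snippets (names are illustrative;
# the code is only parsed, never executed).
OLD = "if balance > 0:\n    withDraw(amount)"
NEW = ("if balance > 0 and amount <= balance:\n"
       "    withDraw(amount)\n"
       "else:\n"
       "    notify()")

def classify_if_changes(old_src, new_src):
    """Classify fine-grained changes between two versions of an if-statement."""
    old_if = ast.parse(old_src).body[0]
    new_if = ast.parse(new_src).body[0]
    changes = []
    # Compare the condition subtrees.
    if ast.dump(old_if.test) != ast.dump(new_if.test):
        changes.append("condition change")
    # Detect a newly inserted else-part and the statements inside it.
    if not old_if.orelse and new_if.orelse:
        changes.append("else-part insert")
        changes += ["statement insert"] * len(new_if.orelse)
    return changes

print(classify_if_changes(OLD, NEW))
# -> ['condition change', 'else-part insert', 'statement insert']
```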

Page 18: Keynote HotSWUp 2012

Research hypotheses

18

H1 SCC is correlated with the number of bugs in source files

H2 SCC is a predictor for bug-prone source files (and outperforms LM)

H3 SCC is a predictor for the number of bugs in source files (and outperforms LM)

Page 19: Keynote HotSWUp 2012

15 Eclipse plug-ins

Data:

>850'000 fine-grained source code changes (SCC)

>10'000 files

>9'700'000 lines modified (LM)

>9 years of development history

... and a lot of bugs referenced in commit messages

19

Page 20: Keynote HotSWUp 2012

H1: SCC is correlated with #bugs

Table 4: Non-parametric Spearman rank correlation of bugs, LM, and SCC. * marks significant correlations at α = 0.01. Larger values are printed in bold.

Eclipse Project   LM     SCC
Compare           0.68*  0.76*
jFace             0.74*  0.71*
JDT Debug         0.62*  0.80*
Resource          0.75*  0.86*
Runtime           0.66*  0.79*
Team Core         0.15*  0.66*
CVS Core          0.60*  0.79*
Debug Core        0.63*  0.78*
jFace Text        0.75*  0.74*
Update Core       0.43*  0.62*
Debug UI          0.56*  0.81*
JDT Debug UI      0.80*  0.81*
Help              0.54*  0.48*
JDT Core          0.70*  0.74*
OSGI              0.70*  0.77*
Median            0.66   0.77

[…] We used a Related Samples Wilcoxon Signed-Ranks Test on the values of the columns in Table 4. The rationale is that (1) we calculated both correlations for each project, resulting in a matched correlation pair per project, and (2) we can relax any assumption about the distribution of the values. The test was significant at α = 0.01, rejecting the null hypothesis that the two medians are the same. Based on these results we accept H2: SCC has a stronger correlation with bugs than code churn based on LM.
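A minimal sketch of this analysis, assuming SciPy and invented per-file counts (the real study uses the per-project data behind Table 4):

```python
from scipy.stats import spearmanr, wilcoxon

# Hypothetical per-file counts for one project (illustrative numbers only).
bugs = [0, 1, 0, 3, 5, 2, 0, 8, 1, 4]
lm   = [10, 40, 5, 60, 90, 55, 8, 300, 30, 120]   # lines modified
scc  = [1, 6, 0, 14, 22, 9, 1, 80, 4, 31]         # fine-grained changes

rho_lm, _ = spearmanr(bugs, lm)
rho_scc, _ = spearmanr(bugs, scc)
print(f"LM: rho={rho_lm:.2f}  SCC: rho={rho_scc:.2f}")

# Per-project correlation pairs (a truncated, illustrative subset of the
# LM/SCC columns of Table 4) are then compared with a Wilcoxon
# signed-rank test:
lm_rhos  = [0.68, 0.74, 0.62, 0.75, 0.66]
scc_rhos = [0.76, 0.71, 0.80, 0.86, 0.79]
stat, p = wilcoxon(lm_rhos, scc_rhos)
print(f"Wilcoxon p={p:.3f}")
```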

3.4 Correlation Analysis of Change Types & Bugs

For the correlation analysis in the previous Section 3.3 we did not distinguish between the different categories of the change types. We treated them equally and related the total number of SCC to bugs. One advantage of SCC over pure line-based code churn is that we can determine the exact change operation down to statement level and assign it to the source code entity that actually changed. In this section we analyze the correlation between bugs and the categories we defined in Section 3.1. The goal is to see whether there are differences in how certain change types correlate with bugs.

Table 5 shows the correlation between the different categories and bugs for each project. We counted for each file of a project the number of changes within each category and the number of bugs, and related both numbers by correlation. Regarding their mean, the highest correlations with bugs have stmt, func, and mDecl. They furthermore exhibit values for some projects that are close to or above 0.7 and are considered strong, e.g., func for Resource or JDT Core; mDecl for Resource and JDT Core; stmt for JDT Debug UI and Debug UI. oState and cond still have substantial correlation on average, but their means are marginally above 0.5. cDecl and else have means below 0.5. With some exceptions, e.g., Compare, they show many correlation values below 0.5. This indicates that change types do correlate differently with bugs in our dataset. A Related Samples Friedman Test was significant at α = 0.05, rejecting the null hypothesis that the distributions of the correlation values of the SCC categories, i.e., the rows in Table 5, are the same. The Friedman Test operates on the mean ranks of related groups. We used this test because we repeatedly measured the correlations of the different categories on the same dataset, i.e., our related groups, and because it does not

Table 5: Non-parametric Spearman rank correlation of bugs and categories of SCC. * marks significant correlations at α = 0.01.

Eclipse Project  cDecl  oState  func   mDecl  stmt   cond   else
Compare          0.54*  0.61*   0.67*  0.61*  0.66*  0.55*  0.52*
jFace            0.41*  0.47*   0.57*  0.63*  0.66*  0.51*  0.48*
Resource         0.49*  0.62*   0.70*  0.73*  0.67*  0.49*  0.46*
Team Core        0.44*  0.43*   0.56*  0.52*  0.53*  0.36*  0.35*
CVS Core         0.39*  0.62*   0.66*  0.57*  0.72*  0.58*  0.56*
Debug Core       0.45*  0.55*   0.61*  0.51*  0.59*  0.45*  0.46*
Runtime          0.47*  0.58*   0.66*  0.61*  0.66*  0.55*  0.45*
JDT Debug        0.42*  0.45*   0.56*  0.55*  0.64*  0.46*  0.44*
jFace Text       0.50*  0.55*   0.54*  0.64*  0.62*  0.59*  0.55*
JDT Debug UI     0.46*  0.57*   0.62*  0.53*  0.74*  0.57*  0.54*
Update Core      0.63*  0.40*   0.43*  0.51*  0.45*  0.38*  0.39*
Debug UI         0.44*  0.50*   0.63*  0.60*  0.72*  0.54*  0.52*
Help             0.37*  0.43*   0.42*  0.43*  0.44*  0.36*  0.41*
OSGI             0.47*  0.60*   0.66*  0.65*  0.63*  0.57*  0.48*
JDT Core         0.39*  0.60*   0.69*  0.70*  0.67*  0.62*  0.60*
Mean             0.46   0.53    0.60   0.59   0.63   0.51   0.48

make any assumption about the distribution of the data and the sample size.

A Related Samples Friedman Test is a global test that only tests whether all of the groups differ. It does not tell anything about between which groups the difference occurs. However, the values in Table 5 show that, when comparing pairwise, some means are closer than others, for instance func vs. mDecl and func vs. cDecl. To test whether some pairwise groups differ more strongly than others, or do not differ at all, post-hoc tests are required. We performed a Wilcoxon Test and a Friedman Test on each pair. Figure 2 shows the results of the pairwise post-hoc tests. Dashed lines mean that both tests reject their H0, i.e., the row values of those two change types differ significantly; a straight line means both tests retain their H0, i.e., the row values of those change types do not differ significantly; a dotted line means only one test is significant, and it is difficult to say whether the values of these rows differ significantly.

When testing several post-hoc comparisons in the context of the result of a global test (the aforementioned Friedman Test), it is more likely that we commit a Type 1 Error when agreeing upon significance. In this case either the significance probability must be adjusted, i.e., raised, or the α-level must be adjusted, i.e., lowered [8]. For the post-hoc tests in Figure 2 we adjusted the α-level using the Bonferroni-Holm procedure [34]. In Figure 2 we can identify two groups where the categories are connected with a straight line among each other: (1) else, cond, oState, and cDecl, and (2) stmt, func, and mDecl. The correlation values of the change types within these groups do not differ significantly in our dataset. These findings are of more interest in the context of Table 2. Although func and mDecl occur much less frequently than stmt, they correlate evenly with bugs. The mass of rather small and local statement changes correlates as evenly as the changes of functionality and of method declarations, which occur relatively sparsely. The situation is different in the second group, where all change types occur with more or less the same relatively low frequency. We use the results

and insights of the correlation analysis in Section 3.5 and Section 3.6 when we build prediction models to investigate whether SCC and change types are adequate to predict bugs in our dataset.

*significant correlation at 0.01

20

±0.5 substantial, ±0.7 strong

Page 21: Keynote HotSWUp 2012

Predicting bug-prone files

Bug-prone vs. not bug-prone

Figure 2: Scatterplot between the number of bugs and the number of SCC on file level. Data points were obtained for the entire project history.

3.5 Predicting Bug- & Not Bug-Prone Files

The goal of H3 is to analyze if SCC can be used to discriminate between bug-prone and not bug-prone files in our dataset. We build models based on different learning techniques. Prior work states that some learners perform better than others. For instance, Lessmann et al. found with an extended set of various learners that Random Forest performs best on a subset of the NASA Metrics dataset. But they also state that performance differences between learners are marginal and not significant [20].

We used the following classification learners: Logistic Regression (LogReg), J48 (C4.5 Decision Tree), Random Forest (RndFor), and Bayesian Network (B-Net), implemented by the WEKA toolkit [36]; Exhaustive CHAID, a decision tree based on the chi-squared criterion, by SPSS 18.0; Support Vector Machine (LibSVM [7]); and Naive Bayes (N-Bayes) and Neural Nets (NN), both provided by the RapidMiner toolkit [24]. The classifiers calculate and assign a probability to a file being bug-prone or not bug-prone.

For each Eclipse project we binned files into bug-prone and not bug-prone using the median of the number of bugs per file (#bugs):

bugClass = not bug-prone  if #bugs <= median
bugClass = bug-prone      if #bugs > median

When using the median as cut point, the labeling of a file is relative to how many bugs other files in a project have. There exist several other ways of binning files; they mainly vary in that they result in different prior probabilities. For instance, Zimmermann et al. [40] and Bernstein et al. [4] labeled files as bug-prone if they had at least one bug. With heavily skewed distributions this approach may lead to a high prior probability toward one class. Nagappan et al. [28] used a statistical lower confidence bound. The different prior probabilities make the use of accuracy as a performance measure for classification difficult.
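The median split can be sketched in a few lines (file names and bug counts are invented):

```python
import statistics

# Hypothetical per-file bug counts for one project.
bug_counts = {"A.java": 0, "B.java": 2, "C.java": 1, "D.java": 7, "E.java": 0}

median = statistics.median(bug_counts.values())  # median of [0, 2, 1, 7, 0] is 1

# Files strictly above the median are labeled bug-prone, the rest not.
labels = {f: "bug-prone" if n > median else "not bug-prone"
          for f, n in bug_counts.items()}
print(labels)
```

Note how the label is relative: a file with one bug is "not bug-prone" here only because half of the project's files have at least one bug.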

As proposed in [20, 23] we therefore use the area under the receiver operating characteristic curve (AUC) as performance measure. AUC is independent of prior probabilities and is therefore a robust measure to assess the performance and accuracy of prediction models [4]. AUC can be seen as the probability that, when randomly choosing a bug-prone and a not bug-prone file, the trained model assigns a higher score to the bug-prone file [16].

Table 6: AUC values of E1 using logistic regression with LM and SCC as predictors for bug-prone and not bug-prone files. Larger values are printed in bold.

Eclipse Project   AUC LM  AUC SCC
Compare           0.84    0.85
jFace             0.90    0.90
JDT Debug         0.83    0.95
Resource          0.87    0.93
Runtime           0.83    0.91
Team Core         0.62    0.87
CVS Core          0.80    0.90
Debug Core        0.86    0.94
jFace Text        0.87    0.87
Update Core       0.78    0.85
Debug UI          0.85    0.93
JDT Debug UI      0.90    0.91
Help              0.75    0.70
JDT Core          0.86    0.87
OSGI              0.88    0.88
Median            0.85    0.90
Overall           0.85    0.89
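This probabilistic reading of AUC can be computed directly from classifier scores, without any ROC machinery; the scores below are invented:

```python
import itertools

def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen bug-prone file gets a
    higher score than a randomly chosen not bug-prone file (ties count 0.5)."""
    pairs = list(itertools.product(pos_scores, neg_scores))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

bug_prone = [0.9, 0.8, 0.75, 0.4]      # hypothetical scores of bug-prone files
not_bug_prone = [0.3, 0.5, 0.2, 0.1]   # hypothetical scores of the others
print(auc(bug_prone, not_bug_prone))   # -> 0.9375
```

The single misranked pair (0.4 vs. 0.5) costs 1/16, which is why the result is 15/16.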

We performed two bug-prone vs. not bug-prone classification experiments. In experiment 1 (E1) we used logistic regression, once with the total number of LM and once with the total number of SCC as predictor. E1 investigates H3 (SCC can be used to discriminate between bug-prone and not bug-prone files) and, in addition, whether SCC is a better predictor than code churn based on LM.

Secondly, in experiment 2 (E2) we used the above-mentioned classifiers and the number of each category of SCC defined in Section 3.1 as predictors. E2 investigates whether change types are good predictors and whether the additional type information yields better results than E1, where the type of a change is neglected. In the following we discuss the results of both experiments.

Experiment 1: Table 6 lists the AUC values of E1 for each project in our dataset. The models were trained using 10-fold cross validation, and the AUC values were computed when reapplying a learned model to the dataset it was obtained from. Overall denotes the model that was learned when merging all files of the projects into one larger dataset. SCC achieves a very good performance with a median of 0.90; more than half of the projects have AUC values equal to or higher than 0.90. This means that logistic regression using SCC as predictor ranks bug-prone files higher than not bug-prone ones with a probability of 90%. Even Help, which has the lowest value, is still within the range of 0.7 that Lessmann et al. call "promising results" [20]. This low value is accompanied by the smallest correlation of SCC, 0.48, in Table 4. The good performance of logistic regression and SCC is confirmed by an AUC value of 0.89 when learning from the entire dataset. With a value of 0.004, the AUC of SCC has a low variance over all projects, indicating consistent models. Based on the results of E1 we accept H3: SCC can be used to discriminate between bug-prone and not bug-prone files.

With a median of 0.85, LM shows a lower performance than SCC. Help is the only case where LM is a better predictor than SCC. This is not surprising, as it is the project that yields the largest difference in favor of LM in Table 4. In general, the correlation values in Table 4 reflect the picture given by the AUC values. For instance, jFace Text and JDT Debug UI, which exhibit similar correlations, performed nearly equally. […]
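A sketch of the E1 setup, assuming scikit-learn and synthetic per-file data. This is not the paper's pipeline, only its shape: a single change-count predictor, a median-split label, and 10-fold cross-validated AUC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Synthetic per-file change counts; more changes tend to mean more bugs.
scc = rng.gamma(shape=2.0, scale=15.0, size=n)
bugs = rng.poisson(scc / 10)
y = (bugs > np.median(bugs)).astype(int)    # median split: bug-prone or not

auc = cross_val_score(LogisticRegression(), scc.reshape(-1, 1), y,
                      cv=10, scoring="roc_auc").mean()
print(f"mean AUC over 10 folds: {auc:.2f}")
```

With a single predictor, logistic regression effectively learns a threshold on SCC; AUC then measures how well change volume alone ranks files.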

21

Page 22: Keynote HotSWUp 2012

H2: SCC can predict bug-prone files


22

SCC outperforms LM

Page 23: Keynote HotSWUp 2012

Predicting the number of bugs

Nonlinear regression with asymptotic model:

23

[Scatterplot: #Bugs (y-axis, 0 to 60) vs. #SCC (x-axis, 0 to 4000) for Team Core, with fitted asymptotic curve]

#Bugs = b1 + b2 * e^(b3 * #SCC)
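Such a model can be fit with standard nonlinear least squares; here is a sketch assuming SciPy, with invented "true" coefficients and synthetic data:

```python
import numpy as np
from scipy.optimize import curve_fit

def asymptotic(scc, b1, b2, b3):
    """Asymptotic model: bugs approach b1 as SCC grows (b2 < 0, b3 < 0)."""
    return b1 + b2 * np.exp(b3 * scc)

# Synthetic Team-Core-like data; the coefficients 60, -60, -0.002 are invented.
scc = np.linspace(0, 4000, 50)
rng = np.random.default_rng(1)
bugs = asymptotic(scc, 60.0, -60.0, -0.002) + rng.normal(0, 2, 50)

(b1, b2, b3), _ = curve_fit(asymptotic, scc, bugs, p0=(50.0, -50.0, -0.001))
print(round(b1, 1), round(b2, 1), round(b3, 4))
```

The starting values `p0` matter here: with a positive `b3` guess the exponential explodes for large SCC and the optimizer may fail to converge.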

Page 24: Keynote HotSWUp 2012

H3: SCC can predict the number of bugs

Table 8: Results of the nonlinear regression in terms of R² and Spearman correlation using LM and SCC as predictors.

Project        R²(LM)  R²(SCC)  Spearman(LM)  Spearman(SCC)
Compare        0.84    0.88     0.68          0.76
jFace          0.74    0.79     0.74          0.71
JDT Debug      0.69    0.68     0.62          0.80
Resource       0.81    0.85     0.75          0.86
Runtime        0.69    0.72     0.66          0.79
Team Core      0.26    0.53     0.15          0.66
CVS Core       0.76    0.83     0.62          0.79
Debug Core     0.88    0.92     0.63          0.78
jFace Text     0.83    0.89     0.75          0.74
Update Core    0.41    0.48     0.43          0.62
Debug UI       0.70    0.79     0.56          0.81
JDT Debug UI   0.82    0.82     0.80          0.81
Help           0.66    0.67     0.54          0.84
JDT Core       0.69    0.77     0.70          0.74
OSGI           0.51    0.80     0.74          0.77
Median         0.70    0.79     0.66          0.77
Overall        0.65    0.72     0.62          0.74

of the models, i.e., an accompanying increase/decrease of the actual and the predicted number of bugs.

With an average R²(LM) of 0.7, LM has less explanatory power compared to SCC using an asymptotic model. Except for the case of JDT Debug UI, which has equal values, LM performs lower than SCC for all projects, including Overall. The Related Samples Wilcoxon Signed-Ranks Test on the R² values of LM and SCC in Table 8 was significant, denoting that the observed differences in our dataset are significant.

To assess the validity of a regression model one must pay attention to the distribution of the error terms. Figure 3 shows two examples of fit plots with normalized residuals (y-axis) and predicted values (x-axis) from our dataset: the plot of the regression model of the Overall dataset on the left side, and the one of Debug Core, which has the highest R²(SCC) value, on the right side. On the left side, one can spot a funnel, which is one of the "archetypes" of residual plots and indicates that the constant-variance assumption may be violated, i.e., the variability of the residuals is larger for larger predicted values of SCC [19]. This is an example of a model that shows an adequate performance, i.e., R²(SCC) of 0.72, but whose validity is questionable. On the right side, there is a first sign of the funnel pattern, but it is not as evident as on the left side. The lower part of Figure 3 shows the corresponding histogram charts of the residuals. They are normally distributed with a mean of 0.

Therefore, we accept H3: SCC (using asymptotic nonlinear regression) achieves better performance when predicting the number of bugs within files than LM. However, one must be careful to investigate whether the models violate the assumptions of the general regression model. We analyzed all residual plots of our dataset and found that the constant-variance assumption may be generally problematic, in particular when analyzing software measures and open source systems that show highly skewed distributions. The other two assumptions concerning the error terms, i.e., zero mean and independence, are not violated. When using regression strictly for descriptive and prediction purposes only, as is the case for our experiments, these assumptions are less important, since the regression will still result in an unbiased estimate between the dependent and independent variables [19]. However, when inference based on the obtained regression models is made, e.g., conclusions about the slope

Figure 3: Fit plots of the Overall dataset (left) and Debug Core (right) with normalized residuals on the y-axis and the predicted values on the x-axis. Below are the corresponding histograms of the residuals.

(β coefficients) or the significance of the entire model itself, the assumptions must be verified.

3.6 Summary of Results

The results of our empirical study can be summarized as follows:

SCC correlates strongly with bugs. With an average Spearman rank correlation of 0.77, SCC has a strong correlation with the number of bugs in our dataset. Statistical tests indicated that the correlation between SCC and bugs is significantly higher than between LM and bugs (accepted H1).

SCC categories correlate differently with bugs. Except for cDecl, all SCC categories defined in Section 3.1 correlate substantially with bugs. A Friedman Test revealed that the categories have significantly different correlations. Post-hoc comparisons confirmed that the difference is mainly due to two groups of categories: (1) stmt, func, and mDecl, and (2) else, cond, oState, and cDecl. Within these groups the post-hoc tests were not significant.

SCC is a strong predictor for classifying source files into bug-prone and not bug-prone. Models built with logistic regression and SCC as predictor rank bug-prone files higher than not bug-prone ones with an average probability of 90%. They have a significantly better performance in terms of AUC than logistic regression models built with LM as predictor (accepted H2).

In a series of experiments with different classifiers using SCC categories as independent variables, LibSVM yielded the best performance; it was the best classifier for more than half of the projects. LibSVM was closely followed by B-Net, RndFor, N-Bayes, and NN. Decision tree learners resulted in a significantly lower performance. Furthermore, using categories, e.g., func, rather than the total number of SCC did not yield better performance.

24

SCC outperforms LM

Page 25: Keynote HotSWUp 2012

Summary of results

SCC performs significantly better than LM

Advanced learners are not always better

Change types do not yield extra discriminatory power

Predicting the number of bugs is “possible”

More information: “Comparing Fine-Grained Source Code Changes And Code Churn For Bug Prediction”, MSR 2011

25

Page 26: Keynote HotSWUp 2012

What is next?

Analysis of the effect(s) of changes

What is the effect on the design?

What is the effect on the quality?

Ease understanding of changes

Recommender techniques

Models that can provide feedback on the effects

26

Page 27: Keynote HotSWUp 2012

27

Page 28: Keynote HotSWUp 2012

Can developer-module networks predict failures?

Joint work with Nachi Nagappan, Brendan Murphy
Microsoft Research

Page 29: Keynote HotSWUp 2012

Research question

29

Are binaries with fragmented contributions from many developers more likely to have post-release failures?

Should developers focus on one thing?

Page 30: Keynote HotSWUp 2012

Study with MS Vista project

Data: Released in January 2007

> 4 years of development

Several thousand developers

Several thousand binaries (*.exe, *.dll)

Several millions of commits

30

Page 31: Keynote HotSWUp 2012

[Diagram: developers Alice, Bob, Dan, Eric, Fu, Go, Hin contributing to binaries a, b, c]

Approach in a nutshell

31

Change Logs + Bugs → Regression Analysis → Validation with data splitting

[Diagram: contribution network with commit counts on the edges between developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) and binaries (a, b, c)]

Binary  #bugs  centrality
a       12     0.9
b       7      0.5
c       3      0.2

Page 32: Keynote HotSWUp 2012

Contribution network

32

[Diagram: contribution network linking developers Alice, Bob, Dan, Eric, Fu, Go, Hin to binaries a, b, c]

Windows binary (*.dll)

Developer

Which binary is failure-prone?

Page 33: Keynote HotSWUp 2012

Measuring fragmentation

33

[Three copies of the contribution network (developers Alice, Bob, Dan, Eric, Fu, Go, Hin; binaries a, b, c), each illustrating one fragmentation measure:]

Freeman degree

Bonacich’s power

Closeness
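Two of these fragmentation measures, Freeman degree and closeness, are simple enough to compute by hand on a toy version of the slides' network (the edge set below is invented for illustration):

```python
from collections import deque

# Toy contribution network: developer-binary edges, names from the slides.
edges = [("Alice", "a"), ("Bob", "a"), ("Bob", "b"), ("Dan", "b"),
         ("Eric", "b"), ("Fu", "c"), ("Go", "c"), ("Hin", "c"), ("Dan", "c")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degree(node):
    """Freeman degree: number of direct ties."""
    return len(adj[node])

def closeness(node):
    """Normalized closeness: (n - 1) / sum of BFS distances to all nodes."""
    dist, queue = {node: 0}, deque([node])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return (len(dist) - 1) / sum(dist.values())

for b in ["a", "b", "c"]:
    print(b, degree(b), round(closeness(b), 2))
```

Binary c has the most contributors (highest degree), but b sits between the two clusters, so it is the most central by closeness.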

Page 34: Keynote HotSWUp 2012

Research hypotheses

34

H1 Binaries with fragmented contributions are failure-prone

H2 Fragmentation correlates positively with the number of post-release failures

H3 Advanced fragmentation measures improve failure estimation

Page 35: Keynote HotSWUp 2012

Correlation analysis

35

             nrCommits  nrAuthors  Power  dPower  Closeness  Reach  Betweenness
Failures     0.700      0.699      0.692  0.740   0.747      0.746  0.503
nrCommits               0.704      0.996  0.773   0.748      0.732  0.466
nrAuthors                          0.683  0.981   0.914      0.944  0.830
Power                                     0.756   0.732      0.714  0.439
dPower                                            0.943      0.964  0.772
Closeness                                                    0.990  0.738
Reach                                                               0.773

Spearman rank correlation

All correlations are significant at the 0.01 level (2-tailed)

Page 36: Keynote HotSWUp 2012

H1: Predicting failure-prone binaries

36

Binary logistic regression of 50 random splits

4 principal components from 7 centrality measures

[Boxplots over the 50 random splits, y-axis 0.50 to 1.00:]

Precision Recall AUC
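A sketch of this experiment's shape, assuming scikit-learn and synthetic data: seven correlated "centrality" measures are reduced to four principal components, then a binary logistic regression is evaluated over 50 random splits:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
latent = rng.normal(size=(n, 1))             # shared "fragmentation" signal
X = latent + 0.3 * rng.normal(size=(n, 7))   # 7 highly correlated measures
y = (latent[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # failure-prone?

model = make_pipeline(StandardScaler(), PCA(n_components=4), LogisticRegression())
aucs = []
for seed in range(50):                       # 50 random splits, 1/3 held out
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3, random_state=seed)
    model.fit(Xtr, ytr)
    aucs.append(roc_auc_score(yte, model.predict_proba(Xte)[:, 1]))
print(f"mean AUC over 50 splits: {np.mean(aucs):.2f}")
```

The PCA step exists because the seven measures are nearly collinear (see the correlation table on slide 35); a handful of components carries almost all of their variance.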

Page 37: Keynote HotSWUp 2012

H2: Predicting the number of failures

37

[Boxplots over the 50 random splits, y-axis 0.50 to 1.00:]

R-Square Pearson Spearman

Linear regression of 50 random splits

#Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
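The model above can be fit by ordinary least squares; a sketch on synthetic data (all coefficients and distributions invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
# Synthetic per-binary measures; commits scale with authors, as in real data.
closeness = rng.uniform(0, 1, n)
authors = rng.integers(1, 30, n).astype(float)
commits = authors * rng.uniform(5, 15, n)
failures = (1 + 8 * closeness + 0.3 * authors + 0.01 * commits
            + rng.normal(0, 1, n))

# Design matrix [1, nCloseness, nrAuthors, nrCommits]; beta = (b0, b1, b2, b3).
X = np.column_stack([np.ones(n), closeness, authors, commits])
beta, *_ = np.linalg.lstsq(X, failures, rcond=None)
print(np.round(beta, 2))
```

Because authors and commits are correlated by construction, their individual coefficients are less stable than the closeness coefficient, a small echo of the collinearity visible in slide 35.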

Page 38: Keynote HotSWUp 2012

H3: Basic vs. advanced measures

38

[Boxplots over the 50 random splits, y-axis 0.30 to 1.00: R-Square and Spearman]

Model with nrAuthors, nrCommits

Model with nCloseness, nrAuthors, nrCommits

Page 39: Keynote HotSWUp 2012

Summary of results

39

Centrality measures can predict more than 83% of failure-prone Vista binaries

Closeness, nrAuthors, and nrCommits can predict the number of post-release failures

Closeness or Reach can improve prediction of the number of post-release failures by 32%

More information: Can Developer-Module Networks Predict Failures?, FSE 2008

Page 40: Keynote HotSWUp 2012

What can we learn from that?

Increase testing effort for central binaries? - yes

Re-factor central binaries? - maybe

Re-organize contributions? - maybe

40

[Diagram: contribution network with commit counts between developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) and binaries (a, b, c)]

Page 41: Keynote HotSWUp 2012

What is next?

Analysis of the contributions of a developer

Who is working on which parts of the system?

What exactly is the contribution of a developer?

Who is introducing bugs/smells and how can we avoid it?

Global distributed software engineering

What are the contributions and smells of teams, and how can we avoid them?

Can we empirically prove Conway’s Law?

Expert recommendation

Whom to ask for advice on a piece of code?

41

Page 42: Keynote HotSWUp 2012

42

Ideas for software upgrade research

1. Mining software repositories to identify the upgrade-critical components

What are the characteristics of such components?

Product and process measures

What are the characteristics of the target environments?

Hardware, operating system, configuration

Train a model with these characteristics and reported bugs

Page 43: Keynote HotSWUp 2012

Further ideas for research

Who is upgrading which applications when?

Study upgrade behavior of users?

What is the environment of the users when they upgrade?

Where did it work, where did it fail?

Collect crash reports for software upgrades?

Upgrades in distributed applications?

Finding the optimal time when to upgrade which component?

43

Page 44: Keynote HotSWUp 2012

Conclusions

44

[Recap figures: the #Bugs vs. #SCC scatterplot with asymptotic fit for Team Core, and the developer-binary contribution network (Alice, Bob, Dan, Eric, Fu, Go, Hin; a, b, c)]

Questions?

Martin [email protected]