Can we induce change with what we measure?

Data-driven software engineering @Microsoft Michaela Greiler


DESCRIPTION

Tom DeMarco states that “You can’t control what you can’t measure”, but how much can we change and control (with) what we measure? This talk investigates the opportunities and limits of data-driven software engineering: the opportunities that lie ahead of us when we mine and analyze software engineering process data, and the important factors that influence the success and adoption of data-based improvement approaches.

TRANSCRIPT

Page 1: Can we induce change with what we measure?

Data-driven software engineering @Microsoft

Michaela Greiler

Page 2: Can we induce change with what we measure?

Data-driven software engineering @Microsoft

• How can we optimize the testing process?

• Do code reviews make a difference?

• Are coding velocity and quality always a tradeoff?

• What’s the optimal way to organize work on a large team?

MSR Redmond/TSE: Michaela Greiler, Jacek Czerwonka, Wolfram Schulte, Suresh Thummalapenta

MSR Redmond: Christian Bird, Kathryn McKinley, Nachi Nagappan, Thomas Zimmermann

MSR Cambridge: Brendan Murphy, Kim Herzig

Page 3: Can we induce change with what we measure?

[Chart: code coverage of check-ins, monthly from November 2010 to October 2013; stacked percentages (0–100%) of check-ins that were completely covered, somewhat covered, and not covered.]

Page 4: Can we induce change with what we measure?

Reviewer recommendation: Does experience matter?

Page 5: Can we induce change with what we measure?

Can we change with what we can measure?

Michaela Greiler

Page 6: Can we induce change with what we measure?

YES

Page 7: Can we induce change with what we measure?

YES

that’s the danger!

Page 8: Can we induce change with what we measure?

[Bar charts. “What is measured?”: number of bugs per developer (Carl, Lisa, Rob, Danny). “What is changed?”: number of bugs and code quality per developer.]


Page 10: Can we induce change with what we measure?

SOCIO-TECHNICAL CONGRUENCE

“Design and programming are human activities; forget that and all is lost” – Bjarne Stroustrup

Page 11: Can we induce change with what we measure?

So should we go without any measurements?

Page 12: Can we induce change with what we measure?

No.

Lessons learned: Data Collection → Interpretation → Usage.

Garbage in, garbage out!

Page 13: Can we induce change with what we measure?

• What is CodeMine? What data does CodeMine have?

Page 14: Can we induce change with what we measure?

GQM vs. opportunistic data collection

• Easily available ≠ what’s needed

• Determine the data that is actually needed

• Find proxy measures if needed

• Know the analysis before collecting the data; otherwise the data is not usable for the intended purpose

• Goal – Question – Metric

• Check for completeness, cleanness/noise, and usefulness

• Data background: How was the data generated? Why was it generated? Who consumes the data? What about outliers? How was the data processed?
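As a toy illustration of GQM thinking, here is a minimal sketch that encodes a goal, its questions, and candidate metrics as a plain data structure. The goal and metric names are hypothetical, loosely based on the testing-optimization goal discussed later in this talk, not an actual Microsoft GQM plan.

```python
# Hypothetical Goal-Question-Metric breakdown; names are illustrative.
gqm = {
    "goal": "Reduce test cost without sacrificing code quality",
    "questions": [
        {"question": "How often does a test suite find real code issues?",
         "metrics": ["true-positive rate per test suite"]},
        {"question": "How often does a test suite raise false alarms?",
         "metrics": ["false-alarm rate per test suite",
                     "triage time per failed execution"]},
    ],
}

# Walking the tree makes explicit which data must be collected and why,
# before any collection starts.
for q in gqm["questions"]:
    print(q["question"], "->", ", ".join(q["metrics"]))
```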

Page 15: Can we induce change with what we measure?

Interpretation needs domain knowledge

Page 16: Can we induce change with what we measure?

Tools, processes, practices, and policies:

• Release schedule and time (milestones M1, M2, Beta)

• Engineers: What roles exist? Who does what? Responsibilities?

• Organization of code bases

• Team structure and culture

Page 17: Can we induce change with what we measure?

You cannot compare 1:1

Page 18: Can we induce change with what we measure?

Engineers want to understand the nitty-gritty

• How do you calculate the recommended reviewers?

• Why was that person recommended?

• Why is Lisa not recommended?
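To make the nitty-gritty concrete, here is a minimal sketch of one plausible recommendation heuristic: rank candidates by how often they previously touched the changed files. This is an illustrative assumption, not the actual recommendation algorithm used at Microsoft; all names and data are hypothetical.

```python
# Illustrative reviewer-recommendation heuristic (hypothetical, not the
# production algorithm): score each person by past edits/reviews of the
# files touched by the change, then recommend the top-k scorers.
from collections import Counter

def recommend_reviewers(changed_files, file_history, k=3):
    """file_history maps file path -> list of people who touched it."""
    scores = Counter()
    for path in changed_files:
        scores.update(file_history.get(path, []))
    return [person for person, _ in scores.most_common(k)]

history = {"parser.cs": ["lisa", "rob", "lisa"], "lexer.cs": ["rob"]}
print(recommend_reviewers(["parser.cs", "lexer.cs"], history))  # ['lisa', 'rob']
```

With a transparent scoring rule like this, questions such as “Why was that person recommended?” or “Why is Lisa not recommended?” can be answered directly from the per-file scores.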

Page 19: Can we induce change with what we measure?

Simplicity first

Ownership metric: the proportion of all edits to a file made by the contributor with the most edits.

• Files without bugs: the main contributor made > 50% of all edits

• Files with bugs: the main contributor made < 60% of all edits

Reporting vs. prediction. Comprehension vs. automation.

If you can do it with a decision tree… do it…
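A minimal sketch of this simplicity-first idea, using hypothetical edit logs: compute the ownership metric and apply a single threshold, which is effectively a one-node decision tree. The 50% cut-off follows the slide; the file names and histories are made up.

```python
# Ownership metric from the slide: share of all edits made by the
# file's top contributor. Data and the 0.5 threshold are illustrative.
from collections import Counter

def ownership(edit_log):
    """edit_log: one contributor name per edit to a file."""
    counts = Counter(edit_log)
    _, top_edits = counts.most_common(1)[0]
    return top_edits / len(edit_log)

edits_a = ["lisa"] * 8 + ["rob"] * 2          # ownership 0.80
edits_b = ["carl", "lisa", "rob", "danny"]    # ownership 0.25

for name, log in [("file_a.cs", edits_a), ("file_b.cs", edits_b)]:
    o = ownership(log)
    flag = "higher bug risk" if o < 0.5 else "lower bug risk"
    print(f"{name}: ownership={o:.2f} -> {flag}")
```

A rule this simple is easy to report to engineers and easy to question, which is exactly the comprehension-over-automation tradeoff the slide names.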

Page 20: Can we induce change with what we measure?

An iterative process with very close involvement of product teams and domain experts.

It’s a dialog; it’s a back and forth.

Page 21: Can we induce change with what we measure?

Mixed Method Research

A research approach or methodology

• for questions that call for real-life contextual understandings;

• employing rigorous quantitative research assessing the magnitude and frequency of constructs, and

• rigorous qualitative research exploring the meaning and understanding of constructs.

Dr. Margaret-Anne Storey, Professor of Computer Science, University of Victoria

All methods are inherently flawed! Every method trades off generalizability, precision, and realism.

Dr. Arie van Deursen, Professor of Software Engineering, Delft University of Technology

Page 22: Can we induce change with what we measure?

Recommended reading: Foundations of Mixed Methods Research; Designing Social Inquiry.

Qualitative research methods within mixed method research:

• Interviews

• Observations

• Focus groups

• Contextual inquiry

• Grounded theory

• …

Page 23: Can we induce change with what we measure?

A Grounded Theory Study

A systematic procedure to discover a theory from (qualitative) data (Glaser and Strauss).

S. Adolph, W. Hall, P. Kruchten. Using grounded theory to study the experience of software development. Empirical Software Engineering, 2011.

B. Glaser and J. Holton. Remodeling grounded theory. Forum: Qualitative Social Research, 2004.

Page 24: Can we induce change with what we measure?

Deductive versus inductive

A deductive approach is concerned with developing a hypothesis (or hypotheses) based on existing theory, and then designing a research strategy to test the hypothesis (Wilson, 2010, p. 7).

An inductive approach starts with observations; theories emerge towards the end of the research, as a result of careful examination of patterns in the observations (Goddard and Melville, 2004).

Deductive: Theory → Hypotheses → Observation → Confirm/Reject

Inductive: Observation → Patterns → Theory

Page 25: Can we induce change with what we measure?

All models are wrong but some are useful (George E. P. Box )

Page 26: Can we induce change with what we measure?

Theo: Test Effectiveness Optimization from History

Kim Herzig*, Michaela Greiler+, Jacek Czerwonka+, Brendan Murphy*

*Microsoft Research, Cambridge   +Microsoft Corporation, US

Page 27: Can we induce change with what we measure?

Improving Development Processes

Product / Service: legacy changes, new product features, technology changes.

Development Environment: speed, cost, and quality/risk (should be well balanced).

Microsoft aims for shorter release cycles.

Empirical data to support & drive decisions:

• Speed up development processes (e.g., code velocity)

• More frequent releases

• Maintaining / increasing product quality

Joint effort by MSR & product teams:

• MSR Cambridge: Brendan Murphy, Kim Herzig

• TSE Redmond: Jacek Czerwonka, Michaela Greiler

• MSR Redmond: Tom Zimmermann, Chris Bird, Nachi Nagappan

• Windows, Windows Phone, Office, Dynamics product teams

Page 28: Can we induce change with what we measure?

Software Testing for Windows

{Simplified illustration} Branch hierarchy over time: multiple component branches flow into multiple area branches, which flow into a development branch and finally into winmain (the main branch). Each integration passes a quality gate: component testing, then system & component testing, then system testing.

Software testing is very expensive:

• Thousands of test suites and millions of test cases are executed

• On different branches, architectures, languages, etc.

• We tend to repeat the same tests over and over again

The current process aims for maximal protection:

• It aims to find code issues as early as possible

• At the cost of slower product development

The actual problem:

• Too many false alarms (failures due to test and infrastructure issues)

• Each test failure slows down product development

Page 29: Can we induce change with what we measure?

Software Testing for Office

The same applies here: software testing is very expensive, and the current process aims for maximal protection at the cost of slower product development (see the previous slide).

But Office has a different:

• Branching structure

• Development process

• Testing process

• Release schedules

• …

{Simplified illustration} Testing pipeline: dev inner loop → BVT and CVT on main → dog food.

Page 30: Can we induce change with what we measure?

Goal

Reduce the number of test executions …

… without sacrificing code quality

Dynamic, self-adaptive optimization model

Page 31: Can we induce change with what we measure?

Solution

Reduce the number of test executions …

• Run every test at least once before integrating a code change into the main branch (e.g., winmain).

• We eventually find all code issues, but take the risk of finding them later (on higher-level branches).

… without sacrificing code quality

Test executions range from high cost, unknown value ($$$$$) and high cost, low value ($$$$) to low cost, low value ($) and low cost, good value ($$).

How likely is a test causing 1) false positives or 2) finding code issues?

Analyze historic data: test events, builds, code integrations.

Analyze past test results: passing tests, false alarms, detected code issues.
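As a minimal sketch of that last step, assuming a per-test execution history with three hypothetical outcome labels, the two probabilities can be estimated as simple relative frequencies:

```python
# Estimate a test's false-alarm and bug-finding probabilities from its
# execution history. Outcome labels and counts are hypothetical; the
# real model may weight by recency, branch level, context, etc.
def estimate_probs(history):
    """history: outcomes per execution: 'pass', 'false_alarm'
    (test/infrastructure issue), or 'code_issue' (real defect found)."""
    n = len(history)
    prob_fp = history.count("false_alarm") / n  # Prob_FP
    prob_tp = history.count("code_issue") / n   # Prob_TP
    return prob_fp, prob_tp

history = ["pass"] * 90 + ["false_alarm"] * 8 + ["code_issue"] * 2
prob_fp, prob_tp = estimate_probs(history)
print(f"Prob_FP={prob_fp:.2f}, Prob_TP={prob_tp:.2f}")  # 0.08, 0.02
```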

Page 32: Can we induce change with what we measure?

Bug finding capabilities change with context

Page 33: Can we induce change with what we measure?

Solution

Using a cost function to model risk: a test's execution cost is weighed against the cost of skipping it.

$\mathit{Cost}_{Execution} > \mathit{Cost}_{Skip}$ ? suspend : execute test

$\mathit{Cost}_{Execution} = \mathit{Cost}_{Machine/Time} \cdot \mathit{Time}_{Execution} + \underbrace{\mathit{Prob}_{FP} \cdot \mathit{Cost}_{Developer/Time} \cdot \mathit{Time}_{Triage}}_{\text{cost of a potential false alarm}}$

$\mathit{Cost}_{Skip} = \underbrace{\mathit{Prob}_{TP} \cdot \mathit{Cost}_{Developer/Time} \cdot \mathit{Time}_{Freeze(branch)} \cdot \#\mathit{Developers}_{Branch}}_{\text{potential cost of finding the defect later}}$

For each test: the cost to run it versus the value of its output.
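A direct transcription of this cost model into code, with made-up rates and times (a sketch of the decision rule, not Theo's actual parameters):

```python
# Skip-or-execute decision from the cost model above.
# All monetary rates and durations below are hypothetical placeholders.
def cost_execution(machine_rate, exec_hours, prob_fp, dev_rate, triage_hours):
    # Machine time plus the expected cost of triaging a false alarm.
    return machine_rate * exec_hours + prob_fp * dev_rate * triage_hours

def cost_skip(prob_tp, dev_rate, freeze_hours, devs_on_branch):
    # Expected cost of finding the defect later: a frozen branch
    # blocks every developer working on it.
    return prob_tp * dev_rate * freeze_hours * devs_on_branch

def decide(prob_fp, prob_tp):
    c_exec = cost_execution(machine_rate=0.5, exec_hours=2.0,
                            prob_fp=prob_fp, dev_rate=75, triage_hours=1.5)
    c_skip = cost_skip(prob_tp=prob_tp, dev_rate=75,
                       freeze_hours=4.0, devs_on_branch=30)
    return "suspend" if c_exec > c_skip else "execute test"

print(decide(prob_fp=0.08, prob_tp=0.02))  # with these numbers: execute test
```

Note how the asymmetry falls out of the model: skipping is only worthwhile when a test's bug-finding probability is low relative to its false-alarm rate and execution cost.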

Page 34: Can we induce change with what we measure?

Current Results

Simulated on the Windows 8.1 development period (BVT only)

Page 35: Can we induce change with what we measure?

Dynamic, Self-Adaptive

Decision points are connected to each other: skipping tests influences the risk factors of higher-level branches.

We re-enable tests if code quality drops (e.g., in a different milestone). A sketch of this rule follows the chart below.

[Chart: relative test reduction rate (0% to 70%) over time during Windows 8.1 development, with the training period marked at the start.]
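A hedged sketch of that self-adaptive behavior, with invented signals and thresholds (the production model's risk accounting is more involved):

```python
# Two pieces of the self-adaptive loop, with hypothetical signals.
def propagate_risk(branch_risk, parent_branch, skipped_test_risk):
    """Skipping a test shifts its residual risk to the parent branch,
    so decision points up the branch hierarchy stay connected."""
    branch_risk[parent_branch] = branch_risk.get(parent_branch, 0.0) + skipped_test_risk
    return branch_risk

def should_reenable(escaped_defects_recently, threshold=3):
    """Re-enable suspended tests once too many defects escape, e.g.
    when a new milestone changes the quality profile of check-ins."""
    return escaped_defects_recently > threshold

risk = propagate_risk({}, "winmain", skipped_test_risk=0.02)
print(risk, should_reenable(escaped_defects_recently=5))  # {'winmain': 0.02} True
```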

Page 36: Can we induce change with what we measure?

Bug Finding Performance of Tests

How many test executions fail? [Chart: number of test executions and number of failed test executions, per branch level.]

How many of the failed test executions result in bug reports? [Chart: split of failed executions into false positives (FP), test-unspecific true positives (TP), and test-specific true positives (TP), per branch level.]

Page 37: Can we induce change with what we measure?

Impact on Development Process

Secondary improvements:

• Machine setup: we may lower the number of machines allocated to the testing process

• Developer satisfaction: removing false test failures increases confidence in the testing process

…the speed improvement is hard to estimate through simulation.

“We used the data […] to cut a bunch of bad content and are running a much leaner BVT system […] we’re panning out to scale about 4x and run in well under 2 hours” (Jason Means, Windows BVT PM)

Page 38: Can we induce change with what we measure?

Michaela Greiler (@mgreiler)

www.michaelagreiler.com

http://research.microsoft.com/en-us/projects/tse/