Can we induce change with what we measure?

Data-driven software engineering @Microsoft Michaela Greiler


DESCRIPTION

Tom DeMarco states that “You can’t control what you can’t measure”, but how much can we change and control (with) what we measure? This talk investigates the opportunities and limits of data-driven software engineering: the opportunities that lie ahead of us when we mine and analyze software engineering process data, and the important factors that influence the success and adoption of data-based improvement approaches.

TRANSCRIPT

Page 1: Can we induce change with what we measure?

Data-driven software engineering @Microsoft

Michaela Greiler

Page 2: Can we induce change with what we measure?

Data-driven software engineering @Microsoft

• How can we optimize the testing process?

• Do code reviews make a difference?

• Are coding velocity and quality always a tradeoff?

• What’s the optimal way to organize work on a large team?

MSR Redmond/TSE: Michaela Greiler, Jacek Czerwonka, Wolfram Schulte, Suresh Thummalapenta

MSR Redmond: Christian Bird, Kathryn McKinley, Nachi Nagappan, Thomas Zimmermann

MSR Cambridge: Brendan Murphy, Kim Herzig

Page 3: Can we induce change with what we measure?

[Chart: code coverage of check-ins, monthly from November 2010 to October 2013; stacked percentages (0–100%) of check-ins that were completely covered, somewhat covered, and not covered.]

Page 4: Can we induce change with what we measure?

Reviewer recommendation: Does experience matter?

Page 5: Can we induce change with what we measure?

Can we change with what we can measure?

Michaela Greiler

Page 6: Can we induce change with what we measure?

YES

Page 7: Can we induce change with what we measure?

YES

that’s the danger!

Page 8: Can we induce change with what we measure?

[Bar charts. “What is measured?”: number of bugs per developer (Carl, Lisa, Rob, Danny). “What is changed?”: number of bugs and code quality per developer.]


Page 10: Can we induce change with what we measure?

SOCIO-TECHNICAL CONGRUENCE

“Design and programming are human activities; forget that and all is lost” – Bjarne Stroustrup

Page 11: Can we induce change with what we measure?

So should we go without any measurements?

Page 12: Can we induce change with what we measure?

No.

Lessons learned: Data Collection → Interpretation → Usage.

Garbage in, garbage out!

Page 13: Can we induce change with what we measure?

• What is CodeMine? What data does CodeMine have?

Page 14: Can we induce change with what we measure?

GQM vs. opportunistic data collection

• Easily available ≠ what’s needed

• Determine the data that is actually needed

• Find proxy measures if needed

• Know the analysis before collecting the data; otherwise the data is not usable for the intended purpose

• Goal – Question – Metric

• Check for completeness, cleanness/noise, and usefulness

• Data background: How was the data generated? Why was it generated? Who consumes the data? What about outliers? How was the data processed?
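As a toy illustration of GQM thinking, here is a minimal sketch that encodes a goal, its questions, and candidate metrics as a plain data structure. The goal and metric names are hypothetical, loosely based on the testing-optimization goal discussed later in this talk, not an actual Microsoft GQM plan.

```python
# Hypothetical Goal-Question-Metric breakdown; names are illustrative.
gqm = {
    "goal": "Reduce test cost without sacrificing code quality",
    "questions": [
        {"question": "How often does a test suite find real code issues?",
         "metrics": ["true-positive rate per test suite"]},
        {"question": "How often does a test suite raise false alarms?",
         "metrics": ["false-alarm rate per test suite",
                     "triage time per failed execution"]},
    ],
}

# Walking the tree makes explicit which data must be collected and why,
# before any collection starts.
for q in gqm["questions"]:
    print(q["question"], "->", ", ".join(q["metrics"]))
```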

Page 15: Can we induce change with what we measure?

Interpretation needs domain knowledge

Page 16: Can we induce change with what we measure?

Tools, processes, practices, and policies:

• Release schedule and time (milestones M1, M2, Beta)

• Engineers: What roles exist? Who does what? Responsibilities?

• Organization of code bases

• Team structure and culture

Page 17: Can we induce change with what we measure?

You cannot compare 1:1

Page 18: Can we induce change with what we measure?

Engineers want to understand the nitty-gritty

• How do you calculate the recommended reviewers?

• Why was that person recommended?

• Why is Lisa not recommended?
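To make the nitty-gritty concrete, here is a minimal sketch of one plausible recommendation heuristic: rank candidates by how often they previously touched the changed files. This is an illustrative assumption, not the actual recommendation algorithm used at Microsoft; all names and data are hypothetical.

```python
# Illustrative reviewer-recommendation heuristic (hypothetical, not the
# production algorithm): score each person by past edits/reviews of the
# files touched by the change, then recommend the top-k scorers.
from collections import Counter

def recommend_reviewers(changed_files, file_history, k=3):
    """file_history maps file path -> list of people who touched it."""
    scores = Counter()
    for path in changed_files:
        scores.update(file_history.get(path, []))
    return [person for person, _ in scores.most_common(k)]

history = {"parser.cs": ["lisa", "rob", "lisa"], "lexer.cs": ["rob"]}
print(recommend_reviewers(["parser.cs", "lexer.cs"], history))  # ['lisa', 'rob']
```

With a transparent scoring rule like this, questions such as “Why was that person recommended?” or “Why is Lisa not recommended?” can be answered directly from the per-file scores.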

Page 19: Can we induce change with what we measure?

Simplicity first

Ownership metric: the proportion of all edits to a file made by the contributor with the most edits.

• Files without bugs: the main contributor made > 50% of all edits

• Files with bugs: the main contributor made < 60% of all edits

Reporting vs. prediction. Comprehension vs. automation.

If you can do it with a decision tree… do it…
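A minimal sketch of this simplicity-first idea, using hypothetical edit logs: compute the ownership metric and apply a single threshold, which is effectively a one-node decision tree. The 50% cut-off follows the slide; the file names and histories are made up.

```python
# Ownership metric from the slide: share of all edits made by the
# file's top contributor. Data and the 0.5 threshold are illustrative.
from collections import Counter

def ownership(edit_log):
    """edit_log: one contributor name per edit to a file."""
    counts = Counter(edit_log)
    _, top_edits = counts.most_common(1)[0]
    return top_edits / len(edit_log)

edits_a = ["lisa"] * 8 + ["rob"] * 2          # ownership 0.80
edits_b = ["carl", "lisa", "rob", "danny"]    # ownership 0.25

for name, log in [("file_a.cs", edits_a), ("file_b.cs", edits_b)]:
    o = ownership(log)
    flag = "higher bug risk" if o < 0.5 else "lower bug risk"
    print(f"{name}: ownership={o:.2f} -> {flag}")
```

A rule this simple is easy to report to engineers and easy to question, which is exactly the comprehension-over-automation tradeoff the slide names.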

Page 20: Can we induce change with what we measure?

An iterative process with very close involvement of product teams and domain experts.

It’s a dialog; it’s a back and forth.

Page 21: Can we induce change with what we measure?

Mixed Method Research

A research approach or methodology

• for questions that call for real-life contextual understandings;

• employing rigorous quantitative research assessing the magnitude and frequency of constructs, and

• rigorous qualitative research exploring the meaning and understanding of constructs.

Dr. Margaret-Anne Storey, Professor of Computer Science, University of Victoria

All methods are inherently flawed! Every method trades off generalizability, precision, and realism.

Dr. Arie van Deursen, Professor of Software Engineering, Delft University of Technology

Page 22: Can we induce change with what we measure?

Recommended reading: Foundations of Mixed Methods Research; Designing Social Inquiry.

Qualitative research methods within mixed method research:

• Interviews

• Observations

• Focus groups

• Contextual inquiry

• Grounded theory

• …

Page 23: Can we induce change with what we measure?

A Grounded Theory Study

A systematic procedure to discover a theory from (qualitative) data (Glaser and Strauss).

S. Adolph, W. Hall, P. Kruchten. Using grounded theory to study the experience of software development. Empirical Software Engineering, 2011.

B. Glaser and J. Holton. Remodeling grounded theory. Forum: Qualitative Social Research, 2004.

Page 24: Can we induce change with what we measure?

Deductive versus inductive

A deductive approach is concerned with developing a hypothesis (or hypotheses) based on existing theory, and then designing a research strategy to test the hypothesis (Wilson, 2010, p. 7).

An inductive approach starts with observations; theories emerge towards the end of the research, as a result of careful examination of patterns in the observations (Goddard and Melville, 2004).

Deductive: Theory → Hypotheses → Observation → Confirm/Reject

Inductive: Observation → Patterns → Theory

Page 25: Can we induce change with what we measure?

All models are wrong but some are useful (George E. P. Box )

Page 26: Can we induce change with what we measure?

Theo: Test Effectiveness Optimization from History

Kim Herzig*, Michaela Greiler+, Jacek Czerwonka+, Brendan Murphy*

*Microsoft Research, Cambridge   +Microsoft Corporation, US

Page 27: Can we induce change with what we measure?

Improving Development Processes

Product / Service: legacy changes, new product features, technology changes.

Development Environment: speed, cost, and quality/risk (should be well balanced).

Microsoft aims for shorter release cycles.

Empirical data to support & drive decisions:

• Speed up development processes (e.g., code velocity)

• More frequent releases

• Maintaining / increasing product quality

Joint effort by MSR & product teams:

• MSR Cambridge: Brendan Murphy, Kim Herzig

• TSE Redmond: Jacek Czerwonka, Michaela Greiler

• MSR Redmond: Tom Zimmermann, Chris Bird, Nachi Nagappan

• Windows, Windows Phone, Office, Dynamics product teams

Page 28: Can we induce change with what we measure?

Software Testing for Windows

{Simplified illustration} Branch hierarchy over time: multiple component branches flow into multiple area branches, which flow into a development branch and finally into winmain (the main branch). Each integration passes a quality gate: component testing, then system & component testing, then system testing.

Software testing is very expensive:

• Thousands of test suites and millions of test cases are executed

• On different branches, architectures, languages, etc.

• We tend to repeat the same tests over and over again

The current process aims for maximal protection:

• It aims to find code issues as early as possible

• At the cost of slower product development

The actual problem:

• Too many false alarms (failures due to test and infrastructure issues)

• Each test failure slows down product development

Page 29: Can we induce change with what we measure?

Software Testing for Office

The same applies here: software testing is very expensive, and the current process aims for maximal protection at the cost of slower product development (see the previous slide).

But Office has a different:

• Branching structure

• Development process

• Testing process

• Release schedules

• …

{Simplified illustration} Testing pipeline: dev inner loop → BVT and CVT on main → dog food.

Page 30: Can we induce change with what we measure?

Goal

Reduce the number of test executions …

… without sacrificing code quality

Dynamic, self-adaptive optimization model

Page 31: Can we induce change with what we measure?

Solution

Reduce the number of test executions …

• Run every test at least once before integrating a code change into the main branch (e.g., winmain).

• We eventually find all code issues, but take the risk of finding them later (on higher-level branches).

… without sacrificing code quality

Test executions range from high cost, unknown value ($$$$$) and high cost, low value ($$$$) to low cost, low value ($) and low cost, good value ($$).

How likely is a test causing 1) false positives or 2) finding code issues?

Analyze historic data: test events, builds, code integrations.

Analyze past test results: passing tests, false alarms, detected code issues.
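As a minimal sketch of that last step, assuming a per-test execution history with three hypothetical outcome labels, the two probabilities can be estimated as simple relative frequencies:

```python
# Estimate a test's false-alarm and bug-finding probabilities from its
# execution history. Outcome labels and counts are hypothetical; the
# real model may weight by recency, branch level, context, etc.
def estimate_probs(history):
    """history: outcomes per execution: 'pass', 'false_alarm'
    (test/infrastructure issue), or 'code_issue' (real defect found)."""
    n = len(history)
    prob_fp = history.count("false_alarm") / n  # Prob_FP
    prob_tp = history.count("code_issue") / n   # Prob_TP
    return prob_fp, prob_tp

history = ["pass"] * 90 + ["false_alarm"] * 8 + ["code_issue"] * 2
prob_fp, prob_tp = estimate_probs(history)
print(f"Prob_FP={prob_fp:.2f}, Prob_TP={prob_tp:.2f}")  # 0.08, 0.02
```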

Page 32: Can we induce change with what we measure?

Bug finding capabilities change with context

Page 33: Can we induce change with what we measure?

Solution

Using a cost function to model risk: a test's execution cost is weighed against the cost of skipping it.

$\mathit{Cost}_{Execution} > \mathit{Cost}_{Skip}$ ? suspend : execute test

$\mathit{Cost}_{Execution} = \mathit{Cost}_{Machine/Time} \cdot \mathit{Time}_{Execution} + \underbrace{\mathit{Prob}_{FP} \cdot \mathit{Cost}_{Developer/Time} \cdot \mathit{Time}_{Triage}}_{\text{cost of a potential false alarm}}$

$\mathit{Cost}_{Skip} = \underbrace{\mathit{Prob}_{TP} \cdot \mathit{Cost}_{Developer/Time} \cdot \mathit{Time}_{Freeze(branch)} \cdot \#\mathit{Developers}_{Branch}}_{\text{potential cost of finding the defect later}}$

For each test: the cost to run it versus the value of its output.
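A direct transcription of this cost model into code, with made-up rates and times (a sketch of the decision rule, not Theo's actual parameters):

```python
# Skip-or-execute decision from the cost model above.
# All monetary rates and durations below are hypothetical placeholders.
def cost_execution(machine_rate, exec_hours, prob_fp, dev_rate, triage_hours):
    # Machine time plus the expected cost of triaging a false alarm.
    return machine_rate * exec_hours + prob_fp * dev_rate * triage_hours

def cost_skip(prob_tp, dev_rate, freeze_hours, devs_on_branch):
    # Expected cost of finding the defect later: a frozen branch
    # blocks every developer working on it.
    return prob_tp * dev_rate * freeze_hours * devs_on_branch

def decide(prob_fp, prob_tp):
    c_exec = cost_execution(machine_rate=0.5, exec_hours=2.0,
                            prob_fp=prob_fp, dev_rate=75, triage_hours=1.5)
    c_skip = cost_skip(prob_tp=prob_tp, dev_rate=75,
                       freeze_hours=4.0, devs_on_branch=30)
    return "suspend" if c_exec > c_skip else "execute test"

print(decide(prob_fp=0.08, prob_tp=0.02))  # with these numbers: execute test
```

Note how the asymmetry falls out of the model: skipping is only worthwhile when a test's bug-finding probability is low relative to its false-alarm rate and execution cost.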

Page 34: Can we induce change with what we measure?

Current Results

Simulated on the Windows 8.1 development period (BVT only)

Page 35: Can we induce change with what we measure?

Dynamic, Self-Adaptive

Decision points are connected to each other: skipping tests influences the risk factors of higher-level branches.

We re-enable tests if code quality drops (e.g., in a different milestone). A sketch of this rule follows the chart below.

[Chart: relative test reduction rate (0% to 70%) over time during Windows 8.1 development, with the training period marked at the start.]
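A hedged sketch of that self-adaptive behavior, with invented signals and thresholds (the production model's risk accounting is more involved):

```python
# Two pieces of the self-adaptive loop, with hypothetical signals.
def propagate_risk(branch_risk, parent_branch, skipped_test_risk):
    """Skipping a test shifts its residual risk to the parent branch,
    so decision points up the branch hierarchy stay connected."""
    branch_risk[parent_branch] = branch_risk.get(parent_branch, 0.0) + skipped_test_risk
    return branch_risk

def should_reenable(escaped_defects_recently, threshold=3):
    """Re-enable suspended tests once too many defects escape, e.g.
    when a new milestone changes the quality profile of check-ins."""
    return escaped_defects_recently > threshold

risk = propagate_risk({}, "winmain", skipped_test_risk=0.02)
print(risk, should_reenable(escaped_defects_recently=5))  # {'winmain': 0.02} True
```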

Page 36: Can we induce change with what we measure?

Bug Finding Performance of Tests

How many test executions fail? [Chart: number of test executions and number of failed test executions, per branch level.]

How many of the failed test executions result in bug reports? [Chart: split of failed executions into false positives (FP), test-unspecific true positives (TP), and test-specific true positives (TP), per branch level.]

Page 37: Can we induce change with what we measure?

Impact on Development Process

Secondary improvements:

• Machine setup: we may lower the number of machines allocated to the testing process

• Developer satisfaction: removing false test failures increases confidence in the testing process

…the speed improvement is hard to estimate through simulation.

“We used the data […] to cut a bunch of bad content and are running a much leaner BVT system […] we’re panning out to scale about 4x and run in well under 2 hours” (Jason Means, Windows BVT PM)

Page 38: Can we induce change with what we measure?

Michaela Greiler (@mgreiler)

www.michaelagreiler.com

http://research.microsoft.com/en-us/projects/tse/