assessing the frequency of empirical evaluation in software modeling research

Assessing the Frequency of Empirical Evaluation in Software Modeling Research

Workshop on Experiences and Empirical Studies in Software Modelling (EESSMod)

October 17, 2011

Jeffrey C. Carver, Eugene Syriani and Jeff Gray (presenter)

University of Alabama, Department of Computer Science
{carver, esyriani, gray}@cs.ua.edu


Carver, Syriani, Gray Empirical Evaluation at MoDELS

Background

• Many creative modeling ideas
• Impression that the field has not followed the traditional Scientific Method
• Most new techniques are not (thoroughly) evaluated
• Investigate the prevalence of this phenomenon
  – Considered MODELS papers from 2006-2010
  – Also considered papers from an empirical conference (ESEM)



Background: Empirical Studies

The understanding of a discipline evolves over time. We get more sophisticated in our methods. We are able to test and prove or disprove hypotheses.

The empirical paradigm has been used in many other fields, e.g., physics, medicine, manufacturing

Understanding a Discipline (diagram; “models” used more generally on this slide)

• Building Models: application domain, workflows, problem solving processes
• Checking Understanding: testing models, experimenting in the real world
• Analyzing Results: learn, encapsulate knowledge and refine models
• Evolving Models



Empirical Studies: Misconceptions

• Empirical studies are not “one-shot deals.” Studies on live development projects are not the only ones that matter.

• Software engineering is a laboratory science
  – Understanding our discipline involves
    • Observation, reflection, model building, experimentation
    • Followed by iteration
  – Symbiotic nature of research and development
    • Research needs laboratories to observe & manipulate variables
    • Development needs to understand how to build systems better



Empirical Studies: Misconceptions

• Overall purpose
  – Diagram labels: Yes/No Certification of a technology; Assist in evolution; Find appropriate environment; Yield insights and answers
• “We ran a study of technology X and now we know…”
  – Technology X doesn’t work (NO)
  – Technology X performed worse than technology Y in our environment (YES)
    • “Environment” includes people & their expertise, project goals, etc.
    • Measuring performance implies we decided on some metric that we felt was an important indicator
  – No solution is really expected to be better for all users under all conditions



Empirical Studies: Outputs

• Empirical study can help to provide information of interest to teams that might eventually adopt a technology:
  – Does it work better for certain types of people?
    • Novices: It’s a good solution for training
    • Experts: Users need certain background knowledge…
  – Does it work better for certain types of systems?
    • Static/dynamic aspects, complexity
    • Familiar/unfamiliar domains
  – Does it work better in certain development environments?
    • Users [did/didn’t] have the right documentation, knowledge, amount of time, etc… to use it

Shull, 2004



Our Objective and Methodology

• Goal: Determine how many recent modeling papers had some type of empirical evaluation of their claims

• Three-step methodology
  – Develop initial characterization scheme
  – Identify candidate papers
  – Review candidate papers and finalize characterization



Characterization Scheme

(X marks a property the type does not have)

Type                                 | Empirical Evaluation | Involved Human Participants | Comparison against other Methods
1. No evaluation                     | X                    | X                           | X
2. Non-human, proposed tool only     |                      | X                           | X
3. Non-human, comparison             |                      | X                           |
4. Human observation                 |                      |                             | X
5. Human-based Controlled Experiment |                      |                             |

Formative Case Studies: Papers gather information about use of technique in practice
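One way to read the scheme is as a decision over three yes/no properties of a paper. The sketch below is a hypothetical illustration of that reading, not the authors' procedure: the names `EvaluationType` and `classify` are invented here, and treating "human participants plus comparison" as a controlled experiment is an assumption drawn from the three columns. Formative case studies sit outside the three properties and are not modeled.

```python
from enum import Enum


class EvaluationType(Enum):
    """The five types named in the characterization scheme."""
    NO_EVALUATION = 1
    NON_HUMAN_TOOL_ONLY = 2
    NON_HUMAN_COMPARISON = 3
    HUMAN_OBSERVATION = 4
    HUMAN_CONTROLLED_EXPERIMENT = 5


def classify(evaluated: bool, human_participants: bool,
             comparison: bool) -> EvaluationType:
    """Map the three scheme properties to a type (assumed reading)."""
    if not evaluated:
        return EvaluationType.NO_EVALUATION
    if not human_participants:
        if comparison:
            return EvaluationType.NON_HUMAN_COMPARISON
        return EvaluationType.NON_HUMAN_TOOL_ONLY
    # Assumption: a human-based study that compares against other
    # methods is counted as a controlled experiment.
    if comparison:
        return EvaluationType.HUMAN_CONTROLLED_EXPERIMENT
    return EvaluationType.HUMAN_OBSERVATION
```

For example, a paper that evaluates only its own tool on benchmark models, with no participants and no baseline, would fall under type 2 in this reading.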



Results

Year  | Total | No Eval   | Non-Human:    | Non-Human: | Human-Based: | Human-Based:          | Formative
      |       |           | No Comparison | Comparison | Observation  | Controlled Experiment | Case Study
2006  | 51    | 42 (82%)  | 6 (12%)       | 0 (0%)     | 1 (2%)       | 1 (2%)                | 1 (2%)
2007  | 45    | 36 (80%)  | 2 (4%)        | 5 (11%)    | 0 (0%)       | 2 (4%)                | 0 (0%)
2008  | 58    | 39 (67%)  | 8 (14%)       | 2 (3%)     | 2 (3%)       | 4 (7%)                | 3 (5%)
2009  | 58    | 45 (78%)  | 5 (9%)        | 2 (3%)     | 2 (3%)       | 1 (2%)                | 3 (5%)
2010  | 54    | 33 (61%)  | 8 (15%)       | 4 (7%)     | 2 (4%)       | 4 (7%)                | 3 (6%)
Total | 266   | 195 (73%) | 29 (11%)      | 13 (5%)    | 7 (3%)       | 12 (5%)               | 10 (4%)
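The percentages in the table can be rechecked from the raw counts. The following is a minimal Python sketch, not part of the original study: the names `CATEGORIES`, `COUNTS`, `percent` and `column_totals` are illustrative, and rounding half up to a whole percent is an assumption about how the slide's figures were produced.

```python
# Counts per year, transcribed from the results table (MODELS 2006-2010).
CATEGORIES = (
    "no evaluation",
    "non-human, no comparison",
    "non-human, comparison",
    "human observation",
    "controlled experiment",
    "formative case study",
)

COUNTS = {
    2006: (42, 6, 0, 1, 1, 1),
    2007: (36, 2, 5, 0, 2, 0),
    2008: (39, 8, 2, 2, 4, 3),
    2009: (45, 5, 2, 2, 1, 3),
    2010: (33, 8, 4, 2, 4, 3),
}


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole` as a whole percent, rounded half up."""
    return (200 * part + whole) // (2 * whole)


def column_totals(counts: dict) -> tuple:
    """Sum each category over all years."""
    return tuple(sum(column) for column in zip(*counts.values()))


totals = column_totals(COUNTS)
grand_total = sum(totals)
shares = {name: percent(n, grand_total) for name, n in zip(CATEGORIES, totals)}

for name in CATEGORIES:
    print(f"{name}: {shares[name]}%")
```

Under these assumptions the counts sum to 266 papers, and the formative case study share (10 of 266) rounds to 4%, which matches the figure shown on the summary chart.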



Results – Summary from 2006-2010

[Chart: No Evaluation 73%, No Comparison 11%, Comparison 5%, Observation 3%, Controlled Experiment 5%, Formative Case Study 4%. A separate callout on the chart reads 17%.]



Results - Trends

[Chart: percentage of papers in each category per year, 2006 to 2010; vertical axis from 0% to 90%. Series: No Evaluation, No Comparison, Comparison, Observation, Controlled Experiment, Formative Case Study.]



Results: Human-Based Controlled Experiments

• Total of 12 in 5 years! Should be more
• Observations
  – Generally, low level of detail reported
  – Most had fewer than 25 participants
    • 2 had over 50, 1 did not even report the number
  – Most participants were undergraduate students
  – Many papers mistakenly equate “discussion” with “evaluation”



Results: Formative Case Studies

• Total of 10, need to see more
• 4 did not involve humans
  – Analyze existing source code to understand how various modeling tools would/would not work
• 6 involved humans
  – Surveys to understand how existing tools were not meeting developer needs
• Generally, a study of output requirements for needed tools



Results: ESEM Focus

• The ESEM conference has three types of papers: Regular Papers, Short Papers, and Posters
• Across the same 5 year period, we only found 17 modeling papers
  – Of those 17 papers, only 4 were Regular Papers (10 pages IEEE or ACM format) out of 178 Regular candidates
  – 10 were Short Papers (4 pages) out of a total of 118 Short Papers
  – 3 of the papers were Poster summaries
• Even within the empirical area, modeling papers are not very well represented (typically, just short papers)
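To put the ESEM counts in proportion, the share of modeling papers in each category can be worked out from the numbers on this slide. This is a small illustrative sketch: `ESEM_MODELING` and `share_percent` are names made up here, and posters are left out because the slide does not give the total number of posters.

```python
# (modeling papers found, total papers in that category), from the slide.
ESEM_MODELING = {
    "regular": (4, 178),
    "short": (10, 118),
}


def share_percent(found: int, total: int) -> float:
    """Percentage of `found` in `total`, rounded to one decimal place."""
    return round(100 * found / total, 1)


for kind, (found, total) in ESEM_MODELING.items():
    print(f"{kind}: {found} of {total} = {share_percent(found, total)}%")
```

On these figures, modeling papers make up roughly 2% of the Regular Papers and roughly 8% of the Short Papers.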



Conclusions

• Summary:
  – Rigor of empirically validated research in software modeling is weak
  – Very large percentage of papers with no evaluation
  – Did not include technical reports or extended publication in a journal
  – Plan to repeat analysis with SoSym
  – Would like to push the community to conduct more empirical evaluations
  – Paper has URLs pointing to the data from our observations
• Recommendations:
  – Team up with empirical researchers
  – Venues need to provide additional space for reporting empirical results (e.g., 2 extra pages in paper length for those papers that have a clear evaluation)



Questions or comments?
