creating and analyzing source code repository models - a model-based approach to mining software...
Model-based Analysis of Source Code Repositories
Markus Scheidgen, Martin Schmidt, Joachim Fischer
Humboldt-Universität zu Berlin
Agenda
▶ Software Evolution, Reverse Engineering, and Mining Software Repositories
▶ Model-based Mining of Software Repositories
▶ srcrepo: a framework for model-based analysis of software repositories
▶ Experiments with Eclipse’s software repositories
▶ Conclusions
Software Evolution
[Diagram labels: software engineering, software maintenance, requirements, user, problem, models M, code {C}, system S, revisions R0 … RHEAD, reverse engineering, transformation, forward engineering]
▶ mining software repositories [1]
■ quality assessment
■ implicit dependencies
▶ software modernization: OMG's ADM (Architecture-Driven Modernization)
■ AST Meta-Model (ASTM)
■ Knowledge Discovery Meta-Model (KDM)
■ Software Metrics Meta-Model (SMM)
1. H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
Mining Software Repositories – In General
▶ The term mining software repositories (MSR) has been coined to describe a broad class of investigations into the examination of software repositories.
▶ The premise of MSR is that empirical and systematic investigations of repositories will shed new light on the process of software evolution. [1]
▶ Different scopes, e.g. single software projects vs. many software projects
▶ Different goals, e.g. quality assessments and implicit dependencies vs. generalizations about software evolution
Model-based Mining of Software Repositories
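[Diagram: reverse engineering maps a system S with code {C} to a model M. Applied to a repository with revisions R0 … RHEAD and code states {C0} … {Cn-1}, {Cn}, it yields models M for the revisions.]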
1. E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
2. R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
Model-based Mining of Software Repositories
▶ Scope
■ depends on the concrete MSR application and its goals
■ number of software projects: single repositories, large repositories, ultra-large repositories
■ Sources as text and text-based metrics, e.g. LOC
■ Declarations only: packages, classes, methods, but no statements, expressions, etc.
■ Full AST with or without cross-references
Model-based Mining of Software Repositories
▶ MSR tools are already “model-based”, but in a proprietary manner
▶ Idea: use existing reverse engineering frameworks and the corresponding standard meta-models and modeling frameworks instead of proprietary solutions
▶ Goals
■ deal with heterogeneity (different version control systems, different languages)
■ reuse of existing meta-models, transformations, and languages
■ interoperability with existing analysis tools
■ retaining meaningful scalability
Model-based MSR Strategies
[Diagram: the version control system stores the compilation units A1 and B1 and the changes A1-A2 and B1-B3. Checkout(r) produces the snapshots (A1, B1), (A2, B1), and (A2, B3); the operations below turn them into the models M1, M2, and M3. Further diagram labels: f, B.f.]
▶ Checkout(r), then Parse(d) for every compilation unit d ∈ CUs(r), then Analysis(r)
■ Cost: $\sum_{R} \left( \mathit{Checkout} + \sum_{\mathit{CUs}} \mathit{Parse} + \mathit{Analysis} \right)$
▶ Checkout(r), then Parse(d) only for d ∈ ΔCUs(r), then Merge(r), then Analysis(r)
■ Cost: $\sum_{R} \left( \mathit{Checkout} + \sum_{\Delta\mathit{CUs}} (\mathit{Parse} + \mathit{Merge}) + \mathit{Analysis} \right)$
▶ Save(r) the models; later Load(r), then Merge(r), then Analysis(r)
■ Cost: $\sum_{R} \left( \sum_{\Delta\mathit{CUs}} (\mathit{Load} + \mathit{Merge}) + \mathit{Analysis} \right)$
▶ Load(r), then Analysis′(r)
■ Cost: $\sum_{R} \sum_{\Delta\mathit{CUs}} (\mathit{Load} + \mathit{Analysis}')$
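The strategies on the preceding slides differ only in which operations run per revision and per compilation unit. The following Python sketch evaluates the four cost sums for a small revision history. The `Costs` values and the `(all_cus, changed_cus)` encoding of a revision are hypothetical and chosen for illustration; they are not measurements from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-operation costs (arbitrary time units, not measured)."""
    checkout: int      # Checkout(r), once per revision
    parse: int         # Parse(d), once per compilation unit
    merge: int         # Merge, once per changed compilation unit
    load: int          # Load, once per changed compilation unit
    analysis: int      # Analysis(r), once per revision
    analysis_inc: int  # Analysis'(r) share per changed compilation unit


def strategy_costs(revisions, c):
    """Evaluate the four cost sums.

    revisions: list of (all_cus, changed_cus) counts, one pair per revision.
    """
    return {
        # sum_R (Checkout + sum_CUs Parse + Analysis)
        "checkout+parse": sum(
            c.checkout + n * c.parse + c.analysis for n, _ in revisions),
        # sum_R (Checkout + sum_dCUs (Parse + Merge) + Analysis)
        "checkout+parse+merge": sum(
            c.checkout + d * (c.parse + c.merge) + c.analysis
            for _, d in revisions),
        # sum_R (sum_dCUs (Load + Merge) + Analysis)
        "load+merge": sum(
            d * (c.load + c.merge) + c.analysis for _, d in revisions),
        # sum_R sum_dCUs (Load + Analysis')
        "load+analysis'": sum(
            d * (c.load + c.analysis_inc) for _, d in revisions),
    }


# Three revisions of two compilation units, as in the A/B diagram:
# the first revision touches both units, the later ones touch one each.
history = [(2, 2), (2, 1), (2, 1)]
example = Costs(checkout=10, parse=5, merge=1, load=2,
                analysis=4, analysis_inc=1)

for name, total in strategy_costs(history, example).items():
    print(f"{name}: {total}")
```

With these made-up numbers the totals shrink from one strategy to the next, because work per revision moves from all compilation units to the changed ones and from parsing to loading.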
Research Questions
▶ Assumptions
■ Development of MSR-applications based on models, transformation languages and standardized meta-models is favorable
■ Some MSR-applications need to analyze source code on a deep (AST) level
■ MSR-analysis is performed iteratively
▶ Hypotheses
■ Models of source code repositories can be created and persisted
■ Traversing existing persistent models of source code repositories is much faster than traversing transient models that are created from a version control system on the fly
srcrepo – A Framework for Model-based MSR
▶ Eclipse’s MoDisco as reverse engineering framework
■ reverse engineering for Java, based on EMF
■ Support for many JRE-based languages: Java, Xtext, JSP, XML
■ creates instances of a Java EMF meta-model that corresponds to the handwritten JDT AST-model
■ provides transformation to language independent artifacts, e.g. KDM
▶ EMF-Fragments [1] to store very large models
■ uses NoSQL databases and stores larger model fragments within database entries
■ in contrast to object-by-object stores such as ORM-based CDO or NoSQL-based Morsa or Neo4j
▶ Xtend programming with higher-order functions to mimic OCL-style definitions of software metrics [2]
1. M. Scheidgen, A. Zubow, J. Fischer, T.H. Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; ACM/IEEE 15th International Conference on Model Driven Engineering Languages & Systems (MODELS); Innsbruck; 2012
2. M. Scheidgen, J. Fischer: Model-based Mining of Software Repositories; 8th Systems and Modeling Conference, Valencia, Spain, September 29th, 2014
“OCL” to Calculate Metrics of AST-Models
// Weighted number of methods per class.
def wmc(AbstractTypeDeclaration type, (Block)=>int weight) {
  type.bodyDeclarations.sum[if (body != null) weight.apply(it.body) else 1]
}
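The same metric shape can be tried without MoDisco. The sketch below is a stand-in, not the srcrepo implementation: it uses Python's standard `ast` module in place of the Java EMF model, and `statement_count` is a hypothetical weight function chosen for illustration. As in the Xtend definition, the weight is passed in as a function over the method body; because Python methods always have a body, the "else 1" case for body-less methods does not arise here.

```python
import ast


def statement_count(body):
    """Example weight: number of statement nodes in a method body."""
    return sum(isinstance(node, ast.stmt)
               for stmt in body
               for node in ast.walk(stmt))


def wmc(class_node, weight=lambda body: 1):
    """Weighted number of methods of one class: sum of weight(method body)."""
    methods = [n for n in class_node.body
               if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return sum(weight(m.body) for m in methods)


SOURCE = '''
class Stack:
    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)

    def pop(self):
        if not self.items:
            raise IndexError("empty")
        return self.items.pop()
'''

stack_class = ast.parse(SOURCE).body[0]
print(wmc(stack_class))                   # unweighted: counts methods
print(wmc(stack_class, statement_count))  # weighted by statement count
```

Swapping the weight function changes the metric without touching the traversal, which is the point of passing the weight as a higher-order function.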
Experiments
▶ Eclipse Foundation sources, i.e. Eclipse platform and plug-ins (large scale software repository)
▶ Organized in a few hundred projects: jdt, cdt, emf, ...
▶ Available via GitHub
▶ Git repositories can be gathered automatically via GitHub's RESTful API
▶ 200 largest Eclipse repositories that actually contained Java code: 6.6 GB Git, 400 MLOC, 250 GB model with 4 billion objects.
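Gathering the repositories automatically can be sketched as follows. This is an illustration under assumptions, not the script used for the experiments: it pages through GitHub's "list organization repositories" endpoint, the organization name passed to `list_java_repos` is up to the caller, and filtering on the repository's primary `language` field only approximates "actually contained Java code". Unauthenticated requests are rate-limited by GitHub.

```python
import json
import urllib.request

API = "https://api.github.com"


def repos_url(org, page, per_page=100):
    """URL of one result page of an organization's repository list."""
    return f"{API}/orgs/{org}/repos?per_page={per_page}&page={page}"


def clone_urls(payload, language="Java"):
    """Clone URLs of repositories whose primary language matches."""
    return [repo["clone_url"]
            for repo in json.loads(payload)
            if repo.get("language") == language]


def list_java_repos(org):
    """Page through all repositories of org (performs network requests)."""
    urls, page = [], 1
    while True:
        with urllib.request.urlopen(repos_url(org, page)) as response:
            payload = response.read().decode("utf-8")
        if not json.loads(payload):  # an empty page marks the end
            return urls
        urls += clone_urls(payload)
        page += 1
```

The returned clone URLs can then be handed to `git clone`; selecting the largest repositories would need an additional sort, for example on the `size` field of each entry.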
Example Plot: Halstead Length for Each Revision; Eclipse CDT
[Plot: WMC with Halstead length (× 10^6, axis from 0 to 10) over time (years, axis from 2004 to 2016), one point per revision]
Model Creation vs. Analysis Times
[Bar chart: time (hours, 0 to 14) per project (jdt.ui, xtext, eclipselink, jdt.core, swt, cdt, ocl, ptp, org.aspectj, cdo), split into checkout, parse, save, load, merge/increment, and udf]
[Bar chart: average time per revision (ms, 0 to 800) for checkout, parse, save, load, merge/increment, and udf]
Disk Space
[Bar chart: Git repository size vs. model size in GB (0 to 15) for jdt.core, cdt, jdt.ui, cdo, emf.emfstore.core, emf, jdt.debug, emf.compare, emf.texo, emf.diffmerge.core]
Delta-Compression [1]
[Bar chart "Named element matching": size in GB (0 to 14) for cdt, webtools, gmf-tooling, rap, jdt.core; series: uncompressed, initial revisions, delta models]
[Bar chart "Meta-class matching": size in GB (0 to 14) for the same projects; series: uncompressed, initial revisions, delta models]
[Bar chart "Line matching": size in MLines (0 to 100) for the same projects; series: uncompressed, initial revisions, delta lines]
[Bar chart "Compressed relative to full size (%)" (0 to 80): initial revisions, delta models for meta-class, delta lines, delta models for named elements]
[Bar chart "Avg. execution times": average time per revision (ms, 0 to 1500) for parse, compress named elements, compress meta-class, decompress meta-class, decompress named elements]
[Bar chart "Avg. execution times (logarithmic)": the same operations on a logarithmic axis (10 to 1000 ms)]
1. M. Scheidgen: Evaluation of Model Comparison for Delta-Compression in Model Persistence; BigMDE 2016 (at STAF 2016), Vienna, Austria, July 6-7, 2016
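The idea behind delta-compression with named element matching can be illustrated with a toy model. The sketch below is not the EMF-based comparison evaluated in the cited paper: a revision is reduced to a dictionary from (hypothetical) qualified element names to element contents, elements of consecutive revisions are matched by name, and only the initial revision is kept in full.

```python
def delta(old, new):
    """Match elements by name; keep only what differs between two revisions."""
    return {
        "removed": sorted(old.keys() - new.keys()),
        "changed": {name: body for name, body in new.items()
                    if name not in old or old[name] != body},
    }


def apply_delta(old, d):
    """Rebuild the next revision from its predecessor and a delta."""
    model = {name: body for name, body in old.items()
             if name not in d["removed"]}
    model.update(d["changed"])  # covers both added and modified elements
    return model


def compress(revisions):
    """Keep the initial revision in full, every later one as a delta."""
    deltas = [delta(a, b) for a, b in zip(revisions, revisions[1:])]
    return revisions[0], deltas


def decompress(initial, deltas):
    """Replay the deltas to recover all revisions in order."""
    revisions = [initial]
    for d in deltas:
        revisions.append(apply_delta(revisions[-1], d))
    return revisions


# Toy history: element names and contents are made up for illustration.
history = [
    {"A.f": "v1", "B.g": "v1"},
    {"A.f": "v2", "B.g": "v1"},
    {"A.f": "v2", "C.h": "v1"},
]
initial, deltas = compress(history)
print(deltas)
```

Decompression has to replay every delta up to the requested revision, which is one reason compress and decompress times are worth measuring alongside the storage savings.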
Model Creation vs. Analysis with Delta-Compression
[Two bar charts, without compression and with compression: time (hours, 0 to 14) per project (jdt.ui, xtext, eclipselink, jdt.core, swt, cdt, ocl, ptp, org.aspectj, cdo), split into checkout, parse, save, load, merge/increment, and udf; the chart with compression adds a compress phase]
Conclusions
▶ MSR can support software evolution and help to understand it
▶ Traversing a source code repository to gather information (MSR) is very time-consuming, especially with iterative analysis
▶ Most of this time can be saved by persisting the data in its model state, at the cost of comparatively large models that need to be stored
▶ The MSR analysis execution time savings are considerable