
Using Heuristic Search Techniques to Extract Design Abstractions from Source Code

The Genetic and Evolutionary Computation Conference (GECCO'02).

Brian S. Mitchell & Spiros Mancoridis
Math & Computer Science, Drexel University


Drexel University Software Engineering Research Group (SERG), http://serg.mcs.drexel.edu

Software Clustering Background

• Software clustering simplifies program maintenance and program understanding
• Software clustering techniques help developers fix defects (maintenance) or add features (program understanding) to existing software systems


Understanding the Software Structure

• It’s important to understand the software structure when fixing or extending a software system
  • Desirable to change as few of the existing modules/classes as possible
• Problem 1: The structure is complex and often not documented for large systems
• Problem 2: Ad hoc changes to the source code tend to deteriorate the system’s structure over time


Clustering Techniques

A variety of techniques for software clustering have been studied by the reverse engineering community:
• Source code component similarity (or dissimilarity)
• Concept Analysis
• Subsystem Patterns
• Implementation-Specific Information

Our clustering approach uses search algorithms


Design Extraction with Bunch

[Figure: the Bunch design-extraction pipeline. Source code is processed by source code analysis tools (Acacia, Chava) into an MDG file with modules M1 to M8. The Bunch clustering tool (Bunch GUI, clustering algorithms, programming API) turns it into a partitioned MDG file, which is passed to a visualization tool.]


Step 1: Creating the MDG

Example: The MDG for Apache’s Regular Expression class library

[Figure: source code is processed by source code analysis tools (Acacia, Chava) to produce the MDG.]

1. The MDG (Module Dependency Graph) can be generated automatically using source code analysis tools
2. Nodes are the modules/classes, edges represent source-code relations
3. Edge weights can be established in many ways, and different MDGs can be created depending on the types of relations considered
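
The three points above can be sketched as a small weighted directed graph. This is a minimal illustration, not the output format of Acacia, Chava, or Bunch: the module names, the relation list, and the weighting rule (each relation found between two modules adds 1 to the edge weight) are all assumptions.

```python
from collections import defaultdict


def build_mdg(relations):
    """Build a weighted MDG from (source_module, target_module) relations.

    Assumption: every relation (a call, a variable reference, ...) adds 1 to
    the weight of the edge between the two modules; relations inside a single
    module are ignored.
    """
    weights = defaultdict(int)
    for src, dst in relations:
        if src != dst:
            weights[(src, dst)] += 1
    return dict(weights)


# Hypothetical relations extracted from source code.
relations = [
    ("M1", "M2"), ("M1", "M2"), ("M2", "M3"),
    ("M3", "M1"), ("M4", "M5"), ("M4", "M4"),
]
mdg = build_mdg(relations)
modules = sorted({module for edge in mdg for module in edge})
print(modules)
print(mdg)
```

Counting a different set of relation types, or weighting them differently, would give a different MDG for the same source code.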


Software Clustering with Search Algorithms

[Figure: source code analysis tools (Acacia, Chava) turn the source code into an MDG with modules M1 to M8. The software clustering search algorithm explores the search space, the set of all MDG partitions (Total = 4140 Partitions), and returns a “GOOD” MDG partition.]

Search algorithm shown in the figure:

    bP = null;                  // best partition found so far
    while (searching()) {
        p = selectNext();       // next candidate partition
        if (p.isBetter(bP))
            bP = p;
    }
    return bP;


Search Algorithm Requirements

• Must be able to compare one partition to another objectively.
• We define the Modularization Quality (MQ) measurement to meet this goal.

Given partitions P1 & P2, MQ(P1) > MQ(P2) means that P1 “is better than” P2
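
As a sketch of how an objective comparison drives the search, the code below enumerates every partition of a six-module graph and keeps the one with the highest score. The slides do not give the MQ formula, so `quality` is a stand-in objective assumed for illustration (per cluster, intra-edge weight divided by intra-edge weight plus half the inter-edge weight). The graph and all function names are hypothetical.

```python
def all_partitions(items):
    """Yield every partition of `items` as a list of clusters (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in all_partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new cluster of its own.
        yield [[first]] + partition


def quality(partition, edges):
    """Stand-in for MQ (an assumption, not the formula from the talk)."""
    cluster_of = {m: i for i, cluster in enumerate(partition) for m in cluster}
    intra = [0.0] * len(partition)
    inter = [0.0] * len(partition)
    for (src, dst), weight in edges.items():
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            intra[a] += weight
        else:
            inter[a] += weight
            inter[b] += weight
    return sum(i / (i + 0.5 * e) for i, e in zip(intra, inter) if i > 0)


def exhaustive_search(modules, edges):
    """Keep the partition P for which quality(P) is highest."""
    best, best_score = None, float("-inf")
    for candidate in all_partitions(modules):
        score = quality(candidate, edges)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical MDG: two triangles joined by a single edge.
modules = ["M1", "M2", "M3", "M4", "M5", "M6"]
edges = {
    ("M1", "M2"): 1, ("M2", "M3"): 1, ("M3", "M1"): 1,
    ("M4", "M5"): 1, ("M5", "M6"): 1, ("M6", "M4"): 1,
    ("M3", "M4"): 1,
}
best, best_score = exhaustive_search(modules, edges)
print(best, round(best_score, 3))
```

For this graph the search separates the two triangles. Enumerating every partition is only workable for very small systems.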



Problem: There are too many partitions of the MDG…

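Number of modules = number of MDG partitions:

1 = 1, 2 = 2, 3 = 5, 4 = 15, 5 = 52
6 = 203, 7 = 877, 8 = 4140, 9 = 21147, 10 = 115975
11 = 678570, 12 = 4213597, 13 = 27644437, 14 = 190899322, 15 = 1382958545
16 = 10480142147, 17 = 82864869804, 18 = 682076806159, 19 = 5832742205057, 20 = 51724158235372

\[
S_{n,k} =
\begin{cases}
1 & \text{if } k = 1 \lor k = n \\
S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise}
\end{cases}
\]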

A 15-module system is about the limit for performing exhaustive analysis

The number of MDG partitions grows very quickly as the number of modules in the system increases…
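
The partition counts on this slide can be reproduced from the recurrence: S(n, k) counts the ways to split n modules into exactly k non-empty clusters, and the total for n modules is the sum of S(n, k) over k = 1..n. A minimal sketch follows; the function names are chosen here for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """S(n, k) for 1 <= k <= n, using the recurrence from the slide."""
    if k == 1 or k == n:
        return 1
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def count_partitions(n):
    """Total number of partitions of an n-module MDG."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


for n in (8, 15, 20):
    print(n, count_partitions(n))
```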


Our Approach to Automatic Clustering

“Treat automatic clustering as a searching problem”
• Maximize an objective function that formally quantifies the “quality” of an MDG partition.
• We refer to the value of the objective function as the modularization quality (MQ)


Edge Types

With respect to each cluster, there are two different kinds of edges:
• Intra-edges: edges that start and end within the same cluster
• Inter-edges: edges that start and end in different clusters

[Figure: a cluster containing nodes a, b, and c, with edges inside the cluster and edges to other clusters.]
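
A minimal sketch of the two edge types for a single cluster. The node names a, b, and c follow the figure; the nodes x and y, the edge list, and the function name are assumptions for illustration.

```python
def classify_edges(cluster, edges):
    """Split the edges that touch `cluster` into intra- and inter-edges."""
    members = set(cluster)
    intra, inter = [], []
    for (src, dst), weight in edges.items():
        if src in members and dst in members:
            intra.append((src, dst, weight))   # starts and ends in the cluster
        elif src in members or dst in members:
            inter.append((src, dst, weight))   # crosses the cluster boundary
    return intra, inter


# Hypothetical edges: x and y belong to other clusters.
edges = {
    ("a", "b"): 1, ("b", "c"): 1,   # inside the cluster
    ("c", "x"): 1, ("y", "a"): 1,   # cross the boundary
    ("x", "y"): 1,                  # does not touch the cluster at all
}
intra, inter = classify_edges(["a", "b", "c"], edges)
print(intra)
print(inter)
```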


Our Assumption…

“Well designed software systems are organized into cohesive clusters that are loosely interconnected.”

The MQ measurement design must:
• Increase as the weight of the intra-edges increases
• Decrease as the weight of the inter-edges increases
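
The slides state the two requirements but not the formula. The sketch below uses one objective that satisfies both, assumed here for illustration: each cluster contributes its intra-edge weight divided by that weight plus half of its inter-edge weight. It should not be read as the exact MQ definition used in the talk.

```python
def mq(partition, edges):
    """Stand-in MQ: sum over clusters of intra / (intra + inter / 2)."""
    cluster_of = {m: i for i, cluster in enumerate(partition) for m in cluster}
    intra = [0.0] * len(partition)
    inter = [0.0] * len(partition)
    for (src, dst), weight in edges.items():
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            intra[a] += weight
        else:
            # An inter-edge counts against both clusters it connects.
            inter[a] += weight
            inter[b] += weight
    return sum(i / (i + e / 2) for i, e in zip(intra, inter) if i > 0)


partition = [["a", "b"], ["c", "d"]]
base = {("a", "b"): 2, ("c", "d"): 2, ("b", "c"): 1}
heavier_intra = {**base, ("a", "b"): 4}
heavier_inter = {**base, ("b", "c"): 3}

print(mq(partition, base))           # baseline
print(mq(partition, heavier_intra))  # higher than the baseline
print(mq(partition, heavier_inter))  # lower than the baseline
```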


Not all Partitions are Created Equal ...

[Figure: an MDG with modules M1 to M6, shown next to a good partition and a bad partition of the same graph.]

MQ(Good Partition) > MQ(Bad Partition)


The Software Clustering Problem: Algorithm Objectives

“Find a good partition of the MDG.”

• A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.
• A good partition is a partition where:
  • highly interdependent nodes are grouped in the same clusters
  • independent nodes are assigned to separate clusters

The better the partition, the higher the MQ
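
The definition of a partition can be checked mechanically: every node appears in exactly one cluster. A minimal sketch; the function name is hypothetical, and rejecting empty clusters is an assumption.

```python
def is_partition(clusters, nodes):
    """True if `clusters` are non-empty, mutually disjoint, and cover `nodes`."""
    seen = set()
    for cluster in clusters:
        if not cluster:
            return False            # assumption: empty clusters are not allowed
        for node in cluster:
            if node in seen:
                return False        # node in two clusters: not disjoint
            seen.add(node)
    return seen == set(nodes)       # every node must be placed somewhere


nodes = ["M1", "M2", "M3", "M4"]
print(is_partition([["M1", "M2"], ["M3", "M4"]], nodes))        # valid
print(is_partition([["M1", "M2"], ["M2", "M3", "M4"]], nodes))  # M2 appears twice
print(is_partition([["M1", "M2"], ["M3"]], nodes))              # M4 is missing
```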


Bunch Hill Climbing Clustering Algorithm

[Flowchart: generate a random decomposition of the MDG as the current partition. Iteration step: generate the next neighbor partition, measure its MQ, and compare it to the best neighboring partition; if it is better, it becomes the new best neighboring partition. The best neighboring partition for the iteration becomes the current partition, and the process repeats until convergence.]

A neighbor partition is created by altering the current partition slightly.
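
The loop in the flowchart can be sketched as a steepest-ascent hill climber. This is a simplified illustration under stated assumptions, not Bunch's implementation: a neighbor is produced by moving one module to another (or a new) cluster, `mq` is a stand-in objective because the slides do not give the formula, edge weights are assumed positive, and all names are hypothetical.

```python
import random


def mq(assignment, edges):
    """Stand-in objective: sum over clusters of intra / (intra + inter / 2)."""
    intra, inter = {}, {}
    for (src, dst), weight in edges.items():
        a, b = assignment[src], assignment[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + weight
        else:
            inter[a] = inter.get(a, 0) + weight
            inter[b] = inter.get(b, 0) + weight
    return sum(i / (i + inter.get(c, 0) / 2) for c, i in intra.items())


def neighbors(assignment):
    """Yield partitions that differ by moving one module to another cluster."""
    labels = set(assignment.values()) | {max(assignment.values()) + 1}
    for module, current in assignment.items():
        for label in labels:
            if label != current:
                neighbor = dict(assignment)
                neighbor[module] = label
                yield neighbor


def hill_climb(modules, edges, rng):
    # Random decomposition of the MDG: module -> cluster label.
    current = {m: rng.randrange(len(modules)) for m in modules}
    current_mq = mq(current, edges)
    while True:
        best, best_mq = None, current_mq
        for candidate in neighbors(current):       # generate next neighbor
            candidate_mq = mq(candidate, edges)    # measure MQ
            if candidate_mq > best_mq:             # better than best so far?
                best, best_mq = candidate, candidate_mq
        if best is None:                           # convergence: no improvement
            return current, current_mq
        current, current_mq = best, best_mq


modules = ["M1", "M2", "M3", "M4", "M5", "M6"]
edges = {
    ("M1", "M2"): 1, ("M2", "M3"): 1, ("M3", "M1"): 1,
    ("M4", "M5"): 1, ("M5", "M6"): 1, ("M6", "M4"): 1,
    ("M3", "M4"): 1,
}
partition, score = hill_climb(modules, edges, random.Random(0))
print(partition, score)
```

The result is a local optimum for the chosen neighborhood, so different random starting partitions can converge to different results.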


Bunch Genetic Clustering Algorithm (GA)

[Flowchart: generate a starting population (P1, P2, P3, ..., Pn) from the MDG. Iteration step: the crossover operation builds the next population from the current population, using random selection that favors partitions with larger MQ values; the mutation operation then mutates (alters) a small number of randomly selected partitions to give the next generation. When all generations are processed, the result is the best partition from the final population.]
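
The steps in the flowchart can be sketched as a small genetic algorithm. Everything specific below is an assumption for illustration and not taken from Bunch: the encoding (one cluster label per module), one-point crossover, single-gene mutation, the parameter defaults, and the stand-in `mq` objective, which also assumes positive edge weights.

```python
import random


def mq(assignment, edges):
    """Stand-in objective: sum over clusters of intra / (intra + inter / 2)."""
    intra, inter = {}, {}
    for (src, dst), weight in edges.items():
        a, b = assignment[src], assignment[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + weight
        else:
            inter[a] = inter.get(a, 0) + weight
            inter[b] = inter.get(b, 0) + weight
    return sum(i / (i + inter.get(c, 0) / 2) for c, i in intra.items())


def genetic_cluster(modules, edges, rng, pop_size=30, generations=40,
                    mutation_rate=0.05):
    n = len(modules)  # needs at least 2 modules for the crossover cut point

    def fitness(individual):
        return mq(dict(zip(modules, individual)), edges)

    # Starting population: random cluster labels for every module.
    population = [[rng.randrange(n) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Random selection that favors larger MQ; the small offset keeps the
        # weights valid even when every individual scores 0.
        weights = [fitness(ind) + 1e-9 for ind in population]
        next_population = []
        while len(next_population) < pop_size:
            parent1, parent2 = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, n)
            child = parent1[:cut] + parent2[cut:]        # crossover
            if rng.random() < mutation_rate:             # mutate a few children
                child[rng.randrange(n)] = rng.randrange(n)
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)  # best partition from final population
    return dict(zip(modules, best)), fitness(best)


modules = ["M1", "M2", "M3", "M4", "M5", "M6"]
edges = {
    ("M1", "M2"): 1, ("M2", "M3"): 1, ("M3", "M1"): 1,
    ("M4", "M5"): 1, ("M5", "M6"): 1, ("M6", "M4"): 1,
    ("M3", "M4"): 1,
}
partition, score = genetic_cluster(modules, edges, random.Random(1))
print(partition, score)
```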


Clustering Example – Apache Regular Expression Library

[Figure: the MDG of the Apache regular expression library, a random partition of it, and the Bunch partition. Edge legend: < 5 relations, 5-10 relations, > 10 relations.]


Bunch Hill Climbing Clustering Algorithm – Extended Features

[Flowchart: the same hill-climbing loop as on the “Bunch Hill Climbing Clustering Algorithm” slide: random decomposition of the MDG, neighbor generation, MQ measurement, comparison to the best neighboring partition, and repetition until convergence. A neighbor partition is created by altering the current partition slightly.]

Hill-Climbing Algorithm Extended Features
• Adjustable Clustering Threshold
• Simulated Annealing
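
The slides name the two extended features without defining them. The sketch below rests on two assumptions: that the clustering threshold is the minimum percentage of neighbors examined before the hill climber moves to the best improving neighbor found, and that simulated annealing uses the common exp(ΔMQ / T) acceptance rule with geometric cooling, T(k+1) = rate · T(k). All function names are hypothetical.

```python
import math
import random


def neighbors_to_examine(total_neighbors, threshold_percent):
    """Minimum number of neighbors to look at for a given threshold (0-100)."""
    return max(1, math.ceil(total_neighbors * threshold_percent / 100))


def accept(delta_mq, temperature, rng=random):
    """Always accept an equal or better neighbor; accept a worse one with
    probability exp(delta_mq / temperature)."""
    if delta_mq >= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(delta_mq / temperature)


def cool(temperature, cooling_rate):
    """Geometric cooling schedule."""
    return cooling_rate * temperature


# One of the settings from the case study: T(0) = 100, cooling rate = .90.
temperature = 100.0
for _ in range(3):
    temperature = cool(temperature, 0.90)
print(temperature)
print(neighbors_to_examine(40, 0), neighbors_to_examine(40, 50),
      neighbors_to_examine(40, 100))
```

Under these assumptions a 0% threshold behaves like first-improvement hill climbing and a 100% threshold like steepest ascent. As the temperature falls, worse neighbors are accepted less often.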


Research Objectives

Investigate if the new hill-climbing clustering features impact:
• The clustering results
• Clustering performance

Goals
• Provide configuration guidance to Bunch users
• Determine performance versus quality tradeoffs associated with different Bunch configurations
• Gain intuition into the search space of different systems


Case Study Design

• Basic test consisted of 1,050 clustering runs
  • 50 runs with clustering threshold set to 0%
  • Incremented clustering threshold by 5% and repeated the test until clustering threshold reached 100%
• Repeated the basic test 3 additional times with simulated annealing, altering the initial temperature T(0) and cooling rate
• Examined 5 systems: compiler, ispell, rcs, dot, and swing

We used the Bunch API for the case study
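
The run counts in the design follow from simple arithmetic, sketched below. The grand total is an inference that assumes every system went through all four basic tests; the slides do not state that figure.

```python
thresholds = list(range(0, 101, 5))       # 0%, 5%, ..., 100%
runs_per_threshold = 50
basic_test_runs = len(thresholds) * runs_per_threshold

configurations = 1 + 3                    # no SA, plus 3 simulated annealing settings
systems = ["compiler", "ispell", "rcs", "dot", "swing"]
total_runs = basic_test_runs * configurations * len(systems)

print(len(thresholds), basic_test_runs, total_runs)
```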


Case Study Results – RCS

[Figure: RCS results in four columns: No SA; T(0)=100, cooling rate=.99; T(0)=100, cooling rate=.90; T(0)=100, cooling rate=.80. Row “Clustering Threshold & MQ”: MQ (0 to 2.4) versus clustering threshold (0 to 100). Row “Clustering Threshold & MQ Evals.”: MQ evaluations in 1000's (0 to 14) versus clustering threshold (0 to 100). Row “MQ of Random Partitions”: one plot for RCS, with axis ticks 0 to 1.4 and 0 to 1000.]


Case Study Results – Swing

[Figure: Swing results in four columns: No SA; T(0)=100, cooling rate=.99; T(0)=100, cooling rate=.90; T(0)=100, cooling rate=.80. Row “Clustering Threshold & MQ”: MQ (0 to 8) versus clustering threshold (0 to 100). Row “Clustering Threshold & MQ Evals.”: MQ evaluations in millions (0 to 40) versus clustering threshold (0 to 100). Row “MQ of Random Partitions”: one plot for Swing, with axis ticks 0 to 1.4 and 0 to 1000.]


Case Study Results - Summary

• The clustering threshold had an expected and consistent impact on the clustering runtime
• The clustering threshold did not appear to have any impact on the quality of the clustering results
• The hill-climbing algorithm provides some intuition into the search landscape for the systems studied
• The software clustering results were always better than randomly generated clusters


Case Study Results - Summary

[Figure: MQ versus clustering threshold (0 to 100) for compiler, ispell, rcs, dot, and swing.]

Intuition into the search landscape… (annotations on the plots)
• Rare partitions
• Systems that converge to a consistent neighborhood
• Multimodal search space


Case Study Results - Summary

[Figure: bar chart “Reduced Percentage of MQ Evaluations using Simulated Annealing” (0% to 40%) for compiler, ispell, rcs, dot, and swing.]

• Simulated annealing did not have any noticeable impact on the quality of clustering results.
• Simulated annealing did appear to reduce the overall runtime needed to cluster the sample systems.


Concluding Remarks

• It was expected that increasing the clustering threshold would impact the runtime or clustering results; neither was found to be true
• Simulated annealing did not improve the quality of the clustering results but did decrease the overall clustering runtime
• We obtained some intuition into the search landscape of the systems studied


Questions

Special Thanks To: AT&T Research, Sun Microsystems, DARPA, NSF, US Army