
MSc Software Maintenance / MS Viðhald hugbúnaðar

Fyrirlestrar 37 og 38 (Lectures 37 and 38)

Program Exploration with Dora

Dr Andy Brooks


Case Study / Dæmisaga

Reference: Exploring the Neighborhood with Dora to Expedite Software Maintenance, Emily Hill, Lori Pollock, and K. Vijay-Shanker, 22nd International Conference on Automated Software Engineering (ASE'07), pp. 14-23, 2007 ©ACM

Lexical information

• When searching code, developers can write a query to retrieve program elements that might be relevant to the maintenance task.

• Lexical information located in comments and identifier names can be helpful.

• Such searches, however, can retrieve many program elements which are irrelevant to the maintenance task (false positives).

• Lexical searches ignore structural information (call graphs, type hierarchies, etc.)


e.g. identifier names such as savingsAccount, loanAccount, longTermSavingsAccount, longTermLoanAccount…

Structural information

• When navigating code, developers can follow call chains or relationships in type hierarchies.

• A single program element, however, can be structurally connected to “tens or hundreds of other elements”, only a few of which are relevant to the maintenance task.
– A method might have 10 callers – does the developer need to navigate to and understand what all 10 do?
– A method might have 10 callees – does the developer need to navigate to and understand what all 10 do?
– A method might belong to a class which belongs to an inheritance hierarchy made up of 10 classes – does the developer need to navigate to and understand what all 10 do?


Dora the Program Explorer (‘exploradora’ is the Spanish word for a female explorer)

• “... most existing program exploration techniques use either lexical or structural information, despite evidence that successful programmers use lexical as well as structural information to explore programs.”

• Dora exploits both program structure and lexical information to help developers more effectively explore programs.

• Dora uses lexical information to prune away irrelevant program structure edges.


Dora the Program Explorer (‘exploradora’ is the Spanish word for a female explorer)

• Dora outputs a relevant neighbourhood given a starting seed and a natural language query related to the maintenance task.

• “The current implementation of Dora uses the program call graph as the program structure representation to be explored and methods as the seed elements.”
– “call graphs are relatively inexpensive to calculate”

• Dora scores method relevance with respect to the natural language query.


[Figure: a call graph explored outward from a seed method, from Emily Hill, University of Delaware. The shaded area is the relevant neighbourhood.]

‘Add Auction’ concern example

• OSS program jBidWatcher, http://www.jbidwatcher.com/
– “A Java-based application allowing you to monitor auctions you're not part of, submit bids, snipe (bid at the last moment), and otherwise track your auction-site experience.”

• A concern is informally described as a high-level idea, or feature, implemented in code.
• The ‘Add Auction’ concern is made up of two components: a trigger and a handler.
• The trigger processes the user-initiated GUI event.
• The handler adds the auction.
• These two components are linked by a data dependence on the event queue.


auction/uppboð


[Figure 1: the ‘Add Auction’ concern. A lexical search using ‘add*auction’ misses the shaded methods. Dora can expand deep call chains.]

[Figure 2: triggering an ‘add auction’ event. DoAction calls 38 methods, only two of which are relevant to the concern: doAdd and doPasteFromClipBoard (the 36 other methods are collapsed in the figure). Dora helps eliminate breadth choices.]


‘Add Auction’ concern example

DoAdd and DoPasteFromClipBoard

• “It is not obvious from the method name that DoPasteFromClipBoard is relevant to adding an auction.”

• A developer skimming through code containing many method calls might easily miss the relevance of DoPasteFromClipBoard.

• So the lexical techniques employed in Dora analyze both the method signature and the method body.


‘Add Auction’ concern example

Eclipse's simple lexical search

• Querying methods rather than files with ‘add*auction’ in Eclipse matches 50 methods, only 11 of which are relevant to the concern.

• The 3 methods in the top ten (from the 50) have dashed lines in Figure 1.

• “The lexical search misses 4 relevant methods that are easily found using structural call edges.”


‘Add Auction’ concern example

Eclipse's call hierarchy feature

• Developers can recursively view all descendants or all ancestors of a method but not both at the same time.
• “For example, starting from the Auctions.addEntry method, it is impossible to realize that Auctions.addEntry and AuctionEntry.AuctionEntry share the caller JBidMouse.addAuction in Eclipse's call hierarchy without changing to a different starting point.”


‘Add Auction’ concern example

[Diagram: JBidMouse.addAuction is a shared caller of both Auctions.addEntry and AuctionEntry.AuctionEntry.]

Step 1. Determine the query

• Query terms could be derived from:
– the maintenance request itself
• “Adding an auction does not work. Fix it.”
– expert knowledge
• “The functionality here makes use of the M-V-C pattern. Look for a relevant model and view.”
– previous simple lexical searches
• “I found one reference to ADD_AUCTION, maybe I should search with ‘add*auction’.”
– interactive query expansion: typing ADD reveals ADD_AUCTION

• Users initially unfamiliar with the code may have to evolve their query term selection as they become more familiar with the code.


4. PROGRAM EXPLORATION WITH DORA

Step 2. Identify the seed method set

• The seed method set could come from:
– expert knowledge
• “This method is definitely the key to the successful addition of an auction.”
– the developer's own knowledge
• “I think this method is relevant to the addition of an auction.”
– previous simple lexical searches
• “I found one method that seems relevant to the addition of an auction.”

• Dora works with static program models, so relevant methods that are not statically connected to an existing seed must themselves be added to the seed method set.


4. PROGRAM EXPLORATION WITH DORA

Step 3. Identify the relevant neighborhood

• From the methods in the seed method set, Dora explores the call graph edges, scoring each method's relevance according to the query input by the user.
– lexical-based relevance scores

• Low-scoring methods are removed from consideration according to some threshold.

• Exploration is recursive.


4. PROGRAM EXPLORATION WITH DORA

Step 4. Output the relevant neighbourhood

• Dora displays the relevant call graph neighborhood (the relevant neighborhood).

• The user can add further query terms or further seed methods and rerun Dora if the relevant neighborhood seems incomplete.


4. PROGRAM EXPLORATION WITH DORA

Thresholds

• Callers and callees are candidates for the relevant neighborhood.

• A method which scores higher than a given threshold t1 is added to the relevant neighborhood.

• A method which scores less than a given threshold t2 is considered irrelevant.

• A method scoring between t1 and t2 is explored further.

• “This use of two thresholds guards against missing very relevant methods that are connected by a borderline irrelevant method.”

• Default values (Sjálfgefin gildi): t1 = 0.5 and t2 = 0.3
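A minimal sketch (not the authors' implementation) of how this two-threshold exploration could look in Python; the call_graph object with callers()/callees() and the score(method, query) function are assumed helpers:

```python
# Sketch of Dora-style neighborhood exploration with two thresholds.
# `call_graph.callers(m)` / `call_graph.callees(m)` and `score(m, query)`
# are assumed helpers (not the authors' API); score returns a value in [0, 1].

T1 = 0.5  # above t1: added to the relevant neighborhood
T2 = 0.3  # below t2: considered irrelevant and pruned

def explore(seeds, query, call_graph, score):
    neighborhood = set(seeds)
    visited = set(seeds)
    frontier = list(seeds)
    while frontier:
        method = frontier.pop()
        for neighbor in call_graph.callers(method) + call_graph.callees(method):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            s = score(neighbor, query)
            if s >= T1:
                neighborhood.add(neighbor)   # clearly relevant: keep and keep exploring
                frontier.append(neighbor)
            elif s >= T2:
                frontier.append(neighbor)    # borderline: explore through it
            # s < T2: pruned, not explored further
    return neighborhood
```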


5.1.1 Term Frequency

• Simple term frequency (tf) can be used to determine the relevance of a method.
– The word ‘auction’ appears 25 times in the addAuction method.
– The word ‘sort’ appears only once in the addAuction method.
• A term, however, can occur very frequently throughout the source code.
– ‘auction’ appears in 470 of the 1,812 methods in jBidWatcher.

• The more methods that include a term, the less the term discriminates between methods.


5.1 Components of the Method Relevance Score

5.1.1 Term Frequency

• To combat this problem, tf-idf scores are calculated.
– tf * idf, where idf, the inverse document frequency for a term, is:


5.1 Components of the Method Relevance Score

idf = ln( total number of methods / number of methods containing the term )    (ln is the natural log)

– ‘auction’ appears in 470 of the 1,812 methods in jBidWatcher.
– ‘add’ appears in 261 of the 1,812 methods in jBidWatcher.
– So, in scoring a term in a method, more occurrences of ‘auction’ are needed to obtain the same tf-idf score as for ‘add’.
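A quick worked example using the counts quoted above; the term frequencies (3 and 2) are made up purely for illustration:

```python
import math

TOTAL_METHODS = 1812

def idf(methods_containing_term):
    # inverse document frequency, as defined above
    return math.log(TOTAL_METHODS / methods_containing_term)

idf_auction = idf(470)  # ~1.35: 'auction' is common, so it discriminates less
idf_add = idf(261)      # ~1.94: 'add' is rarer, so each occurrence counts more

# tf-idf for a term in a method is its frequency times its idf.
# With these (made-up) frequencies, 'auction' must appear more often
# than 'add' to reach roughly the same tf-idf score.
print(3 * idf_auction)  # ~4.05
print(2 * idf_add)      # ~3.88
```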

5.1.1 Term Frequency

• A preprocessing step is applied to queries and methods.
• Identifiers are split into terms based on non-alphabetic characters.
– add_auction becomes ‘add’ and ‘auction’
• Identifiers are also split into terms based on camel case, the practice of writing compound words or phrases in which the beginning letters of words are capitalized.
– addAuction becomes ‘add’ and ‘auction’
• Terms are converted into lower case and stemmed using Porter's stemming algorithm.
– ‘auctioned’ is mapped to ‘auction’
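A minimal sketch of this preprocessing, assuming NLTK's Porter stemmer is available (the paper does not prescribe any particular implementation):

```python
import re
from nltk.stem import PorterStemmer  # assumed available; any Porter implementation would do

stemmer = PorterStemmer()

def split_identifier(identifier):
    # split on non-alphabetic characters: add_auction -> ['add', 'auction']
    parts = re.split(r'[^a-zA-Z]+', identifier)
    terms = []
    for part in parts:
        # split on camel case: addAuction -> ['add', 'Auction']
        terms.extend(re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])', part))
    # lower-case and stem each term: 'auctioned' -> 'auction'
    return [stemmer.stem(t.lower()) for t in terms if t]

print(split_identifier("add_auction"))     # ['add', 'auction']
print(split_identifier("addAuction"))      # ['add', 'auction']
print(split_identifier("auctionedItems"))  # ['auction', 'item']
```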


5.1 Components of the Method Relevance Score

5.1.2 Method Features

• Class and package names are not used because they are shared by many system components.

• Terms in method names are considered to be more important than terms in method bodies.

• The number of statements within a method body containing a term is counted and multiplied by the term's idf score. These tf-idf scores are summed and then normalized by the method length.
– avoids any bias toward longer methods

• Is a method binary (i.e. no source code available)? This is also taken into account.


5.1 Components of the Method Relevance Score

Weighting name, statement and binary

• To determine weightings, logistic regression was applied to a training set.

• The training set comprised methods from nine concerns used in a previous study.

• Methods one call edge away were also included.
• “We manually inspected each method and annotated them as either relevant or irrelevant.”
– subjective bias hopefully limited by using three Java programmers
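A minimal sketch of how such weights might be obtained with off-the-shelf logistic regression (scikit-learn here); the feature values and labels below are illustrative assumptions, not the authors' data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training rows: [bin, name, statement] features per method,
# with y = 1 if the annotators judged the method relevant to its concern.
X = np.array([
    [1, 3.2, 0.8],
    [1, 0.0, 0.1],
    [0, 0.0, 0.0],
    [1, 1.9, 0.4],
    [1, 0.2, 0.0],
    [0, 2.8, 0.0],
])
y = np.array([1, 0, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# The fitted intercept and coefficients play the role of the weights on
# bin, name, and statement in the relevance score formula on the next slides.
print(model.intercept_, model.coef_)
print(model.predict_proba([[1, 2.5, 0.5]])[:, 1])  # estimated probability of relevance
```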


5.2 Calculating the Method Relevance Score

Weighting name, statement and binary

• After training the model, the method relevance score (the probability that a method is relevant to a query) is:


5.2 Calculating the Method Relevance Score

p = e^x / (1 + e^x), where x = 0.5·bin + 2.5·name + 0.5·statement

• bin, binary (1 if a java file exists, otherwise 0)
• name = ∑ tf-idf for each query term in the method name
• statement = ∑ tf-idf for each query term in a method statement, divided by the number of method statements
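A minimal sketch of the scoring step under these definitions; the method attributes (has_source, name_terms, statements) and the idf lookup are assumed helpers, not the authors' code:

```python
import math

# Weights as quoted in the formula above; treat them as illustrative.
W_BIN, W_NAME, W_STATEMENT = 0.5, 2.5, 0.5

def relevance_score(query_terms, method, idf):
    # bin: 1 if source (a .java file) exists for the method, otherwise 0
    bin_feature = 1 if method.has_source else 0

    # name: sum of tf-idf over query terms occurring in the method name
    name_feature = sum(method.name_terms.count(t) * idf[t] for t in query_terms)

    # statement: number of statements containing each query term, weighted
    # by idf and normalized by the number of statements in the method body
    statement_feature = 0.0
    if method.statements:  # each statement is assumed to be a list of terms
        for t in query_terms:
            hits = sum(1 for stmt in method.statements if t in stmt)
            statement_feature += hits * idf[t]
        statement_feature /= len(method.statements)

    x = W_BIN * bin_feature + W_NAME * name_feature + W_STATEMENT * statement_feature
    return math.exp(x) / (1 + math.exp(x))  # logistic transform -> probability
```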

Comments

• Using comments did not improve model prediction.

• “... we focussed the model on the simplest variables that best predicted relevance: binary, name, and statement.”


5.2 Calculating the Method Relevance Score

Andy says: it would be useful to know more about the comments in the methods used for training the model. It seems very counter-intuitive that using comments did not improve model prediction.

6.1.1 Variables and measures

• Method scoring technique is the independent variable:
– Suade
• Robillard's structural topology approach
– Dora
• combined lexical and structural approach
– boolean AND
• 1 if all query terms appear in the method
– boolean OR
• 1 if any query term appears in the method
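For comparison, the two boolean baselines are trivial to express; a sketch assuming the same preprocessed term lists as elsewhere:

```python
def boolean_and_score(query_terms, method_terms):
    # 1 if every query term appears somewhere in the method, else 0
    return 1 if all(t in method_terms for t in query_terms) else 0

def boolean_or_score(query_terms, method_terms):
    # 1 if any query term appears somewhere in the method, else 0
    return 1 if any(t in method_terms for t in query_terms) else 0
```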


6.1 Experimental Design

6.1.1 Variables and measures

• Effectiveness is the dependent variable as measured by precision and recall.


6.1 Experimental Design

precision = number of relevant methods reported / total number of methods reported

recall = number of relevant methods reported / total number of relevant methods

High precision implies few irrelevant methods are reported. High recall implies only a few relevant methods are missed.
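A small sketch of these measures over sets of methods; the sets below are made up for illustration:

```python
def precision_recall(reported, relevant):
    # reported: methods a technique returned; relevant: the oracle set
    true_positives = len(reported & relevant)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative numbers only: 10 methods reported, 4 of them relevant,
# out of 8 relevant methods in total.
reported = {f"m{i}" for i in range(10)}
relevant = {"m0", "m1", "m2", "m3", "m20", "m21", "m22", "m23"}
print(precision_recall(reported, relevant))  # (0.4, 0.5)
```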

6.1.2 Subjects

• To avoid investigator bias, 8 concerns were taken from another study.

• Methods were selected from these 8 concerns by 3 independent developers.

• There were, however, varying levels of agreement between the 3 independent developers. See Table 1.
– ∩3 is the number agreed by all three
– ∩2 is the number agreed by at least two

• “We considered any method selected by at least two developers to be relevant to a concern.”


6.1 Experimental Design

This set of relevant methods serves as the oracle for the evaluation.

Table 1: Concerns and queries used in the evaluation ©ACM


6.1 Experimental Design

∪ denotes union in Table 1.

Andy says: The accurate identification of relevant methods is crucial to the experiment. Table 1, however, reveals relatively poor agreement between the three developers. This is a big weakness of the experiment and many would dismiss the evaluation.

Queries

• To avoid investigator bias, the queries shown in Table 1 were written by an independent researcher “who had no knowledge of our scoring technique”.

• “The queries were selected by looking at concern descriptions, lexical searches, and a query expansion mechanism [37].”


6.1 Experimental Design

Andy says: The queries seem reasonable but it would be useful to know what the results would be like for different queries e.g. one term instead of two or two instead of three.

6.1.3 Methodology

• The evaluation was restricted to one edge away from a single seed method.

• For each method m in the 8 concerns, each scoring technique was applied to all the callers and callees of m.
– precision and recall calculated for each m
• A total of 1885 call edges were evaluated by each scoring technique.
– Suade, Dora, boolean AND, boolean OR


6.1 Experimental Design

6.1.3 Methodology

• Suade and Dora work with thresholds.
– boolean AND and OR require no threshold
• Figure 3 is a precision-recall graph where the threshold has been systematically varied from 0 to 1 in intervals of 0.005.
• Scoring was only applied to one edge away from each seed, so setting Dora's t2 threshold was unnecessary.


6.1 Experimental Design

Figure 3: Precision-Recall Graph

• Each point represents precision and recall averaged over a given threshold. (Threshold decreasing left to right.)


6.2 Results

Andy says: boxplots would have been better than simple averages.

Marked points on the graph: at t1 = 0.5, Dora achieves roughly (recall, precision) = (59%, 43%); at t1 = 0.99, roughly (23%, 23%).

6.2.1 Precision, Recall, and Threshold Variation

• AND performs similarly to Dora at a higher-than-optimal threshold (t1 = 0.99).
– On average AND performed badly, but AND performs well for concerns 9 and 12.
• “AND is clearly very sensitive to the query.”
• OR performs similarly to Dora at a less-than-optimal threshold (t1 < 0.5).
– 60% recall against 59%
– 32% precision against 43%

• Dora outperforms Suade in terms of precision over many thresholds.


6.2 Results

Andy says: The simple boolean OR seems quite effective!

6.3 Threats to Validity

• The 8 concerns were from only 3 open source programs, so the results might not generalize.
– Andy says: I agree. More experimental work is needed.
• The selection of methods relevant to a concern was subjective.
– “It is possible that the developers missed relevant methods or even included irrelevant ones.”
– “Since all the techniques are subject to the same vulnerability, we do not feel this is a serious threat to the validity of the study.”
– Andy says: I disagree. The threat to validity is serious.


Remarks by Andy

• The key idea in Dora is sound: to prune away structurally-connected but otherwise irrelevant code.

• The scope of a single maintenance request can vary from a single statement to the system in its entirety. An explicit model of maintenance requests is, however, absent in this work. Without such a model, method relevance cannot be judged safely.
– Table 1 clearly demonstrates the difficulty of assessing relevance.
• Boolean OR was quite effective. On average, out of every 10 methods returned, around 3 are relevant as opposed to 4 from Dora.
– Should we just keep it simple and rely on boolean OR?
