Transcript
Page 1: Introducing Natural Language Program Analysis

Introducing Natural Language Program Analysis

Lori Pollock, K. Vijay-Shanker, David Shepherd,

Emily Hill, Zachary P. Fry, Kishen Maloor

Page 2: Introducing Natural Language Program Analysis

NLPA Research Team Leaders

Lori Pollock, “Team Captain”

K. Vijay-Shanker, “The Umpire”

Page 3: Introducing Natural Language Program Analysis

Problem: Modern software is large and complex

object oriented class hierarchy

Software development tools are needed

Page 4: Introducing Natural Language Program Analysis

Successes in Software Development Tools

object oriented class hierarchy

Good with local tasks

Good with traditional structure

Page 5: Introducing Natural Language Program Analysis

object oriented class hierarchy

Scattered tasks are difficult

Programmers use more than traditional program structure

Issues in Software Development Tools

Page 6: Introducing Natural Language Program Analysis

public interface Storable{...

activate tool

save drawing

update drawing

undo action

public void Circle.save()

//Store the fields in a file....

object oriented system

Key Insight: Programmers leave natural language clues that can benefit software development tools

Observations in Software Development Tools

Page 7: Introducing Natural Language Program Analysis

Studies on choosing identifiers

Impact of human cognition on names [Liblit et al. PPIG 06]
  Metaphors, morphology, scope, part-of-speech hints
  Hints for understanding code

Analysis of function identifiers [Caprile and Tonella WCRE 99]
  Lexical, syntactic, semantic
  Use for software tools: metrics, traceability, program understanding

Carla, the compiler writer: “I don’t care about names. So, I could use x, y, z.”

Pete, the programmer: “But no one will understand my code.”

Page 8: Introducing Natural Language Program Analysis

Our Research Path

[MACS 05, LATE 05]

[AOSD 06]

[ASE 05, AOSD 07, PASTE 07]

Motivated usefulness of exploiting natural language (NL) clues in tools

Developed extraction process and an NL-based program representation

Created and evaluated a concern location tool and an aspect miner with NL-based analysis

Page 9: Introducing Natural Language Program Analysis


Name: David C. Shepherd
Nickname: Leadoff Hitter
Current Position: PhD, May 30, 2007
Future Position: Postdoc with Gail Murphy

Stats
Year  coffees/day  red marks/paper draft
2002  0.1          500
2007  2.2          100

Page 10: Introducing Natural Language Program Analysis

Applying NL Clues for Aspect Mining

Aspect-Oriented Programming

Aspect Mining Task: Locate refactoring candidates

Molly, the Maintainer: “How can I fix Paul’s atrocious code?”

Page 11: Introducing Natural Language Program Analysis

Timna: An Aspect Mining Framework [ASE 05]

Uses program analysis clues for mining
Combines clues using machine learning
Evaluated vs. Fan-in: Precision (quality) and Recall (completeness)

        P   R
Fan-In  37  2
Timna   62  60

Page 12: Introducing Natural Language Program Analysis

Integrating NL Clues into Timna

iTimna (Timna with NL)
  Integrates natural language clues
  Example: Opposite verbs (open and close)

        P   R
Fan-In  37  2
Timna   62  60
iTimna  81  73

Natural language information increases the effectiveness of Timna [Come back Thurs 10:05am]

Page 13: Introducing Natural Language Program Analysis

Applying NL Clues for Concern Location: Motivation

60-90% of software costs are spent on reading and navigating code for maintenance* (fixing bugs, adding features, etc.)

*[Erlikh] Leveraging Legacy System Dollars for E-Business

Page 14: Introducing Natural Language Program Analysis

Key Challenge: Concern Location

Find, collect, and understand all source code related to a particular concept

Concerns are often crosscutting

Page 15: Introducing Natural Language Program Analysis

State of the Art for Concern Location

Mining Dynamic Information [Wilde ICSM 00]: slow

Program Structure Navigation [Robillard FSE 05, FEAT, Schaefer ICSM 05]: reduced to a similar problem

Search-Based Approaches: fast, but fragile, sensitive, no semantics
  RegEx [grep, Aspect Mining Tool 00]
  LSA-Based [Marcus 04]
  Word-Frequency Based [GES 06]

Page 16: Introducing Natural Language Program Analysis

Limitations of Search Techniques

1. Return large result sets

2. Return irrelevant results

3. Return hard-to-interpret result sets

Page 17: Introducing Natural Language Program Analysis

The Find-Concept Approach

Find-Concept takes a concept, forms a concrete query, and applies natural language information over an NL-based code representation of the source code to produce recommendations: a result graph of methods (a, b, c, d, e).

1. More effective search
2. Improved search terms
3. Understandable results

Page 18: Introducing Natural Language Program Analysis

Underlying Program Analysis

Action-Oriented Identifier Graph (AOIG) [AOSD 06]
  Provides access to NL information
  Provides interface between NL and traditional program analysis

Word Recommendation Algorithm
  NL-based
    Stemmed/Rooted: complete, completing
    Synonym: finish, complete
  Combining NL and traditional
    Co-location: completeWord()
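The three recommendation sources above (stemming, synonyms, co-location) can be sketched in code. This is an illustrative toy, not the authors' implementation: the synonym table is hand-built (a real tool would consult a lexical database), and the stemmer is a naive suffix stripper.

```java
import java.util.*;

// Toy sketch of an NL-based word recommender: given a query word, suggest
// related words via a naive suffix stemmer, a hand-built synonym table,
// and co-location inside camelCase identifiers.
public class WordRecommender {
    // Hypothetical synonym table; a real tool would consult WordNet.
    static final Map<String, Set<String>> SYNONYMS = Map.of(
        "complete", Set.of("finish"),
        "finish", Set.of("complete"));

    // Very naive stemmer: strips common verb suffixes.
    static String stem(String word) {
        for (String suffix : List.of("ing", "ed", "s")) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Words co-located with the query inside a camelCase identifier.
    static Set<String> coLocated(String query, List<String> identifiers) {
        Set<String> result = new TreeSet<>();
        for (String id : identifiers) {
            List<String> lower = new ArrayList<>();
            for (String part : id.split("(?<=[a-z])(?=[A-Z])")) {
                lower.add(part.toLowerCase());
            }
            if (lower.contains(query)) {
                for (String w : lower) if (!w.equals(query)) result.add(w);
            }
        }
        return result;
    }

    static Set<String> recommend(String query, List<String> identifiers) {
        Set<String> related = new TreeSet<>();
        related.add(stem(query));                                // stemmed form
        related.addAll(SYNONYMS.getOrDefault(query, Set.of()));  // synonyms
        related.addAll(coLocated(query, identifiers));           // co-location
        related.remove(query);
        return related;
    }

    public static void main(String[] args) {
        // 'completeWord' co-locates "word" with "complete"; the table adds "finish".
        System.out.println(recommend("complete", List.of("completeWord", "openFile")));
    }
}
```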

Page 19: Introducing Natural Language Program Analysis

Experimental Evaluation

Research Questions
  Which search tool is most effective at forming and executing a query for concern location?
  Which search tool requires the least human effort to form an effective query?

Methodology: 18 developers complete nine concern location tasks on medium-sized (>20 KLOC) programs

Measures: Precision (quality), Recall (completeness), F-Measure (combination of both P & R)

Tools compared: Find-Concept, GES, ELex
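For concreteness, the three measures can be computed from counts of relevant and retrieved results. A minimal sketch (the example counts are invented):

```java
// Precision, recall, and F-measure from counts of relevant and retrieved results.
public class SearchMeasures {
    static double precision(int relevantRetrieved, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) relevantRetrieved / retrieved;
    }
    static double recall(int relevantRetrieved, int relevant) {
        return relevant == 0 ? 0.0 : (double) relevantRetrieved / relevant;
    }
    // F-measure: harmonic mean of precision and recall.
    static double fMeasure(double p, double r) {
        return (p + r) == 0 ? 0.0 : 2 * p * r / (p + r);
    }
    public static void main(String[] args) {
        double p = precision(8, 10);  // 8 of 10 retrieved results are relevant
        double r = recall(8, 16);     // 8 of 16 relevant methods were retrieved
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", p, r, fMeasure(p, r));
    }
}
```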

Page 20: Introducing Natural Language Program Analysis

Overall Results

Effectiveness
  FC > ELex with statistical significance
  FC >= GES on 7/9 tasks
  FC is more consistent than GES

Effort
  FC = ELex = GES

Across all tasks, FC is more consistent and more effective in the experimental study without requiring more effort.

Page 21: Introducing Natural Language Program Analysis

Natural Language Extraction from Source Code

Key Challenges:
  Decode name usage
  Develop automatic extraction process
  Create NL-based program representation

Molly, the Maintainer: “What was Pete thinking when he wrote this code?”

Page 22: Introducing Natural Language Program Analysis

Natural Language: Which Clues to Use?

Software Maintenance
  Typically focused on actions
  Objects are well-modularized

Maintenance Requests

Page 23: Introducing Natural Language Program Analysis

Natural Language: Which Clues to Use?

Software Maintenance
  Typically focused on actions
  Objects are well-modularized

Focus on actions
  Correspond to verbs
  Verbs need a Direct Object (DO)

Extract verb-DO pairs

Page 24: Introducing Natural Language Program Analysis

Extracting Verb-DO Pairs

Two types of extraction:

class Player {
    /**
     * Play a specified file with specified time interval
     */
    public static boolean play(final File file, final float fPosition, final long length) {
        fCurrent = file;
        try {
            playerImpl = null;
            // make sure to stop non-fading players
            stop(false);
            // Choose the player
            Class cPlayer = file.getTrack().getType().getPlayerImpl();
            ...
    }
}

Extraction from comments

Extraction from method signatures

Page 25: Introducing Natural Language Program Analysis

public UserList getUserListFromFile( String path ) throws IOException {
    try {
        File tmpFile = new File( path );
        return parseFile( tmpFile );
    } catch( java.io.IOException e ) {
        throw new IOException( "UserList format issue: " + path + " file " + e );
    }
}

Extracting Clues from Signatures

1. POS Tag Method Name

2. Chunk Method Name

3. Identify Verb and Direct-Object (DO)

POS Tag: get<verb> User<adj> List<noun> From<prep> File<noun>

Chunk: get<verb phrase> User List<noun phrase> From File<prep phrase>
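The signature steps (split the name, tag each word, pick out the verb and direct object) can be sketched with a toy part-of-speech lexicon. This is a simplification of the approach, not the actual tagger: the word lists are invented, and everything unknown is tagged as a noun rather than distinguishing adjectives.

```java
import java.util.*;

// Minimal sketch of tagging a method name and extracting a verb-DO pair.
public class SignatureTagger {
    // Tiny illustrative lexicons; a real tool would use a trained POS tagger.
    static final Set<String> VERBS = Set.of("get", "set", "parse", "play", "remove", "save");
    static final Set<String> PREPOSITIONS = Set.of("from", "to", "with", "by", "on");

    // Step 0: split a camelCase method name into words.
    static List<String> split(String name) {
        return Arrays.asList(name.split("(?<=[a-z])(?=[A-Z])"));
    }

    // Step 1: tag each word; unknown words default to noun.
    static List<String> tag(String name) {
        List<String> tagged = new ArrayList<>();
        for (String word : split(name)) {
            String w = word.toLowerCase();
            String pos = VERBS.contains(w) ? "verb"
                       : PREPOSITIONS.contains(w) ? "prep" : "noun";
            tagged.add(word + "<" + pos + ">");
        }
        return tagged;
    }

    // Steps 2-3 (simplified): the first verb, then the words before any
    // preposition, form the verb and direct object.
    static String[] verbAndDO(String name) {
        String verb = null;
        StringBuilder dObj = new StringBuilder();
        for (String word : split(name)) {
            String w = word.toLowerCase();
            if (verb == null && VERBS.contains(w)) { verb = w; continue; }
            if (PREPOSITIONS.contains(w)) break;
            if (verb != null) dObj.append(word);
        }
        return new String[] { verb, dObj.toString() };
    }

    public static void main(String[] args) {
        System.out.println(tag("getUserListFromFile"));
        System.out.println(Arrays.toString(verbAndDO("getUserListFromFile")));
    }
}
```

On getUserListFromFile this yields the verb "get" with direct object "UserList", matching the chunking shown above.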

Page 26: Introducing Natural Language Program Analysis


Name: Zak Fry
Nickname: The Rookie
Current Position: Upcoming senior
Future Position: Graduate school

Stats
Year  diet cokes/day  lab days/week
2006  1               2
2007  6               8

Page 27: Introducing Natural Language Program Analysis

Developing rules for extraction

For many methods:
  Identify relevant verb (V) and direct object (DO) in method signature
  Classify pattern of V and DO locations
  If new pattern, create new extraction rule


Page 28: Introducing Natural Language Program Analysis

Our Current Extraction Rules

4 general rules with subcategories:

Rule               Example                  DO       Verb
Left Verb          URL parseURL()           URL      parse
Right Verb         void mouseDragged()      mouse    dragged
Generic Verb       void Host.onSaved()      host     saved
Unidentified Verb  void message()           message  -
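The four general rules can be sketched as a simple classifier over the method name. The verb list and placement heuristics here are illustrative assumptions, not the authors' actual rule set:

```java
import java.util.*;

// Illustrative classifier for the four general extraction rules.
public class ExtractionRules {
    // Toy verb lexicon (an assumption for this sketch).
    static final Set<String> VERBS = Set.of("parse", "dragged", "saved", "get", "remove");
    // Generic "event" prefixes that carry no action meaning on their own.
    static final Set<String> GENERIC_PREFIXES = Set.of("on", "handle", "do");

    static List<String> split(String name) {
        return Arrays.asList(name.split("(?<=[a-z])(?=[A-Z])"));
    }

    static String classify(String methodName) {
        List<String> words = split(methodName);
        String first = words.get(0).toLowerCase();
        String last = words.get(words.size() - 1).toLowerCase();
        if (GENERIC_PREFIXES.contains(first) && words.size() > 1)
            return "Generic Verb";       // e.g. onSaved(): verb after a generic prefix
        if (VERBS.contains(first))
            return "Left Verb";          // e.g. parseURL(): verb at the left
        if (VERBS.contains(last))
            return "Right Verb";         // e.g. mouseDragged(): verb at the right
        return "Unidentified Verb";      // e.g. message(): no recognizable verb
    }

    public static void main(String[] args) {
        for (String m : List.of("parseURL", "mouseDragged", "onSaved", "message"))
            System.out.println(m + " -> " + classify(m));
    }
}
```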

Page 29: Introducing Natural Language Program Analysis

Example: Sub-Categories for Left-Verb General Rule

Look beyond the method name: parameters, return type, declaring class name, type hierarchy

Subcategories:
1) Standard left verb
2) No DO in method name; has parameters; non-object return type
3) No DO in method name; no parameters; no return type
4) Creational left verb; has return type
5) No DO in method name; has parameters; return type is more specific than parameters in type hierarchy
6) No DO in method name; parameters are more specific than parameters in type hierarchy

Example: subcategory 2 applies, yielding the verb-DO pair <remove, UserID>

Page 30: Introducing Natural Language Program Analysis

Representing Verb-DO Pairs

Action-Oriented Identifier Graph (AOIG)

Verbs: verb1, verb2, verb3. Direct objects: DO1, DO2, DO3. Verb-DO pair nodes: (verb1, DO1), (verb1, DO2), (verb3, DO2), (verb2, DO3). Each verb and DO node has a "use" edge to its pair nodes, and each pair node has "use" edges to the source code files where the pair occurs.

Page 31: Introducing Natural Language Program Analysis

Action-Oriented Identifier Graph (AOIG)

Verbs: play, add, remove. Direct objects: file, playlist, listener. Verb-DO pair nodes: (play, file), (play, playlist), (remove, playlist), (add, listener), each linked by "use" edges to the source code files where the pair occurs.

Representing Verb-DO Pairs
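An AOIG-like structure can be held in a few maps: verb and DO nodes point to their pair nodes, and each pair records the files that use it. This is a minimal sketch, not the AOIG paper's API; the method names and example files are invented.

```java
import java.util.*;

// Minimal AOIG-like structure: verb/DO nodes -> pair nodes -> using files.
public class Aoig {
    // (verb, DO) pair -> source files that use it
    final Map<String, Set<String>> pairUses = new HashMap<>();
    final Map<String, Set<String>> pairsByVerb = new HashMap<>();
    final Map<String, Set<String>> pairsByDO = new HashMap<>();

    void addUse(String verb, String dObj, String file) {
        String pair = verb + "," + dObj;
        pairUses.computeIfAbsent(pair, k -> new TreeSet<>()).add(file);
        pairsByVerb.computeIfAbsent(verb, k -> new TreeSet<>()).add(pair);
        pairsByDO.computeIfAbsent(dObj, k -> new TreeSet<>()).add(pair);
    }

    // All files where a given verb acts on anything (e.g. for concern location).
    Set<String> filesForVerb(String verb) {
        Set<String> files = new TreeSet<>();
        for (String pair : pairsByVerb.getOrDefault(verb, Set.of()))
            files.addAll(pairUses.get(pair));
        return files;
    }

    public static void main(String[] args) {
        Aoig g = new Aoig();
        g.addUse("play", "file", "Player.java");
        g.addUse("play", "playlist", "Playlist.java");
        g.addUse("remove", "playlist", "Playlist.java");
        g.addUse("add", "listener", "Observer.java");
        System.out.println(g.filesForVerb("play"));
    }
}
```

Querying filesForVerb("play") walks from the verb node through its pair nodes to the using files, mirroring the "use" edges in the graph above.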

Page 32: Introducing Natural Language Program Analysis

Evaluation of Extraction Process

Compare automatic vs. ideal (human) extraction
  300 methods from 6 medium open source programs
  Annotated by 3 Java developers

Promising Results
  Precision: 57%
  Recall: 64%

Context of Results
  Did not analyze trivial methods
  On average, at least the verb OR the direct object was obtained

Page 33: Introducing Natural Language Program Analysis


Name: Emily Gibson Hill
Nickname: Batter on Deck
Current Position: 2nd year PhD Student
Future Position: PhD Candidate

Stats
Year  cokes/day  meetings/week
2003  0.2        1
2007  2          5

Page 34: Introducing Natural Language Program Analysis

Program Exploration

Purpose: Expedite software maintenance and program comprehension

Key Insight: Automated tools can use program structure and identifier names to save the developer time and effort

Ongoing work:

Page 35: Introducing Natural Language Program Analysis

Dora the Program Explorer*

* Dora comes from exploradora, the Spanish word for a female explorer.

Natural Language Query
  • Maintenance request
  • Expert knowledge
  • Query expansion

Program Structure
  • Representation (current: call graph)
  • Seed starting point

Relevant Neighborhood
  • Subgraph relevant to query

Page 36: Introducing Natural Language Program Analysis

State of the Art in Exploration

Structural (dependence, inheritance)
  Slicing
  Suade [Robillard 2005]

Lexical (identifier names, comments)
  Regular expressions: grep, Eclipse search
  Information Retrieval: FindConcept, Google Eclipse Search [Poshyvanyk 2006]

Page 37: Introducing Natural Language Program Analysis

Motivating the need for structural and lexical information

Example Scenario
  Program: JBidWatcher, an eBay auction sniping program
  Bug: User-triggered add auction event has no effect
  Task: Locate code related to ‘add auction’ trigger
  Seed: DoAction() method, from prior knowledge

Page 38: Introducing Natural Language Program Analysis

(Figure: call graph in which DoAction() fans out to dozens of irrelevant DoNada()-style methods)

Using only structural information

Looking for: ‘add auction’ trigger

DoAction() has 38 callees; only 2/38 (DoAdd() and DoPasteFromClipboard()) are relevant, the rest are irrelevant methods.

And what if you wanted to explore more than one edge away?

Locates locally relevant items, but many irrelevant.

Page 39: Introducing Natural Language Program Analysis

Using only lexical information

50/1812 methods contain matches to ‘add*auction’ regular expression query

Only 2/50 are relevant

Locates globally relevant items, but many irrelevant

Looking for: ‘add auction’ trigger

Page 40: Introducing Natural Language Program Analysis

(Figure: pruned call graph; the irrelevant DoNada() methods are removed from the neighborhood)

Combining Structural & Lexical Information

Looking for: ‘add auction’ trigger

Structural: guides exploration from the seed
Lexical: prunes irrelevant edges

Relevant neighborhood: DoAction(), DoAdd(), DoPasteFromClipboard()

Page 41: Introducing Natural Language Program Analysis

The Dora Approach

Determine method relevance to the query: calculate a lexical-based relevance score

Low-scored methods are pruned from the neighborhood, pruning irrelevant structural edges from the seed

Recursively explore

Page 42: Introducing Natural Language Program Analysis

Calculating Relevance Score: Term Frequency

Query: ‘add auction’

Score is based on the query term frequency of the method: a method with 6 query term occurrences scores higher than one with only 2 occurrences.

Page 43: Introducing Natural Language Program Analysis

Calculating Relevance Score: Location Weights

Query: ‘add auction’

Weigh term frequency based on location:
  Method name more important than body
  Method body statements normalized by length
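A toy version of this location-weighted scoring is sketched below. The weights, normalization, and example method bodies are illustrative assumptions; the actual Dora formula differs.

```java
// Toy Dora-style lexical relevance score: count query-term hits in the
// method name and body, weight name hits more heavily, and normalize
// body hits by body length.
public class RelevanceScore {
    // Hypothetical weights: name matters more than body.
    static final double NAME_WEIGHT = 0.8;
    static final double BODY_WEIGHT = 0.2;

    static int countHits(String text, String[] terms) {
        int hits = 0;
        for (String word : text.toLowerCase().split("\\W+"))
            for (String t : terms)
                if (word.contains(t)) hits++;
        return hits;
    }

    static double score(String methodName, String body, String query) {
        String[] terms = query.toLowerCase().split("\\s+");
        int nameWords = methodName.split("(?<=[a-z])(?=[A-Z])").length;
        int bodyWords = Math.max(1, body.split("\\W+").length);
        double nameScore = (double) countHits(methodName, terms) / Math.max(1, nameWords);
        double bodyScore = (double) countHits(body, terms) / bodyWords; // length-normalized
        return NAME_WEIGHT * Math.min(1.0, nameScore) + BODY_WEIGHT * Math.min(1.0, bodyScore);
    }

    public static void main(String[] args) {
        String q = "add auction";
        // A method whose name and body mention the query terms scores high;
        // an unrelated method scores zero.
        System.out.printf("DoAdd: %.2f%n", score("DoAdd", "auction.add(entry); auctions.refresh();", q));
        System.out.printf("DoNada: %.2f%n", score("DoNada", "log.debug(\"noop\");", q));
    }
}
```

Methods scoring below a threshold (0.5 in the experiment on the next slide) would be pruned from the neighborhood.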

Page 44: Introducing Natural Language Program Analysis

Dora explores the ‘add auction’ trigger

From the DoAction() seed, at a 0.5 threshold Dora correctly identified:
  DoAdd() (0.93)
  DoPasteFromClipboard() (0.60)
with only one false positive:
  DoSave() (0.52)

Page 45: Introducing Natural Language Program Analysis

Summary

NL technology used: synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG

Evaluation indicates: natural language information shows promise for improving software development tools

Key to success: accurate extraction of NL clues

Page 46: Introducing Natural Language Program Analysis

Our Current and Future Work

Basic NL-based tools for software
  Abbreviation expander
  Program synonyms
  Determining relative importance of words

Integrating information retrieval techniques

Page 47: Introducing Natural Language Program Analysis

Posed Questions for Discussion

What open problems faced by software tool developers can be mitigated by NLPA?

Under what circumstances is NLPA not useful?

