source code clone search (iman keivanloo phd seminar)

Internet-scale Source Code Search and Analysis Framework. Iman Keivanloo. Advisor: Dr. Juergen Rilling. PhD Seminar, Computer Science and Software Engineering Department, November-17-2011


DESCRIPTION

Source Code Clone Search and Detection (SeClone is a real-time, Internet-scale clone search and detection engine). *The presentation contains some animations; to see them, download it and run it locally.

TRANSCRIPT

Page 1: Source Code Clone Search (Iman keivanloo PhD seminar)

Internet-scale Source Code Search and Analysis Framework

Iman Keivanloo

Advisor: Dr. Juergen Rilling

PhD Seminar

Computer Science and Software Engineering Department

November-17-2011

Page 2: Source Code Clone Search (Iman keivanloo PhD seminar)

2

Agenda

• Research Context

• Major questions & answers

• Next step

• Conclusion

• Time Table

Page 3: Source Code Clone Search (Iman keivanloo PhD seminar)

3

Research Context

Internet-Scale Code Search

“is searching the Internet for source code to help solve a software development problem” [Gallardo, SUITE’09]

Page 4: Source Code Clone Search (Iman keivanloo PhD seminar)

4

How to search for Source Code?

• Free-form Query:

– “how to write into file in Java”

• Structural Query:

– “select col1 from table1 where col1="%write"”

[Keivanloo, ICSM’10][Keivanloo, SUITE’11]

Page 5: Source Code Clone Search (Iman keivanloo PhD seminar)

5

Research Focus

Suggested simplified query: Select line which has

(1) a method call statement on the trigger method.

...
11: CSVReadFile csvData=new CSVReadFile("input.csv");
12: myWindow.trigger(csvData);
13: OutputStream o=new OutputStream();
…

...
59: Event e=new Event(50);
60: e.trigger();
61: e.update();
...

...
133: Listener res=new Listener();
134: res.trigger("warm-up");
135: res.close();
...

...
55: Window r=new Window();
56: long timestamp=System.Now();
57: System.out.println("Start reasoning...");
58: XMLStream xmldata=new XMLStream(io);
59: r.trigger(xmldata);
60: OutputStream o=new OutputStream();
61: r.flush(o);
…

…
89: Window var=new Window();
90: XMLReadFile r=new XMLReadFile ("k.xml");
91: OutputStream o=new OutputStream();
92: var.trigger(r);
93: var.flush(o);
…

Gapped clone

Unordered core

The pattern is similar but it uses XMLStream instead of XMLFile as the input

This match is acceptable, even if the order is different from the 1:1 match

Internet-Scale Structural Code Search Engine

This line looks like a match, however it uses .CSV instead of .XML. We can use our clone search engine to find now other similar code fragments to this one.

Real-time Clone Search Engine

...
10: Window myWindow=new Window();
11: CSVReadFile csvData=new CSVReadFile("...
12: myWindow.trigger(csvData);
13: OutputStream o=new OutputStream();
14: myWindow.flush(o);
15: myWindow.close();
...

Step 2: Input [the selected fragment in the first step and its target line (red)]

Step 1: Input [the simplified structural query]

XMLReadFile inFile=new XMLReadFile("kb.xml");
Window myWindow=new Window();
myWindow.trigger(inFile);
OutputStream result=new OutputStream();
myWindow.flush(result);

The ideal expected answer

Similar Fragment Search

Page 6: Source Code Clone Search (Iman keivanloo PhD seminar)

6

Research Challenge

Page 7: Source Code Clone Search (Iman keivanloo PhD seminar)

7

The Web Search Challenge

Page 8: Source Code Clone Search (Iman keivanloo PhD seminar)

8

But Often Still Fail to Deliver the Expected Results After 10 Years of Research

Page 9: Source Code Clone Search (Iman keivanloo PhD seminar)

9

No Ambiguity!

Page 10: Source Code Clone Search (Iman keivanloo PhD seminar)

10

Early Conclusion

Source Code Search is similar to Web Search

Page 11: Source Code Clone Search (Iman keivanloo PhD seminar)

11

Early Conclusion

Source Code Search is similar to Web Search

1. Search techniques = ?

2. Ambiguity resolution techniques = Code Analysis

Analysis (Ambiguity resolution)

Search

Page 12: Source Code Clone Search (Iman keivanloo PhD seminar)

12

Research Approach Overview

Internet-scale Source Code Search and Analysis Framework

Analysis

Search

Semantic Web-based Code Analysis

Code Clone Search

Page 13: Source Code Clone Search (Iman keivanloo PhD seminar)

Definitions & Requirements

Search

Page 14: Source Code Clone Search (Iman keivanloo PhD seminar)

14

Clone (Source Code Clone)

• Similar code fragments

• Type 1: Identical except whitespaces …

• Type 2: Identical except variable names ...

• Type 3: Identical except a few missing…

• Type 4: Similar functionality

[Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 2009.]

for (AttributeEntity theAttributeEntity:aTableEntity.ge…
System.out.println("Hello!");

for (AttributeEntity theAttributeEntity:aTableEntity.ge…
System.out.println("Hello!");
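The first two clone types above can be illustrated with a small normalization sketch. It is not taken from the presentation: the class name `CloneTypeSketch`, the tiny keyword list, and the `$id` / `$str` placeholders are all assumptions made for illustration, and a real tool would use a proper Java lexer. Two fragments are treated as Type-1 clones if they match after whitespace is normalized, and as Type-2 clones if they also match once identifiers and string literals are abstracted away.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CloneTypeSketch {
    // Tiny illustrative keyword list (an assumption); a real tool would use a full Java lexer.
    private static final Set<String> KEYWORDS =
            Set.of("for", "if", "else", "while", "new", "return");

    // Tokens: string literal, identifier or keyword, number, or any other single symbol.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"|[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\\S");

    /** Type-1 view: whitespace differences vanish once tokens are re-joined with single spaces. */
    static String normalizeType1(String code) {
        return normalize(code, false);
    }

    /** Type-2 view: identifiers and string literals are additionally replaced by placeholders. */
    static String normalizeType2(String code) {
        return normalize(code, true);
    }

    private static String normalize(String code, boolean abstractNames) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String token = m.group();
            if (abstractNames) {
                char first = token.charAt(0);
                if (first == '"') {
                    token = "$str"; // hypothetical placeholder for string literals
                } else if ((Character.isLetter(first) || first == '_') && !KEYWORDS.contains(token)) {
                    token = "$id";  // hypothetical placeholder for identifiers
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "for (Attribute attribute:exampleSet.getAttributes()) System.out.println(\"Hello!\");";
        String b = "for (Attribute attribute:es1.getAttributes())   System.out.println(\"Test\");";
        // The fragments differ in a variable name and a literal, so they are not Type-1 clones...
        System.out.println("Type-1 match: " + normalizeType1(a).equals(normalizeType1(b)));
        // ...but they coincide once names and literals are abstracted (Type-2).
        System.out.println("Type-2 match: " + normalizeType2(a).equals(normalizeType2(b)));
    }
}
```

Replacing every identifier is the bluntest possible Type-2 normalization; a search engine may prefer to keep some names intact to protect precision.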

Page 15: Source Code Clone Search (Iman keivanloo PhD seminar)

16

Clone Search

Query

for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hello!");

Code Database

for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hi!");

for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");

for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");

for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");

for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");

for (Attribute attribute:es1.getAttributes()) System.out.println("Test");

for (Attribute attribute:exampleSet.getAttributes()) System.out.println("The end");

Page 16: Source Code Clone Search (Iman keivanloo PhD seminar)

17

Clone Search

Query

Answer

Page 17: Source Code Clone Search (Iman keivanloo PhD seminar)

18

Internet-scale Clone Search

Query

for (Attribute attribute:exampleSet.getAttributes()) System.out.println(“Hello!");

for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hi!");

for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");

Page 18: Source Code Clone Search (Iman keivanloo PhD seminar)

19

Internet-scale Real-time Clone Search

Page 19: Source Code Clone Search (Iman keivanloo PhD seminar)

20

Internet-scale Real-time Clone Search

Requirements?

Page 20: Source Code Clone Search (Iman keivanloo PhD seminar)

21

Internet-scale Real-time Clone Search

Millions LOC (~300 MLOC)

Requirements:

Page 21: Source Code Clone Search (Iman keivanloo PhD seminar)

22

Internet-scale Real-time Clone Search

Millions LOC

Requirements:

100 Milliseconds

Page 22: Source Code Clone Search (Iman keivanloo PhD seminar)

23

Internet-scale Real-time Clone Search

Millions LOC

Requirements:

100 Milliseconds

• Precision

• Recall

• Type-1, 2, 3…

for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");

for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");

for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");

for (Attribute attribute:es1.getAttributes()) System.out.println("Test");

Page 23: Source Code Clone Search (Iman keivanloo PhD seminar)

24

Internet-scale Real-time Clone Search

Millions LOC

Requirements:

100 Milliseconds

• Precision

• Recall

• Type-1, 2, 3…

Page 24: Source Code Clone Search (Iman keivanloo PhD seminar)

Research Question #1

Is it actually possible? Real-time answer (faster than 100 ms)

Page 25: Source Code Clone Search (Iman keivanloo PhD seminar)

26

• SeClone: An Internet-scale Real-time Clone Search Engine

Our Initial Analysis

Search

Analysis

Phase 1 Phase 2

[Keivanloo, ICPC’11]

Page 26: Source Code Clone Search (Iman keivanloo PhD seminar)

27

Inside SeClone

Phase 1

• Syntactical pattern matching

Phase 1 Phase 2

Phase 1: Pattern Matching

Page 27: Source Code Clone Search (Iman keivanloo PhD seminar)

28

Inside SeClone

Phase 2

• Information Retrieval & Clustering algorithm

1 for (Attribute attribute:exampleSet.getAttributes()) System.out.println("The end");

2 for (Attribute attribute:es1.getAttributes()) System.out.println("Test");

3 for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");

4 for (JAttribute attribute:formType.getAttributes()) {System.out.println("Test");

5 for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");

Phase 1: Pattern Matching

Phase 2: Semantic Matching

Page 28: Source Code Clone Search (Iman keivanloo PhD seminar)

Research Question #2

The Dilemma: How to distribute the 100 milliseconds between phases?

Pattern Matching | Semantic Matching (time-budget scale: 0, 25, 50, 75, 100 milliseconds)

[Keivanloo, WCRE’11]

Page 29: Source Code Clone Search (Iman keivanloo PhD seminar)

30

Our Further Analysis [WCRE’11]

• 100 Milliseconds

• Millions LOC

• Precision

• Recall

• Type-1, 2, 3…

The Dilemma: Pattern Matching | Semantic Matching (time-budget scale: 0, 25, 50, 75, 100 milliseconds)

Constraints

Requirements

SeClone [ICPC’11]

Data Characteristics

O ( p * log n )

Page 30: Source Code Clone Search (Iman keivanloo PhD seminar)

31

Source Code Characteristics

Page 31: Source Code Clone Search (Iman keivanloo PhD seminar)

32

Analysis of the Data Characteristics: Dataset preparation

• Name: IJaDataset

– Comprehensive (inter-project)

• To avoid project-specific results

– ~18,000 projects

– 1,500,000 unique Java classes

• No duplicate, empty, or buggy files

– ~300 MLOC

• Online at http://aseg.cs.concordia.ca/seclone

Page 32: Source Code Clone Search (Iman keivanloo PhD seminar)

33

Analysis of the Data Characteristics: Granularity Effect

• Three Level Similarity (TLS): Set of similar three-line fragments

• First Level Similarity (FLS): single-line patterns
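The two granularities can be sketched as a grouping step over source lines. This is an illustration under stated assumptions, not the presentation's implementation: the class `GranularitySketch` and the method `groupByWindow` are hypothetical names, windows are taken over consecutive lines, and "similar" is simplified to exact equality of lines that are assumed to be normalized already.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GranularitySketch {
    /**
     * Groups every window of `window` consecutive lines by its exact text.
     * window = 1 yields single-line (FLS-style) groups,
     * window = 3 yields three-line (TLS-style) groups.
     * Each value lists the 0-based start positions where the pattern occurs.
     */
    static Map<String, List<Integer>> groupByWindow(List<String> lines, int window) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (int start = 0; start + window <= lines.size(); start++) {
            String pattern = String.join("\n", lines.subList(start, start + window));
            groups.computeIfAbsent(pattern, k -> new ArrayList<>()).add(start);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "w = new Window();", "w.trigger(x);", "w.flush(o);",
                "e.trigger();",
                "w = new Window();", "w.trigger(x);", "w.flush(o);");
        System.out.println("FLS groups: " + groupByWindow(lines, 1).size());
        System.out.println("TLS groups: " + groupByWindow(lines, 3).size());
    }
}
```

Comparing the number and size of groups produced by the two window sizes on a large corpus is the kind of measurement a granularity study would report.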

Page 33: Source Code Clone Search (Iman keivanloo PhD seminar)

34

Analysis of the Data Characteristics: Clone frequency

• How many code fragments are analyzed by each query?

• Answer: 3 (Average)

Page 34: Source Code Clone Search (Iman keivanloo PhD seminar)

35

Analysis of the Data Characteristics: Clone frequency

• Observation result:

– TLS distributes the candidates into 3.9 times more groups than FLS

– Its groups are 6 times smaller than those of FLS

Page 35: Source Code Clone Search (Iman keivanloo PhD seminar)

36

Analysis of the Data Characteristics: Clone frequency

• Conclusion:

– TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly

– Why?

• (1) each TLS group has 2.37 members on average

• (2) it distributes candidates in small-size groups

• (3) for each query, only one group must be evaluated
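Point (3) can be sketched as a hash-map lookup: if every three-line pattern is stored under one key, a query touches exactly one small group instead of scanning the corpus. The class `TlsIndexSketch`, its methods, and the use of `String.hashCode()` as the key function are assumptions for illustration; the slides do not specify the index layout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TlsIndexSketch {
    // Maps a key derived from three lines to the locations where that pattern occurs.
    private final Map<Integer, List<String>> index = new HashMap<>();

    /** Key for a three-line pattern; String.hashCode() is an assumed stand-in hash. */
    static int key(String l1, String l2, String l3) {
        return (l1 + "\n" + l2 + "\n" + l3).hashCode();
    }

    /** Registers one occurrence of a three-line pattern at the given (hypothetical) location label. */
    void add(String location, String l1, String l2, String l3) {
        index.computeIfAbsent(key(l1, l2, l3), k -> new ArrayList<>()).add(location);
    }

    /** Returns the single candidate group for a query; no other group is inspected. */
    List<String> lookup(String l1, String l2, String l3) {
        return index.getOrDefault(key(l1, l2, l3), Collections.emptyList());
    }

    public static void main(String[] args) {
        TlsIndexSketch idx = new TlsIndexSketch();
        idx.add("A.java:10", "w = new Window();", "w.trigger(x);", "w.flush(o);");
        idx.add("B.java:89", "w = new Window();", "w.trigger(x);", "w.flush(o);");
        idx.add("C.java:59", "e = new Event(50);", "e.trigger();", "e.update();");
        System.out.println(idx.lookup("w = new Window();", "w.trigger(x);", "w.flush(o);"));
    }
}
```

With groups of only a few members on average, the cost of a query is dominated by one map access plus a short list scan, which is why small group sizes matter for the response-time budget.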

Page 36: Source Code Clone Search (Iman keivanloo PhD seminar)

37

What Does an Outlier Look Like?

• Outlier Definition: patterns with more than 2,000 occurrences

• Observation result:

• Only ~1000 patterns out of 30M

• ~ 0.01% patterns

• Mostly insignificant code patterns
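The outlier definition above translates directly into a frequency filter. The sketch below is only an illustration: `OutlierSketch`, `countOccurrences`, and `outliers` are hypothetical names, and how the actual engine treats flagged patterns is not stated on the slide.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class OutlierSketch {
    // Threshold taken from the slide: patterns with more than 2,000 occurrences.
    static final int OUTLIER_THRESHOLD = 2000;

    /** Counts how often each pattern occurs in a stream of (already normalized) patterns. */
    static Map<String, Integer> countOccurrences(List<String> patterns) {
        Map<String, Integer> counts = new HashMap<>();
        for (String p : patterns) {
            counts.merge(p, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the patterns whose occurrence count is strictly greater than the threshold. */
    static Set<String> outliers(Map<String, Integer> counts, int threshold) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts purely for demonstration.
        Map<String, Integer> counts = Map.of("}", 5000, "return null;", 2000, "w.trigger(x);", 3);
        System.out.println(outliers(counts, OUTLIER_THRESHOLD));
    }
}
```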

Page 37: Source Code Clone Search (Iman keivanloo PhD seminar)

38

Analysis of the Data Characteristics: Sampling efficiency

• Can sampling be used to reduce the amount of data being analyzed?

• Answer: Yes (e.g., 33% contains 91% of popular patterns)

Page 38: Source Code Clone Search (Iman keivanloo PhD seminar)

39

Analysis of the Data Characteristics: Indexing

• Can 32-bit hash keys (versus MD5) be used without affecting index quality?

abc 123 abc 123 aXc 456 aXc 123

• Answer: Yes (0.002% error rate)

Only 10 cases of the same key for three distinct strings
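The trade-off behind this question can be sketched as follows: a 32-bit key takes 4 bytes per index entry against 16 bytes for an MD5 digest, at the price of occasional collisions between distinct patterns. The slides do not name the 32-bit hash function, so `String.hashCode()` is used here purely as an assumed stand-in; `HashKeySketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HashKeySketch {
    /** 32-bit key for a pattern; String.hashCode() is an assumed stand-in hash function. */
    static int key32(String pattern) {
        return pattern.hashCode();
    }

    /** 128-bit MD5 digest of the same pattern, for comparison of key sizes. */
    static byte[] md5(String pattern) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(pattern.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Number of 32-bit keys that are shared by two or more distinct patterns. */
    static int collidingKeys(Collection<String> patterns) {
        Map<Integer, Set<String>> byKey = new HashMap<>();
        for (String p : patterns) {
            byKey.computeIfAbsent(key32(p), k -> new HashSet<>()).add(p);
        }
        int colliding = 0;
        for (Set<String> distinct : byKey.values()) {
            if (distinct.size() > 1) {
                colliding++;
            }
        }
        return colliding;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known String.hashCode() collision.
        List<String> patterns = List.of("Aa", "BB", "abc");
        System.out.println("MD5 key size in bytes: " + md5("abc").length);
        System.out.println("Colliding 32-bit keys: " + collidingKeys(patterns));
    }
}
```

Running such a count over all distinct patterns of a corpus gives an error rate comparable in kind to the one reported on the slide, though the figure depends on the hash function actually used.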

Page 39: Source Code Clone Search (Iman keivanloo PhD seminar)

40

Are Method Names Reliable?

• Input data: Koders 1-year query log

– ~10M records

• Observation purpose:

– Importance of method names

• Observation result:

– 98% success rate vs. 69%

• Result interpretation:

– Method names in this context are a reliable source of information

– They must be preserved to increase precision

Page 40: Source Code Clone Search (Iman keivanloo PhD seminar)

41

Source Code Search Framework

Page 41: Source Code Clone Search (Iman keivanloo PhD seminar)

42

Internet-scale Real-time Code Clone Search via Multi-level Indexing

– Internet-scale & Speed

• 32-bit Hash values

– Type-3 clone

• Multi-level indexing

– Customized for Internet-scale Code Search

• Special transformation rule

Page 42: Source Code Clone Search (Iman keivanloo PhD seminar)

43

Response Time (Pattern Matching) [WCRE’11]

• Regular queries

– 25 microseconds

• 99.99% queries

– 900 microseconds

Page 43: Source Code Clone Search (Iman keivanloo PhD seminar)

44

Conclusion

Page 44: Source Code Clone Search (Iman keivanloo PhD seminar)

45

Answer: Research Question #1

Is Internet-scale Real-time Code Search Possible?

YES

Page 45: Source Code Clone Search (Iman keivanloo PhD seminar)

Answer: Research Question #2

The Dilemma: How to distribute the 100 milliseconds between phases?

Pattern Matching: 1 millisecond

Semantic Matching: 99 milliseconds

Page 46: Source Code Clone Search (Iman keivanloo PhD seminar)

Pattern Matching | Semantic Matching (time-budget scale: 0, 25, 50, 75, 100 milliseconds)

99 milliseconds

Research Opportunity

Analysis

Page 47: Source Code Clone Search (Iman keivanloo PhD seminar)

48

Summary

Step 1

• Studied characteristics of source code on the Internet

– Unique patterns distribution (sampling application)

– Pattern frequencies (multi-level search)

– 32-bit hashing strength (code pattern)

– Outlier patterns

– Method name importance

Step 2

• Designed an Internet-scale clone search

– Customized for code search (precision)

– Fine granularity

– Multi-level Indexing approach (Type-3 clone)

– Microsecond range response time (up to 10 times faster)

Page 48: Source Code Clone Search (Iman keivanloo PhD seminar)

49

Publication

Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/)

• Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland.

• Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario, Canada.

Source Code Sharing using Linked Data (secold.org)

• Iman Keivanloo, Chris Forbes, Juergen Rilling, Philippe Charland. Towards Sharing Source Code Facts Using Linked Data. ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE), 2011.

Source Code Search (http://aseg.cs.concordia.ca/codesearch)

• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 35, San Francisco, USA.

• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM), Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.

Page 49: Source Code Clone Search (Iman keivanloo PhD seminar)

50

Questions?

Thank you for your kind attention

PhD Seminar

Computer Science and Software Engineering Department

November-17-2011