![Page 1: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/1.jpg)
Detecting Similar Software Applications
Collin McMillan, Mark Grechanik, and Denys Poshyvanyk
The College of William and Mary and The University of Illinois at Chicago
![Page 2: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/2.jpg)
Find Identical Penmen
![Page 3: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/3.jpg)
Finding Similar Web Pages
![Page 4: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/4.jpg)
Similar Web Pages For ACM SigSoft
![Page 5: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/5.jpg)
Similar Web Pages For Microsoft
Open-source free software!
![Page 6: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/6.jpg)
Similar Applications
Software applications are similar if they implement related semantic requirements
![Page 7: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/7.jpg)
Example: RealPlayer and Windows Player
![Page 8: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/8.jpg)
Why Detect Similar Applications?
![Page 9: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/9.jpg)
Economic Importance of Detecting Similar Applications
• Consulting companies have accumulated tens of thousands of software applications in their repositories, built over the past 50 years!
• These applications are a treasure trove of knowledge that can be reused from applications successfully delivered in the past.
• Detecting similar applications and reusing their components will save time and resources and increase the chances of winning future bids.
![Page 10: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/10.jpg)
An Overview of the Process
Input Application
Detector of Similar Applications
Similar Applications
Software Repository
![Page 11: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/11.jpg)
Spiral Model, Bidding, and Prototyping
![Page 12: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/12.jpg)
Spiral Model, Bidding, and Prototyping
Building prototypes repeatedly from scratch is expensive since these prototypes are often discarded after receiving feedback from stakeholders
![Page 13: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/13.jpg)
Since prototypes are approximations of desired resulting applications, similar applications from software repositories can serve as prototypes because they are relevant to your requirements
![Page 14: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/14.jpg)
Problem
Detecting similar applications in a timely manner can lead to significant economic benefits.
Two applications are similar to each other if they implement some features that are described by the same abstraction.
Mismatch between the high-level intent reflected in the descriptions of these applications and low-level implementation details.
Programmers rarely choose meaningful names that correctly reflect the concepts or abstractions that they implement
![Page 15: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/15.jpg)
Detecting Similar Applications Is Very Difficult
Currently, detecting similar applications is like looking for a needle in a haystack!
![Page 16: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/16.jpg)
Mizzaro’s Conceptual Framework
![Page 17: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/17.jpg)
Our Hypothesis
![Page 18: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/18.jpg)
Closely reLated ApplicatioNs
Find proper weights for these semantic anchors
Detect co-occurrences of semantic anchors that form patterns of implementing different requirements.
Find reliable semantic anchors
How to do that?
![Page 19: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/19.jpg)
CLAN!!!
Closely reLated ApplicatioNs (CLAN) – www.javaclan.net
Find proper weights for these semantic anchors
Detect co-occurrences of semantic anchors that form patterns of implementing different requirements.
Find reliable semantic anchors
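The weighting step named on this slide can be illustrated with a minimal sketch. It assumes the semantic anchors are API names and uses plain TF-IDF as the weighting scheme; the slides do not say which scheme CLAN uses, so both the scheme and the toy data below are assumptions, and `tfidf_weights` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_weights(apps):
    """Return {app: {anchor: weight}} using a plain TF-IDF scheme.

    `apps` maps an application name to the list of API names it calls.
    TF-IDF is an assumed weighting scheme, not necessarily the one CLAN uses.
    """
    n_apps = len(apps)
    # Document frequency: in how many applications each anchor occurs.
    doc_freq = Counter()
    for calls in apps.values():
        doc_freq.update(set(calls))

    weights = {}
    for name, calls in apps.items():
        counts = Counter(calls)
        total = len(calls)
        weights[name] = {
            anchor: (count / total) * math.log(n_apps / doc_freq[anchor])
            for anchor, count in counts.items()
        }
    return weights


# Hypothetical data: three toy applications and the API packages they use.
apps = {
    "player_a": ["javax.sound.midi", "javax.sound.midi", "java.lang"],
    "player_b": ["javax.sound.midi", "java.lang"],
    "editor_c": ["javax.swing.text", "java.lang"],
}
weights = tfidf_weights(apps)
```

An anchor used by every application (here `java.lang`) gets weight zero, while an anchor used by only one application gets the largest weight, which is the intuition behind down-weighting ubiquitous API calls.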
![Page 20: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/20.jpg)
Latent Semantic Analysis (LSA)
![Page 21: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/21.jpg)
Latent Semantic Analysis (LSA)
term-by-document matrix (m x n) = term-by-dims matrix (m x r) x dims-by-dims matrix (r x r) x dims-by-document matrix (r x n)
LSA
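The factorization pictured on this slide has the shape of a singular value decomposition. A hedged way to write it in standard notation follows; the symbols $A$, $U$, $\Sigma$, and $V$ are assumptions and do not appear on the slide.

```latex
A_{m \times n} \;=\; U_{m \times r}\;\Sigma_{r \times r}\;V^{\top}_{r \times n}
```

Here $A$ is the term-by-document matrix, $U$ maps terms to the $r$ latent dimensions, $\Sigma$ is the dims-by-dims matrix, and $V^{\top}$ maps the latent dimensions to documents.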
![Page 22: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/22.jpg)
The Architecture of CLAN
Metadata Extractor
API Archive
Apps Archive
Applications Metadata
TDM Builder
TDMP, TDMC
LSI Algorithm
Search Engine
||P||
||C||
Similarity Matrix
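The path from a term-document matrix (TDM) to a similarity matrix can be sketched as follows. This is a simplified illustration, not CLAN's implementation: it skips the LSI step shown in the architecture and computes cosine similarity on raw counts, and the application names, package names, and helper functions are all hypothetical.

```python
import math
from collections import Counter


def build_tdm(apps):
    """Build a term-document matrix as {app: Counter(term -> count)}."""
    return {name: Counter(terms) for name, terms in apps.items()}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(tdm):
    """Pairwise cosine similarities between all applications."""
    names = sorted(tdm)
    return {a: {b: cosine(tdm[a], tdm[b]) for b in names} for a in names}


# Hypothetical package-level terms for three toy applications.
apps = {
    "midi_recorder": ["javax.sound.midi", "java.io", "java.io"],
    "midi_player": ["javax.sound.midi", "java.io"],
    "text_editor": ["javax.swing", "java.io"],
}
sim = similarity_matrix(build_tdm(apps))
```

Running the same functions once on package names and once on class names would give two matrices, loosely mirroring the separate package and class paths in the diagram.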
![Page 23: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/23.jpg)
CLAN UI
![Page 24: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/24.jpg)
• Goal: to compare CLAN with MUDABlue and Combined
• MUDABlue [Kawaguchi’06] provides automatic categorization of applications using underlying words in source code
• Implemented a feature of MUDABlue for computing similarities among apps using ALL IDENTIFIERS from source code
• CLAN + MUDABlue = Combined (Words + APIs)
• Instantiated CLAN, MUDABlue and Combined on the same repository of 8,310 Java applications
Empirical Evaluation
![Page 25: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/25.jpg)
• A user study with 33 Java student programmers from the University of Illinois at Chicago
• 21 graduate students
• 12 upper-level undergraduate students
• 15 participants reported between 1 and 3 years of Java programming experience
• 11 participants reported more than 3 years of Java programming experience
• 16 participants reported prior experience with search engines
• 8 reported that they never used code search engines
Empirical Evaluation
![Page 26: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/26.jpg)
Cross‐Validation Design
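| Experiment | Group | Approach | Task Set |
|---|---|---|---|
| 1 | A | CLAN | T1 |
| 1 | B | MUDABlue | T2 |
| 1 | C | Combined | T3 |
| 2 | A | Combined | T2 |
| 2 | B | CLAN | T3 |
| 2 | C | MUDABlue | T1 |
| 3 | A | MUDABlue | T3 |
| 3 | B | Combined | T1 |
| 3 | C | CLAN | T2 |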
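The rotation in the design above can be reproduced with a short sketch. The function name `build_design` and the index arithmetic are illustrative choices, not taken from the study materials; they simply regenerate the assignments listed on the slide.

```python
def build_design(groups, approaches, task_sets):
    """Rotate approaches and task sets across experiments (Latin-square style)."""
    n = len(groups)
    design = []
    for experiment in range(n):
        rows = []
        for g, group in enumerate(groups):
            # Approaches shift one way and task sets the other, so each group
            # sees every approach and every task set exactly once.
            approach = approaches[(g - experiment) % n]
            tasks = task_sets[(g + experiment) % n]
            rows.append((group, approach, tasks))
        design.append(rows)
    return design


design = build_design(
    ["A", "B", "C"],
    ["CLAN", "MUDABlue", "Combined"],
    ["T1", "T2", "T3"],
)
for number, rows in enumerate(design, start=1):
    for group, approach, tasks in rows:
        print(number, group, approach, tasks)
```

Because every group uses every engine on a different task set, differences between engines are less likely to be artifacts of one group's skill or one task set's difficulty.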
![Page 27: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/27.jpg)
Large Case Studies are Rare
“First, it is very difficult to scale human experiments to get quantitative, significant measures of usefulness; this type of large‐scale human study is very rare.
Second, comparing different recommenders using human evaluators would involve carefully designed, time‐consuming experiments; this is also extremely rare.”
Saul, Filkov, Devanbu, Bird. Recommending Random Walks, ESEC/FSE ’07
![Page 28: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/28.jpg)
1) Receive Task and search for Apps using the Search Engine
2) Translate Task to Query, Enter into Search Engine
3) Identify the relevant source App
4) Find target applications using a similarity Engine
Participants’ Role
Recording music data into a MIDI file
![Page 29: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/29.jpg)
1) Completely irrelevant – there is absolutely nothing that the participant can use from the retrieved code fragment; nothing in it is related to the keywords that the participant chose based on the descriptions of the tasks.
2) Mostly irrelevant – a retrieved code fragment is only remotely relevant to a given task; it is unclear how to reuse it.
3) Mostly relevant – a retrieved code fragment is relevant to a given task and participant can understand with some modest effort how to reuse it to solve a given task.
4) Highly relevant – the participant is highly confident that the code fragment can be reused and clearly sees how to use it.
Likert Scale ‐ Confidence
![Page 30: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/30.jpg)
Metrics: Confidence (C), Precision (P)
Analysis of the Results
| Similarity Engine | Apps Entered | Apps Rated |
|---|---|---|
| CLAN | 33 | 304 |
| MUDABlue | 33 | 322 |
| Combined | 33 | 322 |
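The two metrics can be computed from a participant's ratings roughly as sketched below. The slides name the metrics but do not define them, so the definitions here are assumptions: confidence is taken as the mean rating on the 1 to 4 scale, and precision as the fraction of rated applications scored 3 or 4. The ratings are made up for illustration.

```python
def mean_confidence(ratings):
    """Average confidence rating on the 1 to 4 scale."""
    if not ratings:
        return 0.0
    return sum(ratings) / len(ratings)


def precision(ratings, threshold=3):
    """Fraction of rated applications at or above `threshold`.

    The threshold of 3 ("mostly relevant") is an assumption.
    """
    if not ratings:
        return 0.0
    relevant = sum(1 for r in ratings if r >= threshold)
    return relevant / len(ratings)


# Hypothetical ratings one participant gave to five retrieved applications.
ratings = [4, 3, 1, 2, 4]
c = mean_confidence(ratings)
p = precision(ratings)
```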
![Page 31: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/31.jpg)
Null hypothesis (H0): There is no difference in the values of confidence level and precision per task between participants who use MUDABlue, Combined, and CLAN.
Alternative hypothesis (H1): There is a statistically significant difference in the values of confidence level and precision between participants who use MUDABlue, Combined, and CLAN.
Hypotheses
![Page 32: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/32.jpg)
H1: Confidence of CLAN vs. MUDABlue
H2: Precision of CLAN vs. MUDABlue
H3: Confidence of CLAN vs. Combined
H4: Precision of CLAN vs. Combined
H5: Confidence of MUDABlue vs. Combined
H6: Precision of MUDABlue vs. Combined
Hypotheses Tested
![Page 33: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/33.jpg)
Results – Confidence
p < 4.4·10⁻⁷
F 5.02
Fcrit 1.97
![Page 34: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/34.jpg)
Results – Precision
p < 0.02
F 2.43
Fcrit 2.04
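On both results slides F exceeds Fcrit (5.02 > 1.97 for confidence, 2.43 > 2.04 for precision), which is the condition for rejecting the null hypothesis. The sketch below shows how such an F value is computed, assuming it comes from a one-way ANOVA, which the slides do not state explicitly. The sample groups are invented for illustration and are not the study's data.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of samples (one list per group)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual variation inside each group.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rejects_null(f_value, f_critical):
    """The null hypothesis is rejected when F exceeds the critical value."""
    return f_value > f_critical


# Invented ratings for three engines; not taken from the study.
sample = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_statistic(sample)
```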
![Page 35: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/35.jpg)
H1: Confidence of CLAN vs. MUDABlue
H2: Precision of CLAN vs. MUDABlue
H3: Confidence of CLAN vs. Combined
H4: Precision of CLAN vs. Combined
H5: Confidence of MUDABlue vs. Combined
H6: Precision of MUDABlue vs. Combined
Accepted and Rejected Alternative Hypotheses
![Page 36: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/36.jpg)
“This search engine is better than MUDABlue because of the extra information provided within the results.”
“I think this is a helpful tool in finding the code one is looking for, but it can be very hit or miss. The hits were very relevant (4’s) and the misses were completely irrelevant (1’s or 2’s).”
“Good comparison of API calls.”
“By using API calls I was able to compare the applications very easily.”
Responses from Programmers
![Page 37: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/37.jpg)
“However, it would be nice to see within the results the actual code, which made calls to function X or used library X”
“While this search engine finds apps which use relevant libraries it does not make it easy to find relevant sections within those projects. It would be helpful if there was functionality to better analyze the results”
“Rank API calls, ignore less significant API calls to return better relevant search results.”
Suggestions from Programmers
![Page 38: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/38.jpg)
• Participants: proficiency in Java, development experience, and motivation
• Selecting tasks for the experiment: too general or too specific?
• On the use of Java SDK APIs
Threats to Validity
![Page 39: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/39.jpg)
All Engines are Publicly Available
CLAN: http://www.javaclan.net/
MUDABlue: http://www.mudablue.net/
Combined: http://clancombined.net/
Case Study Tasks and Responses are available: http://www.cs.wm.edu/semeru/clan/
Improving User Interface
Comparison of API calls, show source code
Generate explanations on why apps are similar
Ongoing Improvements
![Page 40: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/40.jpg)
Conclusions
![Page 41: Detecting Similar Software Applications - William & Marydenys/pubs/talks/ICSE'12-CLAN.pdf · Detecting Similar Software Applications Collin McMillan, Mark Grechanik, and Denys Poshyvanyk](https://reader033.vdocuments.site/reader033/viewer/2022060405/5f0f27f77e708231d442c398/html5/thumbnails/41.jpg)
http://www.javaclan.net