
19/04/23 Dr Andy Brooks 1

Lectures 9 & 10

CCFinder: A Tool to Detect Clones

“I can just copy these lines. That is the safest thing to do. The code has been tested after all.”

“What a mess. This code has been copied, then changed a bit, all over the code base.”

MSc Software Maintenance


Case Study

Reference: CCFinder: A Multi-Linguistic Token-based Code Clone Detection System For Large Scale Source Code, Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University.
http://sel.ist.osaka-u.ac.jp/~kamiya/
http://sel.ics.es.osaka-u.ac.jp/cdtools/index.html.en

(einrækt: Icelandic for cloning)


Reasons For Clones

• Copy-and-paste
– one of the easiest ways to re-use code
– one of the safest ways to re-use code in legacy applications as the original code base is unaltered
• Mental macro
– frequently coded computations are remembered and coded the same way
• Repeated code portions for performance
– inlined code is faster than called code
• Systematic code generation from a single base
– several variations of code needed


The Problem With Clones

• It is difficult to consistently modify source files with many clones.

• When a fault is found, the engineer has to identify all occurrences in every subsystem.

• In large and complex systems, there can be dozens of engineers, each working on only one subsystem.

• Documenting the existence of clones as they are introduced does not happen in practice.


Motivation For CCFinder

• Government software system
• 1 million lines of code
• 2 thousand modules
• Written in COBOL and PL/I-like language
• Developed over 20 years ago
• Continually maintained by a large number of engineers
• Suspected that clones heavily reduce maintainability of system


Underlying Concepts Of CCFinder

• Industrial strength
– deals with million-line size systems
  • without excessive demands on time or memory
  • token-by-token matching more expensive than line-by-line
– several optimization techniques employed
• Report only interesting clones
– apply heuristic knowledge to remove unwanted clones
• Copy-and-paste detection
– deal with variable renaming and other small changes
• Limited language dependence
– easy to adapt tool to specific languages
  • adaptation for Java took two person days


Definitions and Terms

• A clone relation holds between two code portions if and only if they are the same sequence.

• A pair of code portions is called a clone pair if the clone relation holds between the portions.

• A clone class is a maximal set of code portions in which a clone relation holds between any pair of code portions.

• In CCFinder, clone relations are determined for transformed token sequences.


Example token sequence (12 tokens): a x y z b x y z c x y d

Brackets mark the code portions belonging to each clone class.

• Clone class C1
– a [x y z] b [x y z] c x y d
• Clone class C2
– a [x y] z b [x y] z c [x y] d
• Note how the 3rd x y is not in C1.
• Clone class C3
– a x [y z] b x [y z] c x y d
• The portions of C3 are contained in those of C1, so this class is not of interest because it is not maximal.
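The 12-token example can be checked with a small brute-force sketch. It is only an illustration of the definitions, not CCFinder's suffix-tree algorithm: the function name `clone_classes`, the minimum run length of two tokens, and the rule used for "maximal" (a class is dropped when every occurrence can be extended by the same neighbouring token) are assumptions made here. For this input it reports the run x y z at two positions and x y at three, and drops sub-portions such as y z.

```python
from collections import defaultdict


def clone_classes(tokens, min_len=2):
    """Group repeated token runs and keep only the maximal groups.

    Returns a dict mapping each repeated run (a tuple of tokens) to the list
    of 0-based start positions where it occurs.
    """
    n = len(tokens)
    occurrences = defaultdict(list)
    for length in range(min_len, n):
        for start in range(n - length + 1):
            occurrences[tuple(tokens[start:start + length])].append(start)

    classes = {}
    for run, starts in occurrences.items():
        if len(starts) < 2:
            continue  # a run that occurs only once has no clone
        # Tokens just before and just after every occurrence (None at the ends).
        before = {tokens[s - 1] if s > 0 else None for s in starts}
        after = {tokens[s + len(run)] if s + len(run) < n else None
                 for s in starts}
        grows_left = len(before) == 1 and None not in before
        grows_right = len(after) == 1 and None not in after
        if not (grows_left or grows_right):
            classes[run] = starts  # no common extension: the class is maximal
    return classes


example = "a x y z b x y z c x y d".split()
for run, starts in clone_classes(example).items():
    print(" ".join(run), "at token positions", starts)
```

The nested loops enumerate every run, so the cost grows at least quadratically with the input; the sketch is meant for reading, not for million-line systems.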


Identification Of Structures

• A code portion that begins in the middle of a function definition and ends some way through another function definition can be very difficult to rewrite as shared code.
– CCFinder separates each function definition.

• A code portion that is part of table initialization code can be very difficult to rewrite as shared code.
– CCFinder identifies table definition code.


Clone Detection Process

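[Figure: the clone detection process in four numbered steps: 1. lexical analysis, 2. transformation, 3. match detection, 4. formatting.]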


1. Lexical Analysis

• Source files are divided into tokens according to the rules of the language.

• The tokens from all source files are concatenated into a single sequence of tokens.

• Whitespace, newlines, tabs, and comments between tokens are removed.
– These are sent to ‘Formatting’ to enable reconstruction of the original source files.
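A minimal sketch of this step, assuming C-style comments and a deliberately tiny token grammar (identifiers, integers, single characters; string literals are not handled). The names `TOKEN_RE` and `lex` are hypothetical. Each token keeps its file name and line number, which stands in for the information passed on to the formatting step.

```python
import re

# Assumed, simplified token grammar; a real lexer follows the language rules.
TOKEN_RE = re.compile(r"""
    (?P<comment> //[^\n]* | /\*.*?\*/ )   # line comment or block comment
  | (?P<space>   \s+ )                    # spaces, tabs, newlines
  | (?P<token>   [A-Za-z_]\w* | \d+ | . ) # identifier, integer, or one other character
""", re.VERBOSE | re.DOTALL)


def lex(files):
    """Turn (name, source) pairs into one concatenated token sequence.

    Each entry is (token text, file name, line number); whitespace and
    comments are dropped but still advance the line counter.
    """
    sequence = []
    for name, source in files:
        line = 1
        for match in TOKEN_RE.finditer(source):
            if match.lastgroup == "token":
                sequence.append((match.group(), name, line))
            line += match.group().count("\n")
    return sequence


files = [
    ("a.c", "int x = 1; // set x\ny = x;\n"),
    ("b.c", "/* hi\nthere */ z;"),
]
for text, name, line in lex(files):
    print(name, line, text)
```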


2.1 Transformation By Rules

[Figures: transformation rules, shown over two slides.]


2.2 Parameter Replacement

• After 2.1, identifiers for types, variables, and constants are replaced with a special token.

3. Match Detection

• All clone pairs are detected.
– (LeftBegin, LeftEnd, RightBegin, RightEnd) with respect to the token sequence

4. Formatting

• Locations of clone pairs are converted into line numbers in the original source files.
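Steps 2.2 to 4 can be sketched end to end on a three-line toy input. Everything named here is an assumption for illustration: the special token is written `$p`, the keyword list is a hypothetical subset, and match detection is a plain scan over every shift between two positions rather than CCFinder's suffix tree. The scan does not guard against a portion overlapping its own clone.

```python
import re

# Hypothetical keyword list; a real tool would use the full list for its language.
KEYWORDS = {"if", "else", "while", "for", "return"}
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def replace_parameters(tokens):
    """Step 2.2: replace every non-keyword identifier with the special token $p."""
    return ["$p" if IDENTIFIER.fullmatch(t) and t not in KEYWORDS else t
            for t in tokens]


def find_clone_pairs(seq, min_len):
    """Step 3: report (LeftBegin, LeftEnd, RightBegin, RightEnd) token indexes."""
    n = len(seq)
    pairs = []
    for shift in range(1, n):            # distance between left and right portion
        run = 0
        for i in range(n - shift + 1):   # the extra last step flushes an open run
            if i < n - shift and seq[i] == seq[i + shift]:
                run += 1
                continue
            if run >= min_len:
                pairs.append((i - run, i - 1, i - run + shift, i - 1 + shift))
            run = 0
    return pairs


def to_lines(pair, token_lines):
    """Step 4: convert token indexes into (first line, last line) ranges."""
    lb, le, rb, re_ = pair
    return (token_lines[lb], token_lines[le]), (token_lines[rb], token_lines[re_])


source = ["a = b + c ;", "x = y + z ;", "return x ;"]
tokens, token_lines = [], []
for line_no, line in enumerate(source, start=1):
    for tok in line.split():
        tokens.append(tok)
        token_lines.append(line_no)

transformed = replace_parameters(tokens)
pairs = find_clone_pairs(transformed, min_len=6)
print(pairs)
print([to_lines(p, token_lines) for p in pairs])
```

After replacement the first two lines become the same six-token run, so they form one clone pair even though every variable name differs.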


Sample Code

[Figure: sample source code listing, with asterisks marking four places in the code.]


Sample Code Transformed 2.1

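[Figure: the sample code after transformation by rules (2.1), with asterisks marking four places in the code.]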


Sample Code Transformed 2.2

Clone pairs
• Lines 1:7 and 11:17
• Lines 8:10 and 19:21


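Matrix Visualization

[Figure: matrix plot of the clone pairs in the sample code; the axes are marked in tokens and lines, with lines 11, 17, 19, and 21 labelled.]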


Metrics For Clone Pairs/Classes

• LEN(p), LEN(C)
– Length can be measured in tokens, SLOC, and LOC (LOC excludes null or comment lines).
– The token length of each portion of a clone class is identical when measured on the transformed token sequence.
– LOC is used in the following metric definitions.
• POP(C)
– The number of elements in a clone class C.
– A large POP means similar code portions appear in many places.


Metrics For Clone Pairs/Classes

• DFL(C)
– Deflation is an estimate of how much code is removed when a clone class is rewritten as shared code.
– Suppose USELEN(C) is the length of the caller statement.
– DFL(C) = LEN(C) × POP(C) − (USELEN(C) × POP(C) + LEN(C))
• COVERAGE (%LOC)
– percentage of lines that include any portion of a clone
• COVERAGE (%FILE)
– percentage of files that include any clones
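The deflation formula and the two coverage percentages are simple enough to compute directly. The sketch below assumes made-up numbers and a hypothetical input layout for `coverage` (file name mapped to total lines and the set of lines inside any clone); only the formulas come from the slide.

```python
def dfl(len_c, pop_c, uselen_c):
    """Deflation: lines saved if the clone class becomes one shared routine.

    LEN x POP lines exist today; afterwards there are POP caller statements
    of USELEN lines each plus one copy of the shared code.
    """
    return len_c * pop_c - (uselen_c * pop_c + len_c)


def coverage(files):
    """Return (COVERAGE %LOC, COVERAGE %FILE).

    `files` maps a file name to (total lines, set of line numbers that
    include any portion of a clone).
    """
    total_lines = sum(total for total, _ in files.values())
    cloned_lines = sum(len(cloned) for _, cloned in files.values())
    files_with_clones = sum(1 for _, cloned in files.values() if cloned)
    return (100.0 * cloned_lines / total_lines,
            100.0 * files_with_clones / len(files))


# A clone class of 20-line portions appearing 5 times, with 1-line callers:
# 20 * 5 - (1 * 5 + 20) = 75 lines removed.
print(dfl(len_c=20, pop_c=5, uselen_c=1))

example = {"a.c": (100, set(range(1, 21))), "b.c": (100, set())}
print(coverage(example))
```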


Metrics For Clone Pairs/Classes

[Figure: illustration of RAD(C).]

• RAD(C) is the maximum length of path from each file (containing a clone code portion belonging to C) to the lowest common ancestor.

• If all code portions of C are included in one file then RAD(C) = 0.

• A large RAD implies code portions spread throughout different subsystems.
– Making maintenance difficult if each subsystem is maintained by different engineers.
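One way to read the RAD definition is as a computation over file paths: find the deepest directory shared by all files of the clone class, then take the longest remaining path to any file. The sketch below follows that reading; the function name `rad`, the use of slash-separated relative paths, and the example paths are assumptions.

```python
from pathlib import PurePosixPath


def rad(file_paths):
    """Longest path from any file of a clone class to the lowest common ancestor.

    `file_paths` lists the file of every code portion in the class (repeats
    allowed, at least one entry). A class confined to one file gives 0.
    """
    parts = [PurePosixPath(p).parts for p in set(file_paths)]
    common = 0
    for level in zip(*parts):        # walk down the paths level by level
        if len(set(level)) != 1:
            break                    # paths diverge below this level
        common += 1
    return max(len(p) - common for p in parts)


print(rad(["java/awt/A.java", "java/awt/A.java"]))          # one file
print(rad(["java/awt/A.java", "java/awt/B.java"]))          # same directory
print(rad(["java/awt/A.java", "javax/swing/plaf/B.java"]))  # different subsystems
```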


CCFinder Time and Space Complexities

• CCFinder uses a suffix-tree algorithm with a time and space complexity of O(n).
• Complexity measurements were made on a PC (Pentium 4, 1.5 GHz, 640 MB RAM) using subsets of various sizes of the Linux 2.4.9 source files (2600K lines).


Leading Token Restriction Optimization

• Reporting code portions that begin and end in the middle of statements as clones is not very useful.

• Leading tokens at the beginning of clones are restricted to labels or keywords that either initiate or terminate statements.

• Leading token restriction reduces the number of nodes in the suffix tree to one third in the C, C++, and Java case studies.
– Very important restriction to make the tool scalable.
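The effect of the restriction can be pictured as choosing which positions of the token sequence may start a suffix. The sketch below is a simplification under stated assumptions: it treats a position as allowed when it is the first token or directly follows one of the assumed boundary tokens `;`, `{`, `}`. CCFinder's actual rule, based on labels and keywords, is language specific and is not reproduced here.

```python
# Assumed statement boundaries for a C-like language.
STATEMENT_BOUNDARIES = {";", "{", "}"}


def leading_positions(tokens):
    """Positions where a clone is allowed to begin under the assumed rule."""
    return [i for i in range(len(tokens))
            if i == 0 or tokens[i - 1] in STATEMENT_BOUNDARIES]


tokens = "x = 1 ; if ( x ) { y = 2 ; }".split()
starts = leading_positions(tokens)
print(starts)
print(len(starts), "of", len(tokens), "positions would enter the suffix tree")
```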


Repeated Code Removal Optimization

• The clone class {a2,a3,a4,a5,a6,b1-b3} will be detected.

• 6C2 = 15 clone pairs

a1 switch (c) {
a2 case '0': value = 0; break;
a3 case '1': value = 1; break;
a4 case '2': value = 2; break;
a5 case '3': value = 3; break;
a6 case '4': value = 4; break;
a7 }

b1 case 'a':
b2 flag = 2;
b3 break;


Repeated Code Removal Optimization

• To reduce the number of clone pairs, when building a suffix tree, after the first identification (repetition of a2 at a3), succeeding repetitions are not inserted.

• Clone pair (a2,b1-b3) is still reported.

• Repeated code removal is also said to stop reporting of self clones e.g. (a2-a5,a3-a6).
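A rough sketch of the idea, assuming the unit of repetition is a whole transformed statement: within a run of identical consecutive units, only the first two are kept (the original and its first repetition), and later repetitions are skipped. The function name `units_to_insert` and the `$p` notation for replaced identifiers are assumptions; the real optimization works inside the suffix-tree construction.

```python
import math


def units_to_insert(units):
    """Indexes of the units kept when later repetitions in a run are skipped."""
    keep = []
    run = 0
    for i, unit in enumerate(units):
        run = run + 1 if i > 0 and unit == units[i - 1] else 1
        if run <= 2:                 # the unit and its first repetition
            keep.append(i)
    return keep


case = "case $p : $p = $p ; break ;"
units = ["switch ( $p ) {", case, case, case, case, case, "}"]   # a1 .. a7
print(units_to_insert(units))

# Without the optimization, a clone class of six portions yields 6C2 pairs.
print(math.comb(6, 2))
```

With the switch statement from the slide, the kept indexes correspond to a1, a2, a3, and a7, so a4 to a6 never enter the suffix tree.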


Token Concatenation Optimization

• Abutting tokens that are not punctuator keywords are joined together.

• The token sequence is made shorter in exchange for greater variation in what a token stands for.
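As an illustration of the trade-off, the sketch below joins each run of neighbouring non-punctuator tokens into a single token. The punctuator set and the function name `concatenate_tokens` are assumptions, and this is only one reading of "abutting tokens that are not punctuator keywords".

```python
from itertools import groupby

# Assumed punctuator set for a C-like language.
PUNCTUATORS = {"(", ")", "{", "}", "[", "]", ";", ",", ":", "=", "+", "-", "*", "/"}


def concatenate_tokens(tokens):
    """Join each run of adjacent non-punctuator tokens into one token."""
    joined = []
    for is_punct, group in groupby(tokens, key=lambda t: t in PUNCTUATORS):
        group = list(group)
        if is_punct:
            joined.extend(group)             # punctuators stay separate
        else:
            joined.append(" ".join(group))   # e.g. "static int $p"
    return joined


tokens = ["static", "int", "$p", "(", "int", "$p", ")", "{"]
shorter = concatenate_tokens(tokens)
print(shorter)
print(len(tokens), "tokens become", len(shorter))
```

The sequence gets shorter, but the set of distinct token values grows, since every distinct joined run now counts as its own token.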


Clones in the JDK 1.3.0 (>= 30 tokens)

[Figure: plot of clone pairs across the JDK source files, with regions labelled java/awt/*.java, javax/swing/*.java, and org/omg/CORBA/*.java.]

• 570k lines in 1877 files.
• CCFinder took 3 minutes on a PC.
• Files in the same directory are next to one another on the diagram axes.
• Most line segments look like dots because of the scale of the graph.
• Most cloning is near the main diagonal, which means most clones occur within a file or between neighbouring directories.


Similar source files in the JDK 1.3.0

• These section D files are identical apart from lines 32, 161, 163.


Longest clone in the JDK 1.3.0

• 1647 tokens, 627 lines

• WindowsFileChooserUI.java and MetalFileChooserUI.java each have nine internal classes, one constructor and 45 methods.

• All but three methods are clones.


Effects Of Rules And Preprocessing Techniques

• Disabling various rules and techniques has dramatic effects on the number of clone pairs and classes detected.


Population And Length Of Clone Classes JDK 1.3.0

[Figure: plot of clone classes with axes LEN (tokens) and POP.]


Clone Classes Of Top 5% DFL

• Source file investigation reveals various kinds of cloning:
– sequence of several methods
– single method body
– source files generated by tool
– routines within a method
– entire class body

• Evidence points to different kinds of copy-and-paste style reuse in the JDK.


POP And RAD In JDK 1.3.0

[Figure: POP and RAD of clone classes of over 20 transformed tokens; annotated groups: swing, exception classes (two groups).]


Conclusions

• Tools to detect clones are themselves complex pieces of software.

• Clone detection in CCFinder is sensitive to the rules, techniques, and clone threshold size employed.

• CCFinder has been successfully used to detect clones in the JDK 1.3.0.

• As software systems get even bigger, clone detection will play an increasingly important part in code reengineering.

(niðurstöður: Icelandic for conclusions)