GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis

Page 1

GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis

Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu

University of Illinois at Urbana-Champaign
IBM T.J. Watson Research Center

Presented by Chao Liu

Page 2

Motivations

- Blossoming of open-source projects
    - SourceForge.net: 125,090 projects as of July 2006
- Convenience for software plagiarism?
    - You can always find something online
- Core-part plagiarism
    - Ripping off GUIs and irrelevant parts
    - (Illegally) reusing the implementations of core algorithms
- Our goal
    - Efficient detection of core-part plagiarism

Page 3

Challenges

- Effectiveness
    - Professional plagiarists
    - Automated plagiarism
- Efficiency
    - Only a small part of the code is plagiarized; how to detect it efficiently?

Page 4

Outline

- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Page 5

Original Program

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

A procedure from a program called join

Page 6

Disguise 1: Format Alteration

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count; // initialization
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Insert comments and blanks

Page 7

Disguise 2: Identifier Renaming

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->nfields = num; // initialization
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      for (i = 0; i < num; i++){
        ...
      }
    }

Rename variables consistently

Page 8

Disguise 3: Statement Reordering

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      for (i = 0; i < num; i++){
        ...
      }
    }

Reorder non-dependent statements

Page 9

Disguise 4: Control Replacement

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        i++;
      }
    }

Use equivalent control structure

Page 10

Disguise 5: Code Insertion

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        for (int j = 0; j < i; j++);
        i++;
      }
    }

Insert immaterial code

Page 11

Fully Disguised

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Original Code

    static void
    fill_content(int num, struct line* fill)
    {
      (*fill).store.size = fill->store.length = num + 1;
      struct field *tabs;
      (*fill).fields = tabs = (struct field *) xmalloc (sizeof (struct field) * num);
      (*fill).store.buffer = (char*) xmalloc (fill->store.size);
      (*fill).ntabs = num;
      unsigned char *pb;
      pb = (unsigned char *) (*fill).store.buffer;
      int idx = 0;
      while(idx < num){ // fill in the storage
        ...
        for(int j = 0; j < idx; j++)
          ...
        idx++;
      }
    }

Plagiarized Code

Page 12

Outline

- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Page 13

Review of Plagiarism Detection

- String-based [Baker 1995]
    - A program is represented as a string
    - Blanks and comments are ignored
- AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
    - A program is represented as an Abstract Syntax Tree (AST)
    - Fragile to statement reordering, control replacement and code insertion
- Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
    - Variables of the same type are mapped to the same token
    - A program is represented as a token string
    - Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
    - Partially robust to statement reordering, control replacement and code insertion
    - Representatives: Moss and JPlag

Page 14

Outline

- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Page 15

Graphic representation of source code

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Page 16

Graphic representation of source code

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Page 17

Control Dependency

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Page 18

Data Dependency

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }
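The two kinds of dependence edge can be made concrete by writing some of them out as comments on the same sum/add example. This is only a sketch: the labels S1 to S6 are introduced here for illustration and are not the vertex numbering of any figure in the slides, the edge list is partial (declaration and parameter vertices are left out), and it follows the usual textbook notion of a program dependence graph rather than the exact output of any tool.

```c
/* Illustrative only: S1..S6 are hypothetical statement labels. */
int add(int a, int b)
{
    return a + b;
}

int sum(int array[], int count)
{
    int i, sum;
    sum = 0;                          /* S1 */
    for (i = 0;                       /* S2 */
         i < count;                   /* S3: loop predicate */
         i++) {                       /* S4 */
        sum = add(sum, array[i]);     /* S5 */
    }
    return sum;                       /* S6 */
}

/*
 * Control dependence (whether a statement runs depends on a predicate):
 *   S3 -> S5, S3 -> S4   the body and the increment run only while i < count
 *
 * Data dependence (a statement reads a value that another statement wrote):
 *   S2 -> S3, S2 -> S4, S2 -> S5   i = 0 reaches the test, i++ and array[i]
 *   S4 -> S3, S4 -> S4, S4 -> S5   i++ feeds the next iteration
 *   S1 -> S5, S1 -> S6             sum = 0 reaches the call and the return
 *   S5 -> S5, S5 -> S6             the updated sum feeds the next call and the return
 */
```

Swapping S1 and S2 changes none of these edges, which is why statement reordering leaves the graph unchanged.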

Page 19

Plagiarism Detectable?

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Original Code

    static void
    fill_content(int num, struct line* fill)
    {
      (*fill).store.size = fill->store.length = num + 1;
      struct field *tabs;
      (*fill).fields = tabs = (struct field *) xmalloc (sizeof (struct field) * num);
      (*fill).store.buffer = (char*) xmalloc (fill->store.size);
      (*fill).ntabs = num;
      unsigned char *pb;
      pb = (unsigned char *) (*fill).store.buffer;
      int idx = 0;
      while(idx < num){ // fill in the storage
        ...
        for(int j = 0; j < idx; j++)
          ...
        idx++;
      }
    }

Plagiarized Code

Page 20

Corresponding PDGs

[Figure: PDG for the Original Code (16 vertices, numbered 0-15) and PDG for the Plagiarized Code (20 vertices, numbered 0-19). Each vertex is labeled with its kind and statement, e.g. "decl., int i", "assign, i = 0", "inc., i++", "control, i < count" and "call-site, xmalloc()". The plagiarized PDG has four extra vertices, 16-19, for the inserted loop over j.]

Page 21

PDG-based Plagiarism Detection

- A program is represented as a set of PDGs
    - Let g be the PDG of procedure P in the original program
    - Let g' be the PDG of procedure P' in the plagiarism suspect
- Subgraph isomorphism implies plagiarism
    - If g is subgraph isomorphic to g', P' is likely plagiarized from P
    - γ-isomorphism: graph g is γ-isomorphic to g' if there exists a subgraph s of g such that s is subgraph isomorphic to g' and |s| ≥ γ|g|.
    - If g is γ-isomorphic to g', the PDG pair (g, g') is regarded as a plagiarized PDG pair and is returned to a human for examination.
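The γ-isomorphism definition can be checked by brute force on very small graphs, which may help in reading it. The sketch below is an illustration under several assumptions, not the GPLAG implementation: the `Pdg` struct and function names are hypothetical, |g| is taken to be the number of vertices, s ranges over vertex-induced subgraphs of g, and "subgraph isomorphic" is read as an injective mapping that preserves vertex kinds and edges. It is exponential in |g| and only usable for toy inputs.

```c
#include <stdbool.h>

#define MAXV 16  /* toy sizes only: the search below is brute force */

typedef struct {
    int n;                    /* number of vertices */
    int kind[MAXV];           /* vertex kind, e.g. 0 = decl, 1 = assign */
    bool edge[MAXV][MAXV];    /* edge[u][v]: dependence edge u -> v */
} Pdg;

/* Extend an injective, kind- and edge-preserving mapping of the vertices
 * of g selected by mask into h, starting at vertex u. */
static bool extend(const Pdg *g, unsigned mask, const Pdg *h,
                   int u, int map[], bool used[])
{
    while (u < g->n && !(mask & (1u << u)))
        u++;                              /* skip vertices outside s */
    if (u == g->n)
        return true;                      /* every vertex of s is mapped */
    for (int v = 0; v < h->n; v++) {
        if (used[v] || g->kind[u] != h->kind[v])
            continue;
        bool ok = !g->edge[u][u] || h->edge[v][v];
        for (int w = 0; w < u && ok; w++) {
            if (!(mask & (1u << w)))
                continue;                 /* w is not part of s */
            if (g->edge[u][w] && !h->edge[v][map[w]]) ok = false;
            if (g->edge[w][u] && !h->edge[map[w]][v]) ok = false;
        }
        if (!ok)
            continue;
        map[u] = v;
        used[v] = true;
        if (extend(g, mask, h, u + 1, map, used))
            return true;
        used[v] = false;                  /* backtrack */
    }
    return false;
}

static int count_bits(unsigned x)
{
    int c = 0;
    for (; x; x >>= 1)
        c += (int)(x & 1u);
    return c;
}

/* True if some subgraph s of g with |s| >= gamma * |g| maps into h. */
bool gamma_isomorphic(const Pdg *g, const Pdg *h, double gamma)
{
    for (unsigned mask = 1; mask < (1u << g->n); mask++) {
        if (count_bits(mask) < gamma * g->n)
            continue;                     /* s would be too small */
        int map[MAXV];
        bool used[MAXV] = { false };
        if (extend(g, mask, h, 0, map, used))
            return true;
    }
    return false;
}
```

With γ = 1 this reduces to plain subgraph isomorphism; a smaller γ tolerates a plagiarist deleting or rewriting a fraction of the original vertices.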

Page 22

Advantages

- Robust because it is hard to overhaul PDGs
    - Dependencies encode program logic
    - Incentive of plagiarism

Page 23

Outline

- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Page 24

Efficiency and Scalability

- Search space
    - If the original program has n procedures and the plagiarism suspect has m procedures, n*m subgraph isomorphism tests are needed
- Pruning the search space
    - Lossless filter
    - Statistical lossy filter

Page 25

Lossless filter

- Interestingness
    - PDGs smaller than an interesting size K are excluded from both sides
- γ-isomorphism definition
    - A PDG pair (g, g') is discarded if |g'| < γ|g|.
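The two lossless rules amount to a pair of size comparisons, sketched below. The second rule loses nothing because any s that is subgraph isomorphic to g' has |s| ≤ |g'|, so |g'| < γ|g| rules out |s| ≥ γ|g|. The `PdgInfo` struct and function names are hypothetical, and |g| is assumed to be a vertex count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one PDG: only its size matters to this filter. */
typedef struct {
    const char *procedure;   /* procedure name, for reporting */
    int size;                /* |g|, taken here as the number of vertices */
} PdgInfo;

/* True if the pair (g, g2) survives both lossless rules. */
bool passes_lossless_filter(const PdgInfo *g, const PdgInfo *g2,
                            int K, double gamma)
{
    if (g->size < K || g2->size < K)
        return false;                 /* uninteresting: too small */
    if (g2->size < gamma * g->size)
        return false;                 /* g2 cannot hold a large enough s */
    return true;
}

/* Count the pairs still to be tested out of the n * m candidates. */
size_t count_surviving_pairs(const PdgInfo *orig, size_t n,
                             const PdgInfo *suspect, size_t m,
                             int K, double gamma)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            if (passes_lossless_filter(&orig[i], &suspect[j], K, gamma))
                kept++;
    return kept;
}
```

The values of K and γ are left to the caller; the slides do not state which values were used.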

Page 26

Lossy Filter

- Observation
    - If procedure P' is plagiarized from procedure P, its PDG g' should look similar to g.
    - So discard dissimilar PDG pairs
- Requirement
    - This filter must be lightweight

Page 27

Vertex Histogram

- Represent PDG g by h(g) = (n1, n2, …, nk), where ni is the frequency of the i-th kind of vertex.
- Similarly, represent PDG g' by h(g') = (m1, m2, …, mk).
- Direct similarity measurement?
    - How to define a proper similarity threshold?
    - Is a threshold defined this way program-independent?
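Computing h(g) is a single pass over the vertices, as the sketch below shows. The `VertexKind` enum is an assumption made for illustration: its five kinds loosely follow the vertex labels in the "Corresponding PDGs" figure, and the real set of kinds used by GPLAG may be larger.

```c
#include <stddef.h>

/* Hypothetical vertex kinds, loosely following the PDG figure labels. */
enum VertexKind {
    KIND_DECL,
    KIND_ASSIGN,
    KIND_INC,
    KIND_CONTROL,
    KIND_CALL_SITE,
    KIND_COUNT          /* number of kinds, i.e. k */
};

/* Fill hist[0..KIND_COUNT-1] with the number of vertices of each kind. */
void vertex_histogram(const enum VertexKind *kinds, size_t n_vertices,
                      int hist[KIND_COUNT])
{
    for (int i = 0; i < KIND_COUNT; i++)
        hist[i] = 0;
    for (size_t v = 0; v < n_vertices; v++)
        hist[kinds[v]]++;
}
```

Counting the vertex labels of the original-code PDG in this order of kinds, as read from the figure, gives h(g) = (5, 7, 1, 1, 2). The plagiarized PDG adds one vertex of each of the first four kinds for the loop over j, giving h(g') = (6, 8, 2, 2, 2).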

Page 28

Hypothesis Testing-based Approach

- Basic idea
    - Estimate a k-dimensional multinomial distribution from h(g)
    - Test whether h(g') is likely an observation from that distribution
    - If it is, g' looks similar to g, and an isomorphism test is needed.
    - Otherwise, (g, g') is discarded
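One way to realize this idea is a goodness-of-fit test on the two histograms. The transcript does not preserve the technical details of the test GPLAG uses, so the sketch below substitutes Pearson's chi-square statistic as a stand-in. The function names, the add-one smoothing, and the caller-supplied critical value are all assumptions of this sketch.

```c
#include <stdbool.h>

/* Pearson chi-square statistic of the observed counts m[0..k-1] against
 * the multinomial estimated from n[0..k-1]. Add-one smoothing (an
 * assumption of this sketch) keeps every estimated probability positive. */
double chi_square_stat(const int n[], const int m[], int k)
{
    double n_total = 0.0, m_total = 0.0;
    for (int i = 0; i < k; i++) {
        n_total += n[i] + 1.0;
        m_total += m[i];
    }
    if (m_total == 0.0)
        return 0.0;                        /* nothing observed, nothing to test */

    double stat = 0.0;
    for (int i = 0; i < k; i++) {
        double p = (n[i] + 1.0) / n_total; /* estimated P(kind i) from h(g) */
        double expected = m_total * p;     /* expected count of kind i in g' */
        double diff = m[i] - expected;
        stat += diff * diff / expected;
    }
    return stat;
}

/* Keep the pair for isomorphism testing unless the test rejects it. */
bool passes_lossy_filter(const int n[], const int m[], int k, double critical)
{
    return chi_square_stat(n, m, k) <= critical;
}
```

The critical value would come from a chi-square table with k - 1 degrees of freedom, for example 9.488 for k = 5 at the 0.05 level. Because the threshold is a significance level rather than a raw distance, it does not depend on the sizes of the particular programs, which addresses the threshold question raised on the Vertex Histogram slide.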

Page 29

Technical Details

Page 30

Technical Details (cont’d)

Page 31

Work-flow of GPLAG

- PDGs are generated with CodeSurfer
- Isomorphism testing is implemented with VFLib.

Page 32

Outline

- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Page 33

Experiment Design

- Subject programs
- Effectiveness
- Filter efficiency
- Core-part plagiarism detection

Page 34

Effectiveness

- 2 hours of manual plagiarism, but can it be automated?
- GPLAG detects all plagiarized PDG pairs within 1 second
- PDG isomorphism also reveals which plagiarism disguises were applied

Page 35

Efficiency

- Subject programs
    - bc, less and tar.
    - Exact copy as plagiarism.
- Lossless and lossy filter
    - Pruning PDG pairs.
    - Implication for overall time cost.

Page 36

Pruning Uninteresting PDG-pairs

[Figure panels: "Lossless only" and "Lossless and lossy"]

Page 37

Implication for Overall Time Cost

- A time-out is set for subgraph isomorphism testing; tests that hit it are time hogs.
- Lossless filter does not save much time.
- Lossy filter significantly reduces the time cost.
- Major time saving comes from the avoidance of time hogs.

Page 38

Detection of Core-part Plagiarism

- Lower time cost with lossy filter.
- Fewer false positives with lossy filter.

Page 39

Outline

- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Page 40

Conclusions

- We developed a new algorithm, GPLAG, for software plagiarism detection
    - It is more effective against "professional" plagiarists
- We developed a statistical lossy filter, which improves the efficiency of GPLAG
- We experimentally verified the effectiveness and efficiency of GPLAG

Page 41

Q & A

Thank You!

Page 42

References

[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.

[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.

[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.

[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.

[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.

[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.

[7] V. B. Livshits and T. Zimmermann. DynaMine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.

[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.

[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for "backtrace" of noncrashing bugs. In SDM, 2005.