Plagiarism Monitoring and Detection -- Towards an Open Discussion

Edward L. Jones
Computer Information Sciences
Florida A & M University
Tallahassee, Florida



Outline

What is Plagiarism, and Why Address It
Plagiarism Detection & Countermeasures
A Metrics-Based Detection Approach
Extending the Approach
Conclusions & Future Work

Why Tackle Plagiarism?

Plagiarism undermines educational objectives

Failure to address it sends the wrong message

A non-contrived ethical issue in computing

Plagiarism is hard to define

Plagiarism is costly to pursue/prosecute

An interesting problem for tinkering

What is Plagiarism?

“use of another’s ideas, writings or inventions as one’s own” (Oxford American Dictionary, 1980)

Shades of Gray

– Theft of work

– Gift of work

– Collusion

– Collaboration

– Coincidence

Intent to Deceive

How is it Detected?

By chance

– Anomalies

– Temporal proximity when grading

Automation methods

– Direct text comparison (Unix diff)

– Lexical pattern recognition

– Structural pattern recognition

– Numeric profiling
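As a minimal sketch of the direct text comparison idea, Python's standard difflib can score how similar two submissions are by comparing line sequences, much like Unix diff; the sample programs here are illustrative, not from the talk.

```python
import difflib

def text_similarity(program_a: str, program_b: str) -> float:
    """Return a 0..1 similarity ratio between two source texts,
    comparing their line sequences much as Unix diff does."""
    matcher = difflib.SequenceMatcher(None,
                                      program_a.splitlines(),
                                      program_b.splitlines())
    return matcher.ratio()

original = "x = 1\ny = 2\nprint(x + y)\n"
copied   = "x = 1\ny = 2\nprint(x + y)\n"
reworked = "a = 10\nb = 20\nc = a * b\nprint(c)\n"

print(text_similarity(original, copied))    # identical submissions score 1.0
print(text_similarity(original, reworked))  # unrelated submissions score near 0
```

This whole-line comparison is exactly what the concealment tactics below defeat: renaming one identifier changes every line it appears on.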

Plagiarism Concealment Tactics

None

Change comments

Change formatting

Rename identifiers

Change data types

Reorder blocks

Reorder statements

Reorder expressions

Superfluous code

Alternative control structures

Prosecution -- DA in the House?

Course syllabus broaches the subject

– Concrete definition generally lacking

– Sense of “we’ll know it when we see it”

N? Tolerance Policy

Investigation Stage

Prosecution Stage

Missed opportunity to teach?

An Awareness Approach

Monitor closeness of student programs

– Objective measures

– Automated

Post anonymous closeness results in public

– Nonconfrontational awareness

– “A word to the wise … “

Benchmark student behavior

– Establishing thresholds

– Effects of course, language

Closeness Measures -- Physical

Program 1: (lines1, words1, characters1)
Program 2: (lines2, words2, characters2)

Euclidean Distance
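Written out for concreteness, the physical closeness between two programs is the standard Euclidean distance over the three counts (the Closeness Computation slide later applies this to normalized vectors):

```latex
d(P_1, P_2) = \sqrt{(\mathit{lines}_1 - \mathit{lines}_2)^2
            + (\mathit{words}_1 - \mathit{words}_2)^2
            + (\mathit{chars}_1 - \mathit{chars}_2)^2}
```

The Halstead measure on the next slide has the same form, with (length, vocabulary, volume) in place of (lines, words, characters).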

Closeness Measures -- Halstead

Program 1: (length1, vocabulary1, volume1)
Program 2: (length2, vocabulary2, volume2)

Euclidean Distance

Comparison of Measures

Physical profile ==> weight test

– Simple/cheap to compute (Unix wc command)

– Sensitive to character variations

Halstead profile ==> content test

– More complex/expensive to compute

– Ignores comments and white space

– Sensitive only to changes in program content

Detection effectiveness vs. plagiarism tactic
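The physical ("weight") profile really is cheap to gather. A minimal Python sketch of the wc-style counts and the raw Euclidean distance between two profiles (normalization, covered on the next slide, is omitted here; the sample programs are mine):

```python
import math

def physical_profile(source: str) -> tuple[int, int, int]:
    """wc-style physical profile: (lines, words, characters)."""
    return (len(source.splitlines()), len(source.split()), len(source))

def closeness(p: tuple[int, int, int], q: tuple[int, int, int]) -> float:
    """Euclidean distance between two profiles; smaller means closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

prog1 = "int main() {\n    return 0;\n}\n"
prog2 = "int main() {\n    return 0 ;\n}\n"   # one extra space inserted

print(physical_profile(prog1))
print(closeness(physical_profile(prog1), physical_profile(prog2)))
```

Note how a single inserted space already moves the physical distance away from zero; this is the character-level sensitivity the slide mentions, which the Halstead profile deliberately ignores.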

Closeness Computation

Normalization

– Establish upper bound for comparison (1.414)

– Distance computed on normalized (unit) vectors

Normalization I -- Self normalization

– p = (a, b, c) ==> (a/L, b/L, c/L)

– Largest component dominates

Normalization II -- Global scaling

– p = (a, b, c) ==> q = (a/aMAX, b/bMAX, c/cMAX)

– Self normalization applied to q
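A sketch of the two normalization schemes above (function and variable names are mine). Since profile components are non-negative counts, two unit vectors can be at most orthogonal, which gives the sqrt(2) ~ 1.414 upper bound:

```python
import math

def self_normalize(p):
    """Normalization I: scale the profile to a unit vector, so the
    distance between any two profiles is bounded by sqrt(2) ~ 1.414."""
    length = math.sqrt(sum(x * x for x in p))
    return tuple(x / length for x in p)

def global_scale(p, maxima):
    """Normalization II: divide each component by its maximum over all
    submissions, then self-normalize the rescaled vector q."""
    q = tuple(x / m for x, m in zip(p, maxima))
    return self_normalize(q)

p = (120, 450, 3200)        # e.g. (lines, words, characters)
maxima = (200, 900, 8000)   # component-wise maxima over the class

print(self_normalize(p))
print(global_scale(p, maxima))
```

Global scaling keeps the largest raw component (typically the character or volume count) from dominating the distance, which is the weakness of self-normalization noted above.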

Distribution of Closeness Values

[Figure 2. Distribution of Halstead Closeness: closeness measure (0.00000 to 0.04500) plotted against roughly 500 student program pairs.]

Comparison of Profiles

Closeness Distribution

Closeness values vary by assignment

Programming language may lead to clustering at the lower end of the spectrum

Reuse of modules leads to clustering at the lower end of the spectrum

No a priori threshold pin-points plagiarism

All measures exhibit these behaviors

Suspect Identification

Collaboration Suspects (5th Percentile)

Rank   Closeness    Student 1   Student 2
  1    0.00000000   alpha       alpha
  2    0.00000652   alpha       beta
  3    0.00026963   beta        gamma
  4    0.00026981   alpha       gamma
  5    0.00031262   gamma       epsilon
  6    0.00048815   sigma       delta
  7    0.00049825   alpha       epsilon
  8    0.00050169   beta        epsilon
  9    0.00066481   gamma       theta
 10    0.00073158   beta        theta

Independence Index

Student Independence Indices

Index   Student
  1     alpha
  2     beta
  3     gamma
  5     epsilon
  6     sigma
  6     delta
  9     theta

Index = position at which the student debuts on the Closeness List
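The independence index falls straight out of the ranked closeness list; this sketch reproduces the table above from the suspect pairs on the previous slide (pair data taken from the deck, function name is mine):

```python
def independence_indices(ranked_pairs):
    """Map each student to the rank at which they first appear
    in the closeness-ordered list of program pairs."""
    index = {}
    for rank, (s1, s2) in enumerate(ranked_pairs, start=1):
        for student in (s1, s2):
            index.setdefault(student, rank)
    return index

# Pairs in increasing closeness order, from the Suspect Identification slide.
pairs = [("alpha", "alpha"), ("alpha", "beta"), ("beta", "gamma"),
         ("alpha", "gamma"), ("gamma", "epsilon"), ("sigma", "delta"),
         ("alpha", "epsilon"), ("beta", "epsilon"), ("gamma", "theta"),
         ("beta", "theta")]

print(independence_indices(pairs))
```

A low index means the student shows up among the closest pairs early and often; a high index (theta at 9) suggests more independent work.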

Preponderance of Evidence

Historical Record of Student Behavior

– Collaboration/partnering

– Independence indices

Profile and analyze other artifacts

– Compilation logs

– Execution logs

Another Approach

Make student demonstrate familiarity with submitted program

– Seed errors into program

– Time limit for removing error and resubmitting

Holistic approach

– Intentional, not accidental

Conclusions

We can do something about plagiarism -- the first step is to develop eyes and ears

Simple metrics appear to be adequate

Tools are essential

Sophistication is not as necessary as automation

Students are curious to know how they compare with other students

On-Going & Future Work

Complete the toolset
– Student Independence Index

Incorporate other artifacts
– Compilation logs
– Execution logs

Integrate into Automated Grading

Disseminate Results
– Package tool as shareware

Questions?


Thank You

Flow Chart

Student Programs → Profile → Compute Closeness → Suspicious Programs