Plagiarism Workshop
Mike Joy
University of Bath, 29 February 2012
Emergency exits
Fire alarm
Toilets
Certificates of attendance
Administrative Issues
1.30 Introduction
1.50 What is plagiarism?
2.00 Our experiences
2.20 Text plagiarism
2.30 Computing and mathematics
2.50 Why do students plagiarise?
3.00 How do students plagiarise?
3.15 Break
3.30 Detection strategies and tools
3.45 Prevention strategies and university process
4.00 Discussion and conclusion
4.15 End
Timetable
What is Plagiarism?
“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft” (OED Online, 2012)
“To commit literary theft; to present as new and original an idea or product derived from an existing source” (Merriam-Webster Online, 2012)
These definitions are open to interpretation.
What about equations, computer programs, etc.?
“Academic integrity”
Definitions
Not all cheating is plagiarism.
For example, taking crib-sheets into an exam.
What about “contract cheating”, where a student pays someone else to write an assignment for them?
We adopt a broad interpretation of “plagiarism” (otherwise we may miss important types of cheating which are appropriate for us to cover here).
Plagiarism vs. Cheating
Cheating is potentially illegal.
Not fair on the other students.
Compromises the learning process.
Wastes time
— Staff time!
— Paperwork, disciplinary process
We are required to deal with it!
— QAA Quality Code (B6)
Why is this Important?
“If you go to the bar at lunchtime you can buy a solution to any of our programming assignments. I reckon the incidence of plagiarism is over 50%” (source wishes to remain anonymous, dated 1999).
Around 5% in programming assignments at Warwick University (from detailed analyses of first-year programming assignments, 2002–2004).
Documented cases (90 UK HEIs, all subjects) – 0.72% (source: AMBeR Project Report 2008).
How big a Problem is Plagiarism?
Detection is fun.
Algorithms can be applied to the detection process (so Computer Scientists can apply their skills).
Getting involved gives us insights into how students are conducting their studies.
Why is this Interesting?
Our experiences
Rainbow Lorikeet, by René Modery, 2006
Basic Theory
Foundations of the Louvre, photo by Ceronne, 2006
Students must know and understand (clear University policy).
Detection must happen (the more the better!).
Due process (punishment).
Thus … four stages:
Collection, Detection, Confirmation, Investigation
(Culwin and Lancaster, 2002).
Four Stages
Get all documents together online:
– So they can be processed;
– Document formats need to be considered;
– Security is an issue.
Coursemaster (Nottingham)
BOSS (Warwick)
Managed Learning Environment (Blackboard, Moodle)
Stage 1: Collection
(1) Compare with other submissions (“intra-corpal”)
(2) Compare with external documents (“extra-corpal”)
– essay-based assignments, can use Turnitin
– program code and equations may be a problem
(1) is (relatively) easy (can even be done by hand), but
(2) is a big problem.
Stage 2: Detection
Software tool says “A and B similar”.
Are they?
Never rely on a computer program!
Requires expert human judgement.
Evidence must be compelling.
Might go to court.
Stage 3: Confirmation
A from B, or B from A, or joint work?
If A from B, did B know?
– Open networked file?
– Printer output?
Did the culprit/s understand?
University processes must be followed:
– No shortcuts!
Stage 4: Investigation
Text Plagiarism
“Portrait of a Scribe” by Bartolomeo Passerotti (1529-1592)
Essay time …
Funded mainly by subscriptions from institutions.
Cache of
– the Internet
– all documents submitted to it
– anything else it can find!
Compares text of documents submitted to it using a string-matching algorithm.
Turnitin® UK
Can be used by academics to
– detect plagiarism
– provide evidence
Can be used by students to
– check their own work
Pedagogy
Turnitin (1)
Turnitin (2)
Turnitin (3)
Advantages
– Reasonably accurate
– Ease and speed of use
– Printed reports
– Comprehensive datastore
– Most formats
– Management tool
Disadvantages
– Algorithm can be fooled
– English only
– Quotes and references are poorly handled
– “False sense of security”
Algorithm and Functionality
Computing and Mathematics
A PowerMac G4 ("Mirrored Drive Doors" model) with open case showing the logic board. Photo by Alistair McMillan, 2006.
Discipline specific:
Program code
Diagrams (UML, flowcharts, etc.)
Lab reports
Images (graphics, image processing)
Computing
Discipline specific:
Equations
Theorems and proofs
Statistical analyses
MATLAB programs
Mathematics
It won’t work!
– String matching algorithm inappropriate
– Database does not contain (much) code
Commercial products exist, for example
– Black Duck Software
– Similix Corporation
Why not use Turnitin?
/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}
/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}
/* Programs 1 and 2 */
Program 3
(Source code for MS Windows 7)
Program 4
(code 50% identical to the source code for MS Windows 7)
/* Programs 3 and 4 */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */

    public static void main(String[] args) {
        calculate();
    }

    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    }

    public static void calculate() { /* ... */ }
    /* ... */
/* Program 5 */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */

    public static void main(String[] args) {
        findSunsetTime();
    }

    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    }

    public static void findSunsetTime() { /* ... */ }
    /* ... */
/* Program 6 */
Apart from source-code re-use, need to think about:
Use of (object-oriented) templates
Converting code to a different language
Code-generator software
Getting source-code written by someone else
What constitutes minimal / moderate / extreme plagiarism?
What is Source-Code Plagiarism?
“Open Source” code
Translation between languages
Re-use of code from previous assignments
Placing references within technical documentation (comments)
What do Students Misunderstand?
Common equations such as E = mc² don’t need referencing.
Probably most others do.
Are there any “grey areas”?
Mathematical Equations
Why do Students Plagiarise?
Why, Arizona, by Ken Lund, 2010.
Money
Career advancement
Company advancement
Tight deadlines
Poor ethics
What about academics?
Digression – Industry
Weak students
Lazy students
Students with poor time management skills
Overworked students
Peer pressure
Cultural factors
Lack of understanding
“Bad, sad or mad” (Culwin, 2006).
Students
How do Students Plagiarise?
Tiles on LaSalle Street, New Orleans, by Infrogmation, 2009.
Friends
Lecturer’s notes
Seeing what other students are doing
Textbooks
Code repositories
Forums
Cheat sites
Where to Find Information
‘Rent-A-Coder’
Low rates ($10) – so quality of code?
Plagiarism by hired coders?
Private Internet sites make search engines ineffective.
Use of mobile devices and IM tools makes tracing difficult.
Contract Cheating
Break
Photo by Vanderdecken, 2007.
Detection Strategies
Sherlock Holmes and John H. Watson, by Sidney Paget (1860-1908)
Google search on phrases
Abnormal style
Unusual phrases or spellings (incl. in program comments)
Unusual algorithm used by a program
Unusual formatting
– Fonts, indentation (wordprocessor)
– Brace style (etc.) (program)
Tricks of the Trade
Detection Tools
Photo by Wolfen Silva, 2004.
Attribute counting systems (Halstead, 1972; Ottenstein, 1976):
Numbers of unique operators
Numbers of unique operands
Total numbers of operator occurrences
Total numbers of operand occurrences
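A minimal sketch of attribute counting, assuming a crude regular-expression lexer in which identifiers and numbers are operands and symbols plus a few keywords are operators. The class name `AttributeCounter`, the keyword set and the token pattern are illustrative choices, not taken from Halstead or Ottenstein.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: crude Halstead-style attribute counts for a code fragment. */
class AttributeCounter {
    // Assumption: identifiers and numbers are operands; symbols and listed keywords are operators.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|[-+*/%=<>!&|]+|[(){}\\[\\];,.]");
    private static final Set<String> KEYWORDS =
        Set.of("int", "float", "double", "for", "if", "return", "while");

    /** Returns {unique operators, unique operands, total operators, total operands}. */
    static int[] count(String source) {
        Set<String> operators = new HashSet<>();
        Set<String> operands = new HashSet<>();
        int totalOperators = 0;
        int totalOperands = 0;
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String tok = m.group();
            char first = tok.charAt(0);
            boolean wordLike = Character.isLetterOrDigit(first) || first == '_';
            if (wordLike && !KEYWORDS.contains(tok)) {
                operands.add(tok);
                totalOperands++;
            } else {
                operators.add(tok);
                totalOperators++;
            }
        }
        return new int[] {operators.size(), operands.size(), totalOperators, totalOperands};
    }

    public static void main(String[] args) {
        int[] original = count("int ans = 0; ans = ans + 1;");
        int[] renamed = count("int total = 0; total = total + 1;");
        System.out.println(Arrays.toString(original));
        System.out.println("Same counts after renaming: " + Arrays.equals(original, renamed));
    }
}
```

Renaming a variable leaves all four counts unchanged, which is what such systems rely on. Unrelated programs can also share the same counts, so the numbers are a weak signal on their own.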
History (1)
Structure-based systems:
Each program is converted into token strings (or something similar)
Token streams are compared to find similar source-code fragments
Tools: YAP3, JPlag, Plague, GPlag, XPlag, Plaggie, MOSS, Sherlock, Jones, Cogger, SID, SIM, …
History (2)
CodeSuite (www.safe-corp.biz)
– Exact algorithm not published
– Patents apply
MossPlus (www.similix.com)
– Commercial version of MOSS
– “multi-million dollar copyright and criminal theft cases”
– Patents apply
Commercial Products (examples)
int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}

Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}
Example (Tokenwise Equivalence)
type name(type name) start
    type name=number
    loop (type name=number name compare number operation name) start
        name operation name
    end
    return name
end
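A minimal sketch of how source code might be reduced to a generic token stream like the one above. The class `TokenStream`, its small set of type names and its token vocabulary are assumptions made for illustration; real tools use a proper lexer or parser for each language, and this sketch treats unknown keywords such as `if` as plain names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokeniser: replaces concrete names, types and literals with generic tokens. */
class TokenStream {
    private static final Pattern LEXEME = Pattern.compile(
        "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?[fFdDlL]?|[<>!=]=|[<>]|[-+*/]=|\\+\\+|--|[-+*/=]|[(){};]");
    // Assumption: a tiny fixed set of type names is enough for the illustration.
    private static final Set<String> TYPES = Set.of("int", "float", "double", "String", "Integer");

    static List<String> tokenise(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String s = m.group();
            char first = s.charAt(0);
            if (TYPES.contains(s)) out.add("type");
            else if (s.equals("for") || s.equals("while")) out.add("loop");
            else if (s.equals("return")) out.add("return");
            else if (Character.isLetter(first) || first == '_') out.add("name");
            else if (Character.isDigit(first)) out.add("number");
            else if (s.equals("{")) out.add("start");
            else if (s.equals("}")) out.add("end");
            else if (s.matches("[<>!=]=|[<>]")) out.add("compare");
            else if (s.equals("=") || s.equals("(") || s.equals(")")) out.add(s);
            else if (s.equals(";")) { /* statement separators are dropped */ }
            else out.add("operation");
        }
        return out;
    }

    public static void main(String[] args) {
        String first = "int calculate(String arg) { int ans=0;"
            + " for (int j=1; j<=100; j++) { ans *= j; } return ans; }";
        // A renamed, retyped variant that counts down instead of up.
        String second = "float doit(String v) { float result=0.0f;"
            + " for (float f=100.0f; f>0.0f; f--) { result *= f; } return result; }";
        System.out.println(String.join(" ", tokenise(first)));
        System.out.println("Tokenwise equal: " + tokenise(first).equals(tokenise(second)));
    }
}
```

Renaming identifiers, changing numeric types and reversing the loop direction all leave the token stream unchanged. Adding or removing optional braces does change it, which is one reason tools compare streams for similar fragments rather than testing for exact equality.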
Have a look at the program you have been given.
Can you spot the plagiarised bits?
How much is plagiarised?
What techniques have been used?
Intermission …
Guido Malpohl, Karlsruhe, Germany
Code fragment similarity values based on similar tokens found
Java, C#, C, C++, Scheme, and natural language text
Web-based: www.ipd.uni-karlsruhe.de/jplag
Algorithm: parse and tokenise the programs, then compare them pairwise using “Greedy String Tiling” (Prechelt et al., 2002)
maximises percentage of common token strings
worst case Θ(n³), average case linear
Programs must compile?
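A simplified sketch of the greedy tiling idea, assuming token lists as input. It is not JPlag's implementation: it marks one longest match per pass and uses a brute-force search, whereas the published algorithm marks all maximal matches of the current length and uses hashing to speed up the search. The class name `GreedyTiling` and the parameter `minMatch` are illustrative.

```java
import java.util.List;

/** Simplified Greedy String Tiling over token lists (a teaching sketch only). */
class GreedyTiling {
    /** Number of tokens in each list covered by tiles of length >= minMatch. */
    static int coveredTokens(List<String> a, List<String> b, int minMatch) {
        boolean[] markedA = new boolean[a.size()];
        boolean[] markedB = new boolean[b.size()];
        int covered = 0;
        while (true) {
            int best = 0, bestI = -1, bestJ = -1;
            // Find the longest common run of tokens not yet inside a tile.
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int k = 0;
                    while (i + k < a.size() && j + k < b.size()
                            && !markedA[i + k] && !markedB[j + k]
                            && a.get(i + k).equals(b.get(j + k))) {
                        k++;
                    }
                    if (k > best) { best = k; bestI = i; bestJ = j; }
                }
            }
            if (best == 0 || best < minMatch) break;
            // Mark the run as a tile so its tokens cannot be matched again.
            for (int k = 0; k < best; k++) {
                markedA[bestI + k] = true;
                markedB[bestJ + k] = true;
            }
            covered += best;
        }
        return covered;
    }

    /** Similarity in [0, 1]: share of all tokens that lie inside some tile. */
    static double similarity(List<String> a, List<String> b, int minMatch) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        return 2.0 * coveredTokens(a, b, minMatch) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        List<String> original = List.of("a", "b", "c", "d", "e");
        List<String> reordered = List.of("d", "e", "x", "a", "b", "c");
        System.out.println(similarity(original, reordered, 2));
    }
}
```

Because tiles may appear in any order, swapping two blocks of code does not hide the match. The minimum match length keeps short accidental matches from inflating the score.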
JPlag
JPlag Example
Alex Aiken, Berkeley/Stanford, USA, 1994
Multilingual: C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs
Web-based: theory.stanford.edu/~aiken/moss/
“Winnowing” (Schleimer et al., 2003)
Local document fingerprinting algorithm
Efficiency proven (within 33% of the lower bound)
Guarantees detection of matches longer than a certain threshold
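A minimal sketch of winnowing, assuming character k-grams hashed with `String.hashCode()` and a window of `w` consecutive hashes. The class name `Winnowing` and both parameters are illustrative; the published algorithm also records the position of each selected hash and uses a rolling hash, which this sketch omits.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of winnowing: keep the minimum k-gram hash from every window of w hashes. */
class Winnowing {
    static Set<Integer> fingerprint(String text, int k, int w) {
        if (k < 1 || w < 1) throw new IllegalArgumentException("k and w must be positive");
        Set<Integer> selected = new LinkedHashSet<>();
        int n = text.length() - k + 1;      // number of k-grams
        if (n <= 0) return selected;        // text shorter than one k-gram
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        int window = Math.min(w, n);        // a short text forms a single window
        for (int start = 0; start + window <= n; start++) {
            int min = hashes[start];
            for (int j = start + 1; j < start + window; j++) {
                min = Math.min(min, hashes[j]);
            }
            selected.add(min);              // the window's smallest hash is a fingerprint
        }
        return selected;
    }

    public static void main(String[] args) {
        Set<Integer> a = fingerprint("xxxx the quick brown fox yyyy", 5, 4);
        Set<Integer> b = fingerprint("zz the quick brown fox qq", 5, 4);
        a.retainAll(b);                     // fingerprints present in both texts
        System.out.println("Shared fingerprints: " + a.size());
    }
}
```

Any shared substring of at least w + k − 1 characters contains a complete window of identical hashes in both texts, so both select the same minimum. That is the threshold guarantee: with k = 5 and w = 4, every common passage of 8 or more characters produces at least one shared fingerprint.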
MOSS
MOSS Example
University of Warwick
Open Source – sherlock.org.uk
Multilingual (including natural language), but works best on Java
Preprocesses code (not a full parse!) then simple string comparison. Preprocessing includes:
– Remove comments
– Remove whitespace
– Normalise formatting/indentation
– Tokenise
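A minimal sketch of this kind of preprocessing followed by a simple string comparison. It is not Sherlock's own code: the class `Normaliser` and its regular expressions are assumptions, and the naive patterns would also strip comment markers that appear inside string literals.

```java
/** Hypothetical normaliser: strips comments and whitespace before comparing source text. */
class Normaliser {
    static String normalise(String source) {
        String noBlock = source.replaceAll("(?s)/\\*.*?\\*/", " "); // remove /* ... */ comments
        String noLine = noBlock.replaceAll("//[^\\n]*", " ");       // remove // comments
        return noLine.replaceAll("\\s+", "");                       // remove all whitespace
    }

    static boolean sameAfterNormalising(String a, String b) {
        return normalise(a).equals(normalise(b));
    }

    public static void main(String[] args) {
        String original = "int x = 1; // counter\n";
        String disguised = "int   x=1;  /* changed comment */";
        System.out.println(sameAfterNormalising(original, disguised));
    }
}
```

Editing comments or reformatting the layout no longer affects the comparison. Renamed identifiers still defeat it unless a tokenising step is applied as well.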
Sherlock
Sherlock Example
Sherlock – Document Set
MOSS, JPlag and Sherlock are effective
Results returned are not identical, but similar
User interface issues are important
Reliable sets of test data are unavailable.
None of these tools pulls material from the Internet
Effectiveness
Latent Semantic Analysis (Cosma and Joy, 2010)
– Documents as “bags of words”
– Known technique in IR
– Handles synonymy and polysemy
– Maths is nasty
Clone Detection (Brixtel et al., 2010; Koschke, 2007)
– Provenance of code in large software systems
– Use of very large datasets (e.g. SourceForge)
– Not targeted at plagiarism
– Tools include Dup and CCFinder
Other Approaches
Prevention Strategies
Sometimes students are asked to copy
– group assignments
We ask students to share ideas
– that’s what universities are for!
Real programmers re-use code
What is plagiarism?
– maybe not a simple question after all!
Plagiarism vs. Collaboration
Never re-use assignments.
Assess deeper levels of learning.
Use tasks allowing multiple solutions.
Integrate tasks.
Set tasks based on recent events / sources.
Configure assignments so each student is given a slightly different version.
Require assignments to be done in controlled conditions (labs).
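One way to give each student a slightly different version is to derive the variant from the student's ID, so that it is reproducible at marking time. A minimal sketch, in which the class `VariantPicker`, the ID format and the list of variants are all made up for illustration:

```java
/** Hypothetical helper: picks an assignment variant deterministically from a student ID. */
class VariantPicker {
    static int variantFor(String studentId, int variantCount) {
        // floorMod keeps the result in 0..variantCount-1 even for negative hash codes.
        return Math.floorMod(studentId.hashCode(), variantCount);
    }

    public static void main(String[] args) {
        String[] dates = {"1 March", "1 June", "1 September", "1 December"}; // made-up variants
        String id = "u1234567";                                              // made-up student ID
        System.out.println("Compute the sunset time for " + dates[variantFor(id, dates.length)]);
    }
}
```

The same ID always maps to the same variant, so a marker can regenerate the expected version without storing a lookup table.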
Prevention and Cure (1)
Define institution policy clearly.
Define rôles of institution bodies (exam board, tribunal, etc.)
Make disciplinary process also about learning.
Train staff.
Fast track procedure for minor cases.
Record and monitor.
Adapted from Carroll and Appleton (2001).
Prevention and Cure (2)
Process
Old Bailey, 2006. Unattributed (Wikimedia).
First offence (unless very serious, e.g. PhD), meeting with appropriate senior member of staff in Department:
– tutor / friend / SU representative allowed to accompany student
– nominal penalty available (e.g. mark of 0 for assignment)
– “formative” experience for the student
Second offence (or serious first offence)
– University tribunal
– tutor / friend / SU representative allowed to accompany student
– full range of penalties (including expulsion)
Typical Institution Policy
Quality Assurance Code of Practice QA53.
Three levels of offence – Group 1 (minor), Group 2 (moderate), Group 3 (severe).
Possible penalties available for an offence specified by Group (see table in appendix to QA53).
Groups 1 and 2 offences dealt with by Department.
Group 3 offences initiate Board of Inquiry.
Appeals are allowed under certain conditions only.
University of Bath
Discussion
The round table, Great Hall, Winchester Castle, by Graham Horn, 2009.
Evaluation