Large-scale Plagiarism Detection and Authorship Attribution


Upload: urbana

Post on 07-Jan-2016



Page 1: Large-scale Plagiarism Detection and Authorship attribution

Page 2: Large-scale Plagiarism Detection and Authorship attribution

References

Juxtapp: A Scalable System for Detecting Code Reuse Among Android Applications

Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn Song

On the Feasibility of Internet-Scale Author Identification

Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Eui Chul Richard Shin, Dawn Song, Emil Stefanov

Page 3: Large-scale Plagiarism Detection and Authorship attribution

Plagiarism detection used to be applied only to literary corpora and academia

Source code similarity/plagiarism detection is very important

“Moss” is the most widely known software similarity detection tool

Can provide valuable insight into malware detection

Page 4: Large-scale Plagiarism Detection and Authorship attribution

Generally not true

In the Android app domain, it can be!

86% of Android malware samples are repackaged versions of legitimate apps with malicious payloads (source: “Dissecting Android Malware: Characterization and Evolution”)

Similarity detection is crucial

Page 5: Large-scale Plagiarism Detection and Authorship attribution

Each Android app is packaged as an APK file, which has a .apk extension

Each APK file contains a .dex file, a Dalvik executable that is run by the Dalvik virtual machine

Fingerprint the APK using bit hashing


Page 8: Large-scale Plagiarism Detection and Authorship attribution

The k-gram length k was set to 5, chosen by experiment. Pairs of apps were drawn from 6,000 randomly sampled apps and the distance between each pair was computed. From k = 5 upward, the value of k has little impact on the computed distance

The mean basic-block size is 5.35 opcodes and the median is 2 opcodes, while the largest basic block in the dataset contains 35,517 opcodes
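
The k-gram step can be sketched as follows. This is a minimal illustration, assuming k-grams are taken over the opcode sequence of each basic block; the function names and the sample opcodes are hypothetical, and real input would come from a disassembled .dex file. In this sketch, blocks shorter than k contribute no features; the slides do not say how the original system treats them.

```python
from typing import Iterable, List, Set, Tuple

K = 5  # k-gram length reported on the slide


def kgrams(block: List[str], k: int = K) -> Iterable[Tuple[str, ...]]:
    """Yield every run of k consecutive opcodes inside one basic block."""
    for i in range(len(block) - k + 1):
        yield tuple(block[i:i + k])


def app_features(blocks: Iterable[List[str]], k: int = K) -> Set[Tuple[str, ...]]:
    """Collect the distinct k-grams over all basic blocks of one app."""
    features: Set[Tuple[str, ...]] = set()
    for block in blocks:
        features.update(kgrams(block, k))
    return features


# Hypothetical opcode sequences, one list per basic block.
blocks = [
    ["const", "invoke", "move", "add", "return", "goto"],
    ["const", "return"],  # shorter than k, so it yields no k-gram here
]
features = app_features(blocks)
```

With these two blocks, only the first one is long enough to produce k-grams, which shows why a median block size of 2 opcodes matters when choosing k.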

Page 9: Large-scale Plagiarism Detection and Authorship attribution

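The bit-vector size m is chosen by experiment so that m >> N, where N is the number of k-grams extracted from an application. A large m keeps hash collisions rare, so the similarity between two bit vectors stays close to the similarity between the two underlying k-gram feature sets

30,000 apps were used to determine m.

m ≈ 9 × N_90, set to 240,007, a prime number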
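
Hashing a feature set into a bit vector of size m might look like the sketch below. It is an illustration only: SHA-1 is a stand-in chosen because it is stable across runs, not the hash function used by Juxtapp, and `hash_index` and `to_bitvector` are hypothetical names. The bit vector is stored as a Python integer.

```python
import hashlib

M = 240_007  # bit-vector size reported on the slide


def hash_index(feature: tuple, m: int = M) -> int:
    """Map one k-gram to a bit position in [0, m) using a stable hash."""
    digest = hashlib.sha1(" ".join(feature).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def to_bitvector(features, m: int = M) -> int:
    """Return an int whose set bits mark the hashed positions of the features."""
    bits = 0
    for feature in features:
        bits |= 1 << hash_index(feature, m)
    return bits


# Hypothetical 2-gram features, just to exercise the functions.
vector = to_bitvector({("const", "invoke"), ("move", "return")})
```

Two different k-grams can land on the same bit, which is why m is chosen much larger than the number of k-grams per app.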

Page 10: Large-scale Plagiarism Detection and Authorship attribution

Given the bit-vector representations A and B of two apps, their similarity is computed by the formula:

J(A,B) = |A ∧ B| / |A ∨ B|

where ∧ and ∨ are bitwise AND and OR, and |·| counts the bits set to 1. This formula is a variation of the original Jaccard similarity.
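
A minimal sketch of this similarity, assuming each bit vector is held as a Python integer. The function name and the treatment of two empty vectors are assumptions, not details from the slides.

```python
def jaccard(a: int, b: int) -> float:
    """Compute |A AND B| / |A OR B| over bit vectors stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty vectors count as dissimilar
    return bin(a & b).count("1") / union


a = 0b101101
b = 0b100111
similarity = jaccard(a, b)  # 3 shared bits out of 5 set in either vector
```

Identical vectors score 1.0 and vectors with no bit in common score 0.0, so a threshold on this value can flag candidate code reuse.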

Page 11: Large-scale Plagiarism Detection and Authorship attribution

If the app is heavily obfuscated, then Juxtapp may not perform well

Use of third-party libraries can add a lot of noise and adversely affect the similarity score

Page 12: Large-scale Plagiarism Detection and Authorship attribution

Who wrote it?

Identify an anonymous author by comparing his/her writing style against a corpus of texts of known authorship

The primary application has shifted from the literary domain to forensics: terrorist threats, harassment

Page 13: Large-scale Plagiarism Detection and Authorship attribution

2.4 million posts from 100,000 blogs (almost a billion words)

Stylometry: identify the author based on writing style

Are n-gram techniques suitable? Not really, because they reveal more about the context than about the author

Page 14: Large-scale Plagiarism Detection and Authorship attribution

Prepare test set and training set

Build a classifier with the training set

Test the classifier with the test set

Which features should be considered?
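
The train/test workflow above can be sketched with a toy nearest-neighbor classifier. Everything here is an assumption made for illustration: the ten function words, the cosine-similarity rule, the function names, and the sample texts are hypothetical and are not the features or classifiers used in the paper.

```python
import math
from collections import Counter

# Hypothetical feature set: relative frequency of a few function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def features(text):
    """Turn a text into a vector of function-word frequencies."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def train(labelled_texts):
    """Training: store one feature vector per (author, text) pair."""
    return [(author, features(text)) for author, text in labelled_texts]


def predict(model, text):
    """Return the author of the most similar training text."""
    vec = features(text)
    return max(model, key=lambda item: cosine(item[1], vec))[0]


def accuracy(model, test_set):
    """Fraction of test texts attributed to the correct author."""
    hits = sum(predict(model, text) == author for author, text in test_set)
    return hits / len(test_set)


training_set = [
    ("alice", "the cat and the dog and the bird"),
    ("bob", "it is a fact that it is a rule"),
]
test_set = [
    ("alice", "the sun and the moon"),
    ("bob", "it is a trap"),
]
model = train(training_set)
score = accuracy(model, test_set)
```

Swapping in a different `features` function is the only change needed to try another feature set, which is the question the slide ends on.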


Page 16: Large-scale Plagiarism Detection and Authorship attribution

Syntax tree by the Stanford parser

Yule’s K

K = 10000 * (M - N) / (N * N)

N = total number of words in the text

M = ∑ i² * V_i

where V_i is the number of words that occur i times
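
Yule's K as defined above can be computed directly from word counts. This is a minimal sketch; the function name and the whitespace tokenization in the example are assumptions.

```python
from collections import Counter


def yules_k(words):
    """Yule's K = 10000 * (M - N) / (N * N) for a list of word tokens."""
    n = len(words)  # N: total number of words in the text
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_counts = Counter(words)
    # v[i] is the number of distinct words that occur exactly i times.
    v = Counter(word_counts.values())
    m = sum(i * i * v_i for i, v_i in v.items())  # M: sum of i^2 * V_i
    return 10_000 * (m - n) / (n * n)


# "a" occurs twice, "b" and "c" once: N = 4, M = 1*2 + 4*1 = 6.
k_value = yules_k("a a b c".split())
```

A text in which every word is distinct gives M = N and therefore K = 0; more repetition raises K.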

Page 17: Large-scale Plagiarism Detection and Authorship attribution

In 20% of cases the classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors

In 35% of cases the correct author is one of the top 20 guesses
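
The two figures above are top-1 and top-20 accuracy. A small sketch of how such a metric is computed from ranked guesses follows; the function name and the toy rankings are hypothetical.

```python
def top_k_accuracy(ranked_guesses, true_authors, k):
    """Fraction of cases where the true author is among the first k guesses."""
    hits = sum(
        truth in guesses[:k]
        for guesses, truth in zip(ranked_guesses, true_authors)
    )
    return hits / len(true_authors)


# Hypothetical ranked candidate lists for four anonymous texts, best guess first.
ranked = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["b", "c", "a"],
]
truth = ["a", "a", "a", "a"]
top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
```

Raising k can only keep the score the same or increase it, which matches the jump from 20% to 35% reported on the slide.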

