stylometry

Stylometry System CSIS CSIS Stylometry Projects, mostly Fall 2009 Project Seidenberg School of Computer Science and Information Systems

Upload: onawa

Post on 05-Jan-2016



TRANSCRIPT

Page 1: Stylometry

Stylometry System

CSIS

Stylometry

Projects, mostly Fall 2009 Project

Seidenberg School of Computer Science and Information Systems

Page 2: Stylometry


Description of Project

Stylometry is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship.

• Part I
  – Search to determine an interesting and unique application of stylometry for research
• Part II
  – Feasibility study on existing tools/applications for email authorship (250 words or less)

Page 3: Stylometry


Existing / Potential Uses of Stylometry

• Music Lyrics
• Music Melody
• Paintings
• Literary Works
• Forensic Linguistics
• Plagiarism
• Social Networking
• Electronic Mail
• Instant Messaging

- Social networking, electronic mail, and instant messaging are still in the early stages of study

Page 4: Stylometry


Use Cases

- Twitter
  - Used to verify existing Twitter accounts and help mitigate impersonations
- Electronic mail
  - Implemented in a corporate setting, helping identify anonymous emails meant to do harm
- Chat
  - Assist in determining authorship of instant messages
  - Similar to Twitter but needs to be dynamic

Page 5: Stylometry


Use Cases

- Terrorism
  - Help identify an author of terrorist content, or identify terrorist content by using contextual analysis
  - Applied to blogs, forums, wikis, email, chat, and other forms of digital content

Page 6: Stylometry


Tools discovered

- JGAAP (Java Graphical Authorship Attribution Program)

- Signature Tool

- C# Tool

- StyleTool

- Blog stylometry tool

- Stylometry tool

Page 7: Stylometry


Tools discovered

- JGAAP (Java Graphical Authorship Attribution Program)
  - Java-based tool
  - Runs on Windows and Linux
  - Identification tool
    - 1-of-n decision: given many known email authors, determine the author of one unknown email
    - One unknown email author compared to 99 known email authors
    - 100 total tests run
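The slides do not spell out how the 100 identification tests were organized. One plausible reading is a leave-one-out loop in which each email is held out in turn and attributed to the author of its nearest known sample. The sketch below assumes that reading. The character-frequency profile, the Euclidean distance, and all function names are illustrative assumptions, not JGAAP's actual internals.

```python
from collections import Counter
import math


def char_profile(text):
    """Relative frequency of each character (an assumed, illustrative feature)."""
    counts = Counter(text.lower())
    total = sum(counts.values()) or 1
    return {ch: n / total for ch, n in counts.items()}


def distance(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def identify(unknown_text, known):
    """1-of-n decision: return the author of the closest known sample.

    `known` is a list of (author, text) pairs.
    """
    target = char_profile(unknown_text)
    best_author, _ = min(
        ((author, distance(target, char_profile(text))) for author, text in known),
        key=lambda pair: pair[1],
    )
    return best_author


def leave_one_out_accuracy(samples):
    """Hold out each sample once and attribute it using all the others."""
    correct = 0
    for i, (author, text) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if identify(text, rest) == author:
            correct += 1
    return correct / len(samples)
```

With 100 collected emails, `leave_one_out_accuracy` would run 100 tests, each comparing one held-out email against the other 99.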

Page 8: Stylometry


Tools discovered

- C# Tool
  - Written in the C# programming language
  - Developed by prior Pace CS graduate students
  - Identification tool
    - 1-of-n decision: given many known email authors, determine the author of one unknown email
    - One unknown email author compared to 99 known email authors
    - 100 total tests run

Page 9: Stylometry


Tools discovered

- Signature Tool
  - Written in C programming language (not confirmed)
  - Created by Peter Millican of Hertford College, Oxford
  - Authentication tool
    - Decision is either match or no match
    - Match testing: 9 known and 1 unknown sample (same author)
    - No-match testing: 10 known and 1 unknown sample (two different authors)
    - A total of 105 tests were run
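An authentication tool answers a yes/no question rather than picking one author out of many. The sketch below illustrates that kind of decision under stated assumptions: it compares the word-length distribution of the unknown sample with that of the known samples and accepts when they are close. The distance measure (total variation), the `threshold` value, and the function names are hypothetical. The slides do not describe how the Signature tool actually reaches its decision.

```python
from collections import Counter


def word_length_distribution(text):
    """Share of words at each length, using whitespace tokenization."""
    lengths = Counter(len(word) for word in text.split())
    total = sum(lengths.values()) or 1
    return {length: n / total for length, n in lengths.items()}


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def authenticate(known_texts, unknown_text, threshold=0.25):
    """Return True (match) if the unknown sample is close to the known samples.

    `threshold` is an arbitrary illustrative cutoff, not a value from the study.
    """
    reference = word_length_distribution(" ".join(known_texts))
    candidate = word_length_distribution(unknown_text)
    return total_variation(reference, candidate) <= threshold
```

In a match test, `known_texts` would hold the 9 known samples and `unknown_text` the tenth sample by the same author. In a no-match test, the unknown sample comes from a different author.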

Page 10: Stylometry


Testing methodology

- Each team member submitted 20 (or 30) actual emails from 2 (or 3) different authors
- Total of 100 emails collected from 10 different authors
- Emails were removed from their native program and saved as text files
- Average email length: 195.7 words
- Different testing for identification and authentication tools
- For the authentication tool
  - False Accept Rate (FAR): rate at which a document is falsely attributed to an author
  - False Reject Rate (FRR): rate at which a document is not attributed to its true author
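The two error rates can be computed directly from a list of authentication trials. The sketch below assumes each trial is recorded as a pair of booleans: whether the known and unknown samples truly share an author, and whether the tool reported a match. The function name and the trial counts in the usage note are hypothetical and chosen only for illustration.

```python
def error_rates(trials):
    """Compute (FAR, FRR) from (same_author, accepted) boolean pairs.

    FAR = accepted different-author trials / all different-author trials
    FRR = rejected same-author trials / all same-author trials
    A rate is None when there are no trials of the corresponding kind.
    """
    genuine = [accepted for same_author, accepted in trials if same_author]
    impostor = [accepted for same_author, accepted in trials if not same_author]
    frr = genuine.count(False) / len(genuine) if genuine else None
    far = impostor.count(True) / len(impostor) if impostor else None
    return far, frr
```

For example, 15 same-author trials with 7 rejections give FRR = 7/15 ≈ 46.67%. In a match-only test, accuracy is simply 1 − FRR, and in a no-match-only test it is 1 − FAR.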

Page 11: Stylometry


Testing Results

JGAAP (Levenshtein Distance algorithm)

Events               Canonizers On   Canonizers Off
Words                50%             30%
Word Length          50%             30%
Characters           60%             40%
Syllables per Word   40%             30%
Word Bigrams         70%             60%
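For reference, Levenshtein distance is the minimum number of insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is the standard dynamic-programming formulation. It accepts any pair of sequences, so it works on character strings as well as on lists of words or other events. How JGAAP applies the distance to its event sets internally is not described in the slides.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings, lists of words, ...)."""
    # Keep the shorter sequence in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]
```

Called on two strings it compares characters. Called on `text.split()` lists it compares whole words, which is closer to a word-level event set.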

Signature Tool Match Test

Events        Accuracy   FRR
Word Length   53.33%     46.67%
Letters       46.67%     53.33%

Signature Tool No-Match Test

Events        Accuracy   FAR
Word Length   53.33%     46.67%
Letters       82.22%     17.78%

C# Tool Match Test

Accuracy: 57%

Categorizing the results based on the country of the author

Tool        Match (India)   Match (USA)   No-Match (India)   No-Match (USA)
JGAAP       50%             100%          NA                 NA
Signature   61.11%          75.00%        81.48%             83.33%
C# Tool     42%             80.00%        NA                 NA

Page 12: Stylometry


Earlier Study's Features: 20 of 55

1. Number of sentences beginning with upper case
2. Number of sentences beginning with lower case
3. Number of words
4. Average word length
5. Number of sentences
6. Average number of words per sentence
7. Number of paragraphs
8. Average number of words per paragraph
9. Number of exclamation marks
10. Number of number signs
11. Number of dollar signs
12. Number of ampersands
13. Number of percent signs
14. Number of apostrophes
15. Number of left parentheses
16. Number of right parentheses
17. Number of asterisks
18. Number of plus signs
19. Number of commas
20. Number of dashes
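Most of the features listed above are simple counts over the raw text. The sketch below extracts a subset of them. The splitting rules are assumptions made for illustration (sentences end at `.`, `!`, or `?`, paragraphs are separated by blank lines, and words are whitespace-separated tokens with punctuation attached), and the function and key names are hypothetical. The earlier study's exact definitions are not given in the slides.

```python
import re


def extract_features(text):
    """Compute a few of the listed stylistic features for one email."""
    words = text.split()
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "sentences_starting_upper": sum(1 for s in sentences if s[0].isupper()),
        "sentences_starting_lower": sum(1 for s in sentences if s[0].islower()),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "num_paragraphs": len(paragraphs),
        "num_exclamation_marks": text.count("!"),
        "num_dollar_signs": text.count("$"),
        "num_commas": text.count(","),
        "num_dashes": text.count("-"),
    }
```

The resulting dictionary can be turned into a fixed-order numeric vector, one per email, and fed to whatever distance or classifier the tool uses.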

Page 13: Stylometry


Conclusion

- Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric email author identification.
- Categorizing email samples by country of origin seems to yield better accuracy for all three tools tested.

Page 14: Stylometry


Recommendations

- Further testing and research using email from authors of different countries
- Continue to refine and add to the stylistic feature set created by prior Pace graduate students
- Include new features becoming more prevalent in digital content
  - Emoticons, hyperlinks
  - Internet slang: BRB, LOL, TTYL
- Authors who wish to disguise their identity need to be considered and researched further
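The newer features recommended above (emoticons, hyperlinks, Internet slang) can be counted with a few regular expressions. The sketch below is a minimal illustration. The emoticon pattern, the URL pattern, and the slang list are assumptions that cover only common cases, with the slang set taken from the three examples on the slide. The function and key names are hypothetical.

```python
import re

# Western-style emoticons such as :) ;-) :D =P (illustrative, not exhaustive).
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPpOo/\\|]")
# Bare-bones hyperlink pattern (assumed; real URL matching is more involved).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Slang terms named on the slide; a real feature set would use a longer list.
SLANG = {"brb", "lol", "ttyl"}


def digital_features(text):
    """Count hyperlinks, emoticons, and slang terms in one message."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that "://" is not miscounted as an emoticon.
    without_urls = URL_RE.sub(" ", text)
    emoticons = EMOTICON_RE.findall(without_urls)
    tokens = re.findall(r"[a-z]+", without_urls.lower())
    return {
        "num_hyperlinks": len(urls),
        "num_emoticons": len(emoticons),
        "num_slang_terms": sum(1 for token in tokens if token in SLANG),
    }
```

These counts could be appended to the existing feature vector alongside the punctuation and word statistics from the earlier study.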