Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online


Post on 15-May-2015


TRANSCRIPT

Page 1: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Detecting Deception in Writing Style

Sadia Afroz, Michael Brennan and Rachel Greenstadt. Privacy, Security and Automation Lab

Drexel University

Page 2: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Overview

•  Authorship recognition
•  Authorship recognition in adversarial environment
•  Deception detection
•  Experiments on different datasets

Page 3: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Authorship recognition

Who wrote the document?

Page 4: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Authorship recognition

Stylometry:
–  An authorship recognition system based solely on writing style.
–  Not handwriting
–  Only linguistic style: word choice, sentence length, parts-of-speech usage, …

Page 5: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Why does it work?

•  Everybody has learned language differently

Page 6: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

How regular authorship recognition works

[Diagram: Extract features → Machine Learning System]

Page 7: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

[Diagram: Document of unknown authorship → Extract features → Machine Learning System → Determine authorship]
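The two-stage flow on these slides (extract features from documents of known authors, then use the trained system on a document of unknown authorship) can be sketched as follows. This is a minimal illustration and not the system used in the talk: the feature list, the `FUNCTION_WORDS` set, the nearest-neighbor rule and all function names are assumptions chosen to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# A small illustrative set of function words (an assumption, not the talk's feature set).
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "i", "it", "was"]


def extract_features(text):
    """Map a document to a vector of simple writing-style measurements."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    counts = Counter(words)
    features = [
        n_words / max(len(sentences), 1),      # average sentence length in words
        sum(len(w) for w in words) / n_words,  # average word length in characters
        len(counts) / n_words,                 # share of unique words
    ]
    # Relative frequency of each function word.
    features.extend(counts[w] / n_words for w in FUNCTION_WORDS)
    return features


def train(samples):
    """samples: list of (author, text). Returns a list of (author, feature vector)."""
    return [(author, extract_features(text)) for author, text in samples]


def determine_authorship(model, text):
    """Return the author of the training sample closest to the unknown document."""
    unknown = extract_features(text)
    # Features are left unscaled here, so sentence length dominates the distance.
    return min(model, key=lambda item: math.dist(item[1], unknown))[0]


# Toy training data with two made-up authors.
model = train([
    ("alice", "I went home. It was late. I ate. I slept. The dog barked."),
    ("bob", "The committee, having considered the extensive documentation that "
            "was submitted in the preceding quarter, concluded that further "
            "deliberation was necessary before any determination could "
            "reasonably be made."),
])
print(determine_authorship(model, "I woke up. It was cold. I made tea."))
```

A real system would use far more text per author, normalize the features, and replace the nearest-neighbor rule with a stronger learner such as the SVM or neural network named later in the slides.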

Page 8: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Assumptions

•  Writing style is invariant.
  –  It's like a fingerprint, you can't really change it.
  –  Authorship recognition can identify you if there are sufficient writing samples and a set of suspects.

Page 9: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Adversarial Attacks

•  Imitation or framing attack
  –  Where one author imitates another author
  –  Participants were asked to imitate Cormac McCarthy in writing about their day
•  Obfuscation attack
  –  Where an author hides his regular style
  –  Participants were asked to write about their neighborhood in a different style

M. Brennan and R. Greenstadt. Practical attacks against authorship recognition techniques. In Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA, 2009.

Page 10: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

[Figure: Accuracy in detecting authorship of regular documents. x-axis: number of authors (5 to 40); y-axis: accuracy (0 to 1). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

More than 80% accurate in detecting authorship with 40 authors in regular documents

Page 11: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

[Figure: Accuracy in detecting authorship of obfuscated documents. x-axis: number of authors (5 to 40); y-axis: accuracy (0 to 1). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

Accuracy is less than random chance in obfuscated documents
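For reference, the random-chance baseline that these accuracies are compared against can be worked out directly. The sketch below assumes that "Random" means guessing uniformly among equally likely candidate authors, which the slides do not state explicitly; the function name is illustrative.

```python
def random_chance_accuracy(num_authors):
    """Expected accuracy of uniform random guessing among equally likely authors."""
    if num_authors < 1:
        raise ValueError("need at least one candidate author")
    return 1.0 / num_authors


# Baseline for the author-set sizes on the x-axis of the accuracy plots.
for n in range(5, 45, 5):
    print(n, round(random_chance_accuracy(n), 3))
```

Under that assumption the baseline falls from 0.2 with 5 authors to 0.025 with 40, so "less than random chance" is a very low bar at the right-hand end of the plot.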

Page 12: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

[Figure: Accuracy in detecting authorship of imitated documents. x-axis: number of authors (5 to 40); y-axis: accuracy (0 to 1). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

Accuracy is less than random chance in imitated documents

Page 13: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Can we detect Stylistic Deception?

[Diagram labels: Regular, Obfuscated, Imitated]

Page 14: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Extended-Brennan-Greenstadt Corpus

•  56 authors
  –  12 of the participants are from Drexel University
  –  The rest are paid workers from Amazon Mechanical Turk
•  Three kinds of writing samples
  –  Regular writings (5000 words)
  –  Imitated writing
    •  A 500-word article describing a day
    •  Imitate Cormac McCarthy from 'The Road'
  –  Obfuscated writing
    •  A 500-word article describing neighborhood
    •  Hide own writing style

Page 15: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Detecting stylistic deception is possible

[Figure: bar chart on a 0 to 100 scale for Regular, Imitation and Obfuscation documents. Series: Writeprint, SVM; Lying-detection, J48; 9-feature set, J48. Bar labels in extraction order: 98, 85, 89.5, 95.7, 75.3, 59.9, 94.5, 48, 43.]

Page 16: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

[Figure: Feature Changes in Imitation and Obfuscation. Horizontal axis from -80 to 100; series: Imitation, Obfuscation. Features shown: Average sentence length, Gunning-Fog readability index, Cardinal number, Adjective, Average word length, Average syllables per word, Existential there, Adverb, Unique words, Verb, Short words, Particle, Sentence count, Personal pronoun.]
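Several of the features named on this slide can be computed without a part-of-speech tagger. The sketch below measures a few of them and reports how much each one shifts between a regular and a deceptive sample. It assumes the chart's axis is a percent change, which the transcript does not confirm; the vowel-group syllable counter and the three-letter cutoff for "short words" are rough heuristics chosen for illustration, and all function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one per word)."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def style_features(text):
    """Compute a handful of tagger-free stylometric features for one document."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for s in syllables if s >= 3)
    return {
        "sentence_count": len(sentences),
        "avg_sentence_length": n_words / n_sentences,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_syllables_per_word": sum(syllables) / n_words,
        "unique_words": len({w.lower() for w in words}),
        "short_words": sum(1 for w in words if len(w) <= 3),  # cutoff is an assumption
        # Gunning fog index: 0.4 * (words per sentence + 100 * complex-word ratio)
        "gunning_fog": 0.4 * (n_words / n_sentences + 100 * complex_words / n_words),
    }


def percent_change(regular_text, deceptive_text):
    """Percent change of each feature from the regular to the deceptive sample."""
    before = style_features(regular_text)
    after = style_features(deceptive_text)
    return {
        name: 100.0 * (after[name] - before[name]) / before[name]
        for name in before
        if before[name] != 0  # skip features with no baseline to compare against
    }


changes = percent_change(
    "I walked to the store. I bought some bread.",
    "He walked. He stopped. He looked at the grey road.",
)
for name, value in sorted(changes.items()):
    print(f"{name}: {value:+.1f}%")
```

The part-of-speech features on the slide (adjective, adverb, verb, particle, personal pronoun, cardinal number, existential there) would additionally need a tagger and are left out here.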

Page 17: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Problem with the dataset: Topic Similarity

•  All the deceptive documents were of the same topic.
•  Non-content-specific features have the same effect as content-specific features.

[Figure: Effect of different feature set in detecting adversarial authorship. x-axis: different writing samples (Imitation, Regular, Obfuscation); y-axis: F-measure (0 to 1). Series: Syntactic, Lexical, Content.]

Page 18: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Hemingway-Faulkner Imitation Corpus

•  Articles from the International Imitation Hemingway Contest (2000-2005)
•  Articles from the Faux Faulkner Contest (2001-2005)
•  Original excerpts of Ernest Hemingway and William Faulkner

Page 19: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Deception detection is possible even when the topic is not similar

•  81.2% accurate in detecting imitated documents.

Page 20: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Long-term deception: A Gay Girl In Damascus

–  Original author was a 40-year-old American citizen, Thomas MacMaster.
–  Pretended to be a Syrian gay woman, Amina Arraf.
–  The author worked for at least 5 years to create a new style.

[Images: Thomas MacMaster. Fake picture of Amina Arraf.]

Page 21: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Long-term deception is hard to detect

•  None of the blog posts were found to be deceptive.
•  But regular authorship recognition can help.
•  We tried to attribute authorship of the blog posts using Thomas (as himself), Thomas (as Amina), Britta (Thomas's wife).

Page 22: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Long-term deception: Authorship recognition of the blog posts

•  Thomas MacMaster: 54%
•  Amina Arraf: 43%
•  Britta (Thomas's wife): 3%
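The percentages on this slide presumably give the share of blog posts attributed to each candidate author. Assuming that reading, the tally can be sketched as below: classify every post, then count how often each candidate is chosen. The function name and the candidate labels are placeholders, and the predictions are made up for illustration, not taken from the experiment.

```python
from collections import Counter


def attribution_shares(predicted_authors):
    """Percentage of documents attributed to each candidate author."""
    counts = Counter(predicted_authors)
    total = sum(counts.values())
    return {author: 100.0 * n / total for author, n in counts.items()}


# Hypothetical per-post predictions, for illustration only.
predictions = ["candidate_a", "candidate_b", "candidate_a", "candidate_c"]
print(attribution_shares(predictions))
```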

Page 23: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Future work

•  Intrusion detection
•  Social spam detection
•  Identifying quality discourse

Page 24: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Two Tools

•  JStylo: Authorship Recognition Analysis Tool.
•  Anonymouth: Authorship Recognition Evasion Tool.
•  Free, Open Source. (GNU GPL)
•  Alpha releases available today at https://psal.cs.drexel.edu
  –  Migrating to GitHub soon.

Page 25: Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online

Privacy, Security and Automation Lab (https://psal.cs.drexel.edu)

•  Faculty
  –  Dr. Rachel Greenstadt
•  Graduate Students
  –  Sadia Afroz (Deception Detection Lead)
  –  Diamond Bishop
  –  Michael Brennan
  –  Aylin Caliskan
  –  Ariel Stolerman (JStylo Lead Developer)
•  Undergraduate Students
  –  Pavan Kantharaju
  –  Andrew McDonald (Anonymouth Lead Developer)