PDF Mirage: Content Masking Attack Against Information-Based Online Services Ian Markwood*, Dakun Shen*, Yao Liu, and Zhuo Lu University of South Florida *Co-first authors Presented by Ian Markwood



Outline

• Motivation
• Background Information
• Content Masking Attack
  – Against Conference Reviewer Assignment Systems
  – Against Plagiarism Detection
  – Against Document Indexing
• Content Masking Defense
• Conclusion

Motivation

• The Adobe Portable Document Format (PDF) is the standard for consistent cross-computer document rendering
• PDF documents cannot be edited with commonly accessible tools (MS Word, Adobe Reader, etc.)
• This confers a sense of integrity to the document for the end user

Motivation

• There is a disconnect between the content of a PDF and what is actually displayed
• A computer and a human see two different things

Motivation

• Within this disconnect we can perform a content masking attack which compromises the content integrity of PDF files
• Three information-based online systems rely on the integrity of PDF documents:
  – Automatic reviewer assignment systems for academic papers
  – Plagiarism detection systems
  – Search engines


Background Information

• What do these services have in common?
  – They support PDF submission
  – They scrape the text out of submitted PDF files to perform their function, rather than using Optical Character Recognition (OCR)
  – Text scraping copies the plain text out of all strings within the PDF file
  – Ignores font associated with text

Background Information

• Automatic conference reviewer assignment systems
  – Use topic matching to assign reviewers to submitted papers
  – Compare frequent words appearing in reviewers’ published papers to frequent words appearing in submitted papers
  – INFOCOM uses Latent Semantic Indexing (LSI)

Background Information

• Plagiarism detection systems
  – Measure similarity between strings within subject document and all other documents submitted thus far
• Document indexing
  – Search engines return documents based on the similarity of their content to the search string


Content Masking Attack

[Figure labels: plaintext, cipher, ciphertext]

Content Masking Attack

• “Masking font” – a custom font with some rearrangement of the character/glyph relationship
• Open source tools such as FontForge allow copy/paste of character glyphs within fonts
• Custom fonts may be imported into LaTeX
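The character/glyph rearrangement can be illustrated with a toy model. The sketch below is not the authors' tooling: it assumes a font can be represented as a plain dictionary from stored character code to displayed glyph, and the names `NORMAL_FONT`, `make_masking_font`, `render`, and `scrape` are hypothetical.

```python
import string

# Hypothetical model: a font is a dict from character code (what the PDF
# stores) to glyph (what is drawn). Glyphs are named by the letter they depict.
NORMAL_FONT = {c: c for c in string.ascii_lowercase + " "}

def make_masking_font(stored, displayed):
    """Build a font in which the stored string renders as the displayed one."""
    if len(stored) != len(displayed):
        raise ValueError("this sketch only handles equal-length strings")
    remap = {}
    for code, glyph in zip(stored, displayed):
        # One font can give each character code only one glyph.
        if remap.setdefault(code, glyph) != glyph:
            raise ValueError(f"{code!r} would need two glyphs in one font")
    return {**NORMAL_FONT, **remap}

def render(text, font):
    """What a human reader sees: each code drawn with the font's glyph."""
    return "".join(font[c] for c in text)

def scrape(text, font):
    """What a text scraper extracts: the stored codes, ignoring the font."""
    return text

font = make_masking_font(stored="cat", displayed="dog")
print(render("cat", font))  # dog
print(scrape("cat", font))  # cat
```

The human reads "dog" while the scraper records "cat", which is the disconnect the attack relies on.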


Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• An author can target a specific reviewer by replacing enough keywords in the paper with keywords from the reviewer’s papers
• Keywords – uncommon words that appear most frequently

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Algorithm:
  – Order keywords in subject paper and target reviewer’s corpus by descending frequency
  – Construct a “word mapping” between these two lists
  – Create a “character mapping” between the letters of each pair of words
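The three algorithm steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: keywords are approximated as the most frequent non-stopwords, the word lists are invented, and the function names are hypothetical. Words of unequal length are rejected here rather than handled.

```python
from collections import Counter

def top_keywords(words, stopwords, n):
    """Return the n most frequent words that are not common stopwords."""
    counts = Counter(w for w in words if w not in stopwords)
    return [w for w, _ in counts.most_common(n)]

def word_mapping(paper_words, reviewer_words, stopwords, n):
    """Pair the paper's k-th keyword with the reviewer's k-th keyword."""
    return list(zip(top_keywords(paper_words, stopwords, n),
                    top_keywords(reviewer_words, stopwords, n)))

def character_mapping(displayed, stored):
    """Pair each stored letter with the glyph that must be shown for it."""
    if len(displayed) != len(stored):
        raise ValueError("word length disparity is not handled in this sketch")
    return list(zip(stored, displayed))

# Invented example data: the paper's words are what the reader should see,
# the reviewer's words are what the scraper should extract.
paper = "font font font mask mask pdf the the the the".split()
reviewer = "mesh mesh mesh node node the the".split()
pairs = word_mapping(paper, reviewer, stopwords={"the"}, n=2)
print(pairs)                         # [('font', 'mesh'), ('mask', 'node')]
print(character_mapping(*pairs[0]))  # [('m', 'f'), ('e', 'o'), ('s', 'n'), ('h', 't')]
```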

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Challenges:
  – One-to-Many Character Mapping
  – Word Length Disparity
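To see why one-to-many character mapping is a challenge: a single font can draw each stored character as only one glyph, so a stored letter that must appear as two different glyphs needs a second font. The sketch below uses a greedy first-fit assignment, which is an assumption made for illustration and not necessarily the strategy used in the talk; `assign_fonts` is a hypothetical name. Word length disparity is not addressed here.

```python
def assign_fonts(char_pairs):
    """Greedily place each (stored, displayed) pair in the first font that can hold it.

    Returns the list of fonts (dicts from stored code to glyph) and, for each
    pair, the index of the font it must be typeset in.
    """
    fonts = []
    assignment = []
    for code, glyph in char_pairs:
        for index, font in enumerate(fonts):
            # A font fits if the code is unused or already maps to this glyph.
            if font.get(code, glyph) == glyph:
                font[code] = glyph
                assignment.append(index)
                break
        else:
            fonts.append({code: glyph})
            assignment.append(len(fonts) - 1)
    return fonts, assignment

# Stored "see" must be displayed as "cat": 'e' needs two different glyphs.
fonts, assignment = assign_fonts(list(zip("see", "cat")))
print(len(fonts))   # 2
print(assignment)   # [0, 0, 1]
```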

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – We have reproduced the INFOCOM automatic reviewer assignment system
  – This includes 114 TPC members from a well-known security conference and 2094 of their recently published papers for training
  – 100 additional papers used as testing data

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to one reviewer

[Figure: similarity scores relative to the number of words masked. Blue stars show the desired matching.]

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to one reviewer

[Figure: word masking requirements for all 100 testing papers]

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to one reviewer

[Figure: masking font requirements for all 100 testing papers]

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to multiple reviewers

[Figure: similarity scores relative to the number of words masked, between a paper and three reviewers. Blue stars, black circles, and green triangles show the desired matchings.]


Content Masking Attack Against Plagiarism Detection

• A cheating student can evade a plagiarism detector by replacing the underlying text with gibberish
• Use a “scrambling font” to render the gibberish as legible (plagiarized) text
• Results in zero similarity with existing work

Content Masking Attack Against Plagiarism Detection

• Zero similarity is unrealistic due to common phrases in language
• We evaluate three methods to target a specific similarity score
• Each method chooses what text to scramble and what text to leave unaltered

Content Masking Attack Against Plagiarism Detection

• By letter
  – Use scrambling font which scrambles all characters
  – Remove characters from being scrambled by order of their frequency of appearance in the language
  – Continue removing characters until a target similarity score is reached
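The by-letter method can be sketched as a loop that unscrambles one letter at a time. Several pieces below are assumptions for illustration only: ROT13 stands in for the gibberish substitution, `FREQ_ORDER` is a commonly cited English letter-frequency ordering, letters are released from most to least frequent (the slide does not state the direction), and `score` is a hypothetical callable standing in for a plagiarism checker. Only lowercase text is handled.

```python
import string

FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"  # assumed ordering, most frequent first
LOWER = string.ascii_lowercase
ROT13 = str.maketrans(LOWER, LOWER[13:] + LOWER[:13])  # stand-in gibberish cipher

def stored_text(text, scrambled_letters):
    """Text as stored in the PDF: scrambled letters are replaced by gibberish."""
    return "".join(c.translate(ROT13) if c in scrambled_letters else c for c in text)

def scramble_by_letter(text, score, target):
    """Unscramble letters one by one until score(stored) reaches the target."""
    scrambled = set(FREQ_ORDER)
    for letter in FREQ_ORDER:
        candidate = stored_text(text, scrambled)
        if score(candidate) >= target:
            return candidate, scrambled
        scrambled.discard(letter)
    return stored_text(text, scrambled), scrambled  # nothing left scrambled

original = "the cat sat"

def toy_score(stored):
    """Toy similarity: fraction of words identical to the original."""
    pairs = list(zip(stored.split(), original.split()))
    return sum(a == b for a, b in pairs) / len(pairs)

stored, scrambled = scramble_by_letter(original, toy_score, target=0.3)
print(stored)  # tue pat sat
```

In a real document the scrambling font would be applied only to the letters still in `scrambled`, with the normal font used for the rest.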

Content Masking Attack Against Plagiarism Detection

• By word, in frequency of appearance
  – Use scrambling font which scrambles all characters
  – Order distinct words by frequency of appearance
  – Apply scrambling font to all words
  – Remove scrambling font from distinct words until a target similarity score is reached
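A minimal sketch of the by-word, frequency-ordered method follows. It assumes that the most frequent distinct words are released first (the slide does not state the direction), and it takes `score` and `scramble` as hypothetical callables standing in for a plagiarism checker and for the gibberish substitution.

```python
from collections import Counter

def scramble_by_word_frequency(words, score, target, scramble):
    """Scramble every word, then release distinct words in frequency order."""
    order = [w for w, _ in Counter(words).most_common()]
    plain = set()  # distinct words left unscrambled
    for next_word in order + [None]:
        stored = [w if w in plain else scramble(w) for w in words]
        if score(stored) >= target or next_word is None:
            return stored, plain
        plain.add(next_word)

words = "the cat and the dog and the bird".split()

def toy_score(stored):
    """Toy similarity: fraction of word positions identical to the original."""
    return sum(a == b for a, b in zip(stored, words)) / len(words)

# Reversing each word stands in for the gibberish substitution.
stored, plain = scramble_by_word_frequency(words, toy_score, 0.5, lambda w: w[::-1])
print(" ".join(stored))  # the tac and the god and the drib
```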

Content Masking Attack Against Plagiarism Detection

• By word, at random
  – Use scrambling font which scrambles all characters
  – Iterate over document, applying scrambling font at random according to chosen probability
  – Modify probability until a target similarity score is reached
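The random method can be sketched with a seeded generator and a search over the probability. The slide only says the probability is modified until the target is reached; the bisection used below is an assumption, as are the hypothetical `score` and `scramble` callables. Fixing the seed keeps the random draws identical between trials, so raising the probability only ever scrambles more words.

```python
import random

def scramble_at_random(words, probability, scramble, seed):
    """Scramble each word independently with the given probability."""
    rng = random.Random(seed)
    return [scramble(w) if rng.random() < probability else w for w in words]

def tune_probability(words, score, target, scramble, seed=0, steps=20):
    """Bisect on the probability until the score sits at or just under target."""
    low, high = 0.0, 1.0  # at probability 1.0 every word is scrambled
    for _ in range(steps):
        mid = (low + high) / 2
        if score(scramble_at_random(words, mid, scramble, seed)) > target:
            low = mid   # still too similar: scramble more words
        else:
            high = mid  # at or below target: try scrambling fewer
    return high, scramble_at_random(words, high, scramble, seed)

words = [f"word{i}" for i in range(50)]

def toy_score(stored):
    """Toy similarity: fraction of word positions identical to the original."""
    return sum(a == b for a, b in zip(stored, words)) / len(words)

probability, stored = tune_probability(words, toy_score, 0.10, lambda w: w[::-1])
print(round(probability, 3), toy_score(stored))
```

With a real checker, fully scrambled text may still score above zero, so the starting assumption that probability 1.0 meets the target would need to be checked.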

Content Masking Attack Against Plagiarism Detection

• Experiment:
  – Apply scrambling fonts to 10 published papers and target a 5-15% similarity score as measured by Turnitin


Content Masking Attack Against Document Indexing

• An attacker can place spam or illicit content in PDF documents indexed by search engines
• These PDFs can show ads instead of legitimate content that users search for

Content Masking Attack Against Document Indexing

• This can be considered a special case of the reviewer assignment system subversion method
• Instead of masking particular words, we are masking the entire document
• Not constrained by spaces, however

Content Masking Attack Against Document Indexing

• The larger number of masked characters requires more masking fonts
• Instead of generating fonts ad hoc, we make one font for each glyph
• ~84 fonts
• Allows for easy automated generation of masked documents
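One plausible reading of "one font for each glyph" is that each font draws every character code as the same single glyph, so masking reduces to choosing, per character, the font named after the glyph to display. The sketch below illustrates that reading only; it is an assumption, the function names are hypothetical, and the example strings are invented. It also assumes the displayed and stored texts have equal length.

```python
def mask_document(displayed, stored):
    """Return (stored_code, glyph_font) runs; each font is named by its one glyph."""
    if len(displayed) != len(stored):
        raise ValueError("this sketch assumes equal-length texts")
    return [(code, glyph) for code, glyph in zip(stored, displayed)]

def render(runs):
    """What a reader sees: every code is drawn as its font's single glyph."""
    return "".join(glyph for _, glyph in runs)

def scrape(runs):
    """What an indexer extracts: the stored codes only."""
    return "".join(code for code, _ in runs)

def fonts_needed(runs):
    """The set of single-glyph fonts the masked document must embed."""
    return {glyph for _, glyph in runs}

runs = mask_document(displayed="buy pills", stored="pdf paper")
print(render(runs))             # buy pills
print(scrape(runs))             # pdf paper
print(len(fonts_needed(runs)))  # 8
```

Because the choice of font depends only on the glyph to display, document generation needs no per-document font design, which matches the slide's point about easy automation.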

Content Masking Attack Against Document Indexing

• Experiment
  – Used 5 well-known published papers
  – Masked each as gibberish

Content Masking Attack Against Document Indexing

• Experiment
  – Submitted them to leading search engines for indexing (Google, Bing, Yahoo!, DuckDuckGo)
  – Results were the same for all test documents

Content Masking Attack Against Document Indexing

• Experiment

Search Engine   Indexed Papers   Attack Successful   Evades Spam Detection   Not Later Removed
Google          ✔                ✘                   ✘                       ✘
Bing            ✔                ✔                   ✔                       ✔
Yahoo!          ✔                ✔                   ✘ → ✔                   ✔
DuckDuckGo      ✔                ✔                   ✔                       ✔



Content Masking Defense

• One feasible defense: perform Optical Character Recognition (OCR) on the document to check the integrity of each character.
• Problem:
  – High computational overhead
  – High false positive rate

[Figure label: 50,000-75,000 characters]

Content Masking Defense – Our Proposal

• Render each character in the fonts embedded in the subject PDF file and perform OCR on those character codes rather than the rendered PDF file itself.
• Save processing time

[Figure labels: 100-2000 characters; 50,000-75,000 characters]

Challenges and Technical Details

• Challenge 1: Whole font file is embedded
  – Contains $2^{16} = 65,536$ characters maximum
  – Causes high computational overhead
• Solution: Scan the document to extract the characters used, and perform OCR on the series of characters used in each font.
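The solution to Challenge 1 can be sketched as two passes: collect the characters actually used in each font, then check only those. The document representation (a list of `(font_name, text)` runs) and the `ocr_glyph(font_name, char)` callable, which stands in for rendering one character in a font and running OCR on it, are both assumptions made for illustration.

```python
from collections import defaultdict

def characters_used(runs):
    """Map each font name to the set of characters typeset in that font."""
    used = defaultdict(set)
    for font_name, text in runs:
        used[font_name].update(text)
    return used

def verify_fonts(runs, ocr_glyph):
    """Check only the characters in use; report (font, stored, seen) mismatches."""
    mismatches = []
    for font_name, chars in characters_used(runs).items():
        for char in sorted(chars):
            seen = ocr_glyph(font_name, char)
            if seen != char:
                mismatches.append((font_name, char, seen))
    return mismatches

def fake_ocr(font_name, char):
    """Toy OCR stand-in: the font named "Mask" draws c, a, t as d, o, g."""
    if font_name == "Mask":
        return {"c": "d", "a": "o", "t": "g"}.get(char, char)
    return char

runs = [("Normal", "hello"), ("Mask", "cat")]
print(verify_fonts(runs, fake_ocr))
# [('Mask', 'a', 'o'), ('Mask', 'c', 'd'), ('Mask', 't', 'g')]
```

Here only 7 distinct font/character combinations are checked, regardless of how long the document is.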

Challenges and Technical Details

• Challenge 2: Special characters

[Figure: the character þ (Unicode: 0xfe) resembles p (Unicode: 0x70). After OCR, the resulting Unicode mismatch raises a false alarm.]

Challenges and Technical Details

• Solution: Font Training
  1. Perform OCR on the font and list all similar characters.
  2. If the detected glyph is in the similar character list, replace the character’s Unicode value with that of the normal letter it looks like.

Font Training

[Figure: þ (Unicode: 0xfe) is in the list, so its Unicode is changed to 0x70]

Whitelist

Similar character   Normal letter
ã  0xe3             a  0x61
ɧ  0x267            h  0x68
Ѡ  0x460            W  0x57
……                  ……
þ  0xfe             p  0x70
……                  ……
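The whitelist lookup can be sketched as a normalization step applied before comparing code points. The table below contains only the four entries visible on the slide, and normalizing both the stored and the OCR-detected code point is an assumption made here so that the comparison works in either direction; `WHITELIST`, `normalize`, and `is_masked` are hypothetical names.

```python
# Look-alike code points mapped to the normal letter they resemble
# (only the entries shown on the slide).
WHITELIST = {
    0xE3: 0x61,   # ã -> a
    0x267: 0x68,  # ɧ -> h
    0x460: 0x57,  # Ѡ -> W
    0xFE: 0x70,   # þ -> p
}

def normalize(code_point):
    """Replace a whitelisted look-alike with its normal letter."""
    return WHITELIST.get(code_point, code_point)

def is_masked(stored_code, ocr_code):
    """Flag a character only if the mismatch survives normalization."""
    return normalize(stored_code) != normalize(ocr_code)

print(is_masked(0xFE, 0x70))          # False: þ read as p is a look-alike
print(is_masked(ord("c"), ord("d")))  # True: c drawn as d is a real mismatch
```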

Font Verification Performance

• Experiment 1
  – To analyze the accuracy of our Font Verification method and the Whole Document OCR method
  – Generated 10 PDF files with masked characters varying from 5-20% in frequency of appearance

Performance – Experiment 1

Font Verification Performance

• Experiment 2
  – To analyze the effects of document length on the detection rate for each method.
  – Generated 10 PDF files ranging from 1-10 pages in length and having an even 30% distribution of masked characters

Performance – Experiment 2

Font Verification Performance

• Experiment 3
  – To analyze the effect of document length on the detection time for each method
  – Generated 20 PDF files ranging from 1-20 pages in length and having a 30% distribution of masked characters

Performance – Experiment 3


Conclusion

• We describe a new content masking attack against the Adobe PDF standard
• We create and evaluate algorithms for effectively performing attacks against:
  – Automatic reviewer assignment systems
  – Plagiarism detection
  – Document indexing
• We create and evaluate a font verification algorithm that is more accurate and lightweight than OCR

Thank you!

• Questions?

PDF file image from http://iconbug.com/detail/icon/5940/file-format-pdf/
TrueType font file image from https://typography.guru/journal/opentype-myths-explained-r24/