PDF Mirage: Content Masking Attack Against Information-Based Online Services
Ian Markwood*, Dakun Shen*, Yao Liu, and Zhuo Lu
University of South Florida
*Co-first authors
Presented by Ian Markwood
Outline
• Motivation
• Background Information
• Content Masking Attack
– Against Conference Reviewer Assignment Systems
– Against Plagiarism Detection
– Against Document Indexing
• Content Masking Defense
• Conclusion
Motivation
• The Adobe Portable Document Format (PDF) is the standard for consistent cross-computer document rendering
• PDF documents cannot be edited with commonly accessible tools (MS Word, Adobe Reader, etc.)
• This confers a sense of integrity to the document for the end user
Motivation
• There is a disconnect between the content of a PDF and what is actually displayed
• A computer and a human see two different things
Motivation
• Within this disconnect we can perform a content masking attack which compromises the content integrity of PDF files
• Three information-based online systems rely on the integrity of PDF documents:
– Automatic reviewer assignment systems for academic papers
– Plagiarism detection systems
– Search engines
Background Information
• What do these services have in common?
– They support PDF submission
– They scrape the text out of submitted PDF files to perform their function, rather than using Optical Character Recognition (OCR)
– Text scraping copies the plain text out of all strings within the PDF file
– It ignores the font associated with the text
Background Information
• Automatic conference reviewer assignment systems
– Use topic matching to assign reviewers to submitted papers
– Compare frequent words appearing in reviewers’ published papers to frequent words appearing in submitted papers
– INFOCOM uses Latent Semantic Indexing (LSI)
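The frequent-word comparison can be illustrated with a much simpler stand-in than LSI. The sketch below uses plain term-frequency cosine similarity; it is not LSI and not the INFOCOM system. The stop-word list and the function names (`term_counts`, `cosine_similarity`, `best_reviewer`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the talk).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "we"}

def term_counts(text):
    """Count lowercase alphabetic words, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_reviewer(paper_text, reviewer_corpora):
    """Return the reviewer whose corpus is most similar, plus all scores."""
    paper = term_counts(paper_text)
    scores = {name: cosine_similarity(paper, term_counts(corpus))
              for name, corpus in reviewer_corpora.items()}
    return max(scores, key=scores.get), scores
```

Because the comparison only ever sees scraped words, changing which words the scraper extracts changes which reviewer scores highest.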
Background Information
• Plagiarism detection systems
– Measure similarity between strings within the subject document and all other documents submitted thus far
• Document indexing
– Search engines return documents based on the similarity of their content to the search string
Content Masking Attack
[Figure: plaintext → cipher → ciphertext]
Content Masking Attack
• “Masking font” – a custom font with some rearrangement of the character/glyph relationship
• Open source tools such as FontForge allow copy/paste of character glyphs within fonts
• Custom fonts may be imported into LaTeX
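To make the character/glyph rearrangement concrete, here is a toy Python model in which a font is just a dictionary from stored characters to displayed glyphs. `MASKING_FONT`, `render`, and `scrape` are hypothetical names; a real masking font is a font file edited with a tool such as FontForge, not a dictionary.

```python
# Toy model: a font maps each stored character to the glyph a human sees.
# MASKING_FONT is a hypothetical example, not a font from the talk.
MASKING_FONT = {"c": "d", "a": "o", "t": "g"}  # stored char -> displayed glyph

def render(text, font):
    """What a human sees: each stored character drawn with the font's glyph."""
    return "".join(font.get(ch, ch) for ch in text)

def scrape(text):
    """What a text scraper sees: the stored characters, ignoring the font."""
    return text

stored = "cat"
seen = render(stored, MASKING_FONT)
```

In this model the reader sees one word while `scrape` returns another, which is the disconnect the attack relies on.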
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• An author can target a specific reviewer by replacing enough keywords in the paper with keywords from the reviewer’s papers
• Keywords – uncommon words that appear most frequently
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Algorithm:
– Order keywords in the subject paper and in the target reviewer’s corpus by descending frequency
– Construct a “word mapping” between these two lists
– Create a “character mapping” between the letters of each pair of words
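The three algorithm steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the input word lists are assumed to be already filtered down to uncommon words, and only equal-length word pairs are handled.

```python
from collections import Counter

def top_keywords(words, n):
    """Return the n most frequent words, most frequent first."""
    return [w for w, _ in Counter(words).most_common(n)]

def build_word_mapping(paper_words, reviewer_words, n):
    """Pair the i-th most frequent paper keyword with the i-th reviewer keyword."""
    return dict(zip(top_keywords(paper_words, n),
                    top_keywords(reviewer_words, n)))

def build_char_mapping(shown_word, hidden_word):
    """Pair each stored (hidden) letter with the glyph it must display.

    Assumes equal word lengths; differing lengths need extra handling
    that this sketch does not attempt.
    """
    if len(shown_word) != len(hidden_word):
        raise ValueError("this sketch only handles words of equal length")
    return list(zip(hidden_word, shown_word))
```

Each `(stored, shown)` pair says which underlying character the scraper should read and which glyph the human reader should see in its place.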
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Challenges:
– One-to-Many Character Mapping
– Word Length Disparity
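The one-to-many challenge arises because a single font can map a stored character to only one glyph, so conflicting pairs have to be spread over several masking fonts. The greedy assignment below is one possible way to do that, assumed here for illustration; `assign_fonts` is a hypothetical name, and word length disparity is not addressed.

```python
def assign_fonts(char_pairs):
    """Greedily place (stored_char, shown_glyph) pairs into fonts.

    A single font can map a stored character to only one glyph, so a
    conflicting pair is pushed to the next font (a new one if needed).
    """
    fonts = []
    for stored, glyph in char_pairs:
        for font in fonts:
            # Reuse this font if the character is unmapped or already
            # mapped to the same glyph.
            if font.get(stored, glyph) == glyph:
                font[stored] = glyph
                break
        else:
            fonts.append({stored: glyph})
    return fonts
```

The number of fonts produced equals the largest number of distinct glyphs that any one stored character must display.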
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Experiment:
– We have reproduced the INFOCOM automatic reviewer assignment system
– This includes 114 TPC members from a well-known security conference and 2094 of their recently published papers for training
– 100 additional papers used as testing data
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Experiment:
– Matching a paper to one reviewer
Similarity scores relative to the number of words masked. Blue stars show the desired matching.
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Experiment:
– Matching a paper to one reviewer
Word masking requirements for all 100 testing papers
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Experiment:
– Matching a paper to one reviewer
Masking font requirements for all 100 testing papers
Content Masking Attack Against Automatic Conference Reviewer Assignment Systems
• Experiment:
– Matching a paper to multiple reviewers
Similarity scores relative to the number of words masked, between a paper and three reviewers. Blue stars, black circles, and green triangles show the desired matchings.
Content Masking Attack Against Plagiarism Detection
• A cheating student can evade a plagiarism detector by replacing the underlying text with gibberish
• Use a “scrambling font” to render the gibberish as legible (plagiarized) text
• Results in zero similarity with existing work
Content Masking Attack Against Plagiarism Detection
• Zero similarity is unrealistic due to common phrases in language
• We evaluate three methods to target a specific similarity score
• Each method chooses what text to scramble and what text to leave unaltered
Content Masking Attack Against Plagiarism Detection
• By letter
– Use a scrambling font which scrambles all characters
– Exempt characters from scrambling in order of their frequency of appearance in the language
– Continue exempting characters until a target similarity score is reached
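The by-letter method can be sketched as below. The scrambling font is modeled as a random permutation of a-z, and the letter frequency order is an approximate English ordering assumed for this example. The outer loop that raises `exempt_count` until a detector reports the target similarity is not modeled, since it depends on an external scoring service. All names are hypothetical.

```python
import random
import string

# Approximate English letter order, most frequent first (an assumption).
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def make_scrambling_font(seed=0):
    """Return {stored_letter: shown_letter}, a random permutation of a-z."""
    shown = list(string.ascii_lowercase)
    stored = shown[:]
    random.Random(seed).shuffle(stored)
    return dict(zip(stored, shown))

def scramble_by_letter(text, font, exempt_count):
    """Store gibberish for every letter except the exempt_count most frequent.

    Exempt letters are stored as themselves (normal font); all other
    letters are stored as whatever the scrambling font displays as them.
    """
    exempt = set(ENGLISH_BY_FREQUENCY[:exempt_count])
    to_stored = {shown: stored for stored, shown in font.items()}
    return "".join(ch if ch in exempt else to_stored.get(ch, ch)
                   for ch in text)
```

Rendering the stored text through the same font recovers the visible text, while a scraper sees only the stored letters.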
Content Masking Attack Against Plagiarism Detection
• By word, in frequency of appearance
– Use a scrambling font which scrambles all characters
– Order distinct words by frequency of appearance
– Apply the scrambling font to all words
– Remove the scrambling font from distinct words until a target similarity score is reached
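A small sketch of the word selection step follows. The slide does not say whether the most or least frequent words are unscrambled first; this example assumes most frequent first. `words_to_scramble` is a hypothetical helper, and the similarity measurement that decides when to stop is left to the caller.

```python
from collections import Counter

def words_to_scramble(words, exempt_count):
    """Return the distinct words that stay scrambled after exempting
    the exempt_count most frequent distinct words (assumed order)."""
    ordered = [w for w, _ in Counter(words).most_common()]
    return set(ordered[exempt_count:])
```

A caller would increase `exempt_count` step by step, regenerate the document, and stop once the measured similarity reaches the target.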
Content Masking Attack Against Plagiarism Detection
• By word, at random
– Use a scrambling font which scrambles all characters
– Iterate over the document, applying the scrambling font at random according to a chosen probability
– Modify the probability until a target similarity score is reached
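The random method has two parts: a per-word coin flip and a search over the probability. The sketch below assumes that similarity falls as more words are scrambled and uses bisection for the search, which is a choice made for this example rather than something stated in the talk. `measure_similarity` is a caller-supplied stand-in for a real detector score.

```python
import random

def choose_scrambled_words(words, probability, seed=0):
    """Return one boolean per word: True means apply the scrambling font."""
    rng = random.Random(seed)
    return [rng.random() < probability for _ in words]

def tune_probability(measure_similarity, target, tolerance=0.01, max_steps=20):
    """Bisect on the probability until the similarity is near the target.

    measure_similarity is a hypothetical callable: probability -> score.
    Assumes the score decreases as the probability increases.
    """
    low, high = 0.0, 1.0
    mid = 0.5
    for _ in range(max_steps):
        mid = (low + high) / 2
        score = measure_similarity(mid)
        if abs(score - target) <= tolerance:
            break
        if score > target:
            low = mid   # too similar: scramble more words
        else:
            high = mid  # not similar enough: scramble fewer words
    return mid
```

With a real detector each call to `measure_similarity` means generating and submitting a document, so the step limit matters in practice.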
Content Masking Attack Against Plagiarism Detection
• Experiment:
– Apply scrambling fonts to 10 published papers and target a 5-15% similarity score as measured by Turnitin
Content Masking Attack Against Document Indexing
• An attacker can place spam or illicit content in PDF documents indexed by search engines
• These PDFs can show ads instead of the legitimate content that users search for
Content Masking Attack Against Document Indexing
• This can be considered a special case of the reviewer assignment system subversion method
• Instead of masking particular words, we are masking the entire document
• However, the masking is not constrained by spaces
Content Masking Attack Against Document Indexing
• The larger number of masked characters requires more masking fonts
• Instead of generating fonts ad hoc, we make one font for each glyph
• ~84 fonts
• Allows for easy automated generation of masked documents
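One reading of the one-font-per-glyph idea is that each font draws a single glyph no matter which character is stored, so masking reduces to picking, for every position, the font of the glyph to display. The sketch below follows that reading. The `font_XXXX` naming scheme and the equal-length restriction are assumptions made for illustration.

```python
def mask_document(shown_text, stored_text):
    """Pair each stored character with the single-glyph font that displays
    the wanted character. Returns a list of (font_name, stored_char) runs."""
    if len(shown_text) != len(stored_text):
        raise ValueError("this sketch assumes both texts have equal length")
    return [(f"font_{ord(shown):04x}", stored)
            for shown, stored in zip(shown_text, stored_text)]

def scrape(runs):
    """What an indexer extracts: the stored characters only."""
    return "".join(stored for _, stored in runs)

def render(runs):
    """What a reader sees: the one glyph each named font can draw."""
    return "".join(chr(int(name.split("_")[1], 16)) for name, _ in runs)
```

Since no per-document font design is needed, generating a masked document becomes a simple loop over characters.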
Content Masking Attack Against Document Indexing
• Experiment
– Used 5 well-known published papers
– Masked each as gibberish
Content Masking Attack Against Document Indexing
• Experiment
– Submitted them to leading search engines for indexing (Google, Bing, Yahoo!, DuckDuckGo)
– Results were the same for all test documents
Content Masking Attack Against Document Indexing
• Experiment

Search Engine   Indexed Papers   Attack Successful   Evades Spam Detection   Not Later Removed
Google          ✔                ✘                   ✘                       ✘
Bing            ✔                ✔                   ✔                       ✔
Yahoo!          ✔                ✔                   ✘ → ✔                   ✔
DuckDuckGo      ✔                ✔                   ✔                       ✔
Content Masking Defense
• One feasible defense: perform Optical Character Recognition (OCR) on the document to check the integrity of each character.
• Problem:
– High computational overhead
– High false positive rate
[Figure: full document, 50,000-75,000 characters]
Content Masking Defense – Our proposal
• Render each character in the fonts embedded in the subject PDF file and perform OCR on those rendered characters rather than on the rendered PDF file itself.
• Saves processing time
[Figure: embedded fonts, 100-2000 characters, versus the full document, 50,000-75,000 characters]
Challenges and Technical Details
• Challenge 1: The whole font file is embedded
– It can contain up to 2^16 = 65,536 characters
– This causes high computational overhead
• Solution: Scan the document to extract the characters used, and perform OCR on the series of characters used in each font.
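The scan step can be pictured as grouping the document's text runs by font and keeping only the distinct characters each font actually draws. The sketch below assumes the document has already been parsed into `(font_name, text)` runs; that input format and the function name are assumptions, and no PDF parsing is shown.

```python
from collections import defaultdict

def characters_used_per_font(runs):
    """Collect the distinct characters each font is actually used to draw.

    runs is an assumed input format: an iterable of (font_name, text).
    """
    used = defaultdict(set)
    for font_name, text in runs:
        used[font_name].update(text)
    return {font: sorted(chars) for font, chars in used.items()}
```

Only these characters need to be rendered and passed to OCR, instead of all 65,536 possible entries of each embedded font.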
Challenges and Technical Details
• Challenge 2: Special characters
[Figure: the glyph þ (Unicode 0xfe) is read by OCR as p (Unicode 0x70); the Unicode mismatch raises a false alarm]
Challenges and Technical Details
• Solution: Font Training
1. Perform OCR on the font and list all similar characters.
2. If the detected glyph is in the similar character list, replace the character’s Unicode value with that of the normal letter it looks like.
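The effect of the similar character list can be sketched as a lookup applied before comparing the stored code with the OCR result. The four entries below are the ones shown on the whitelist slide of this talk. Normalizing both sides of the comparison, and the names `normalize` and `is_masked`, are assumptions of this sketch.

```python
# Similar-looking character -> normal letter, using entries from the talk.
SIMILAR_TO_NORMAL = {0xE3: 0x61, 0x267: 0x68, 0x460: 0x57, 0xFE: 0x70}

def normalize(code_point):
    """Map a look-alike character to the normal letter it resembles."""
    return SIMILAR_TO_NORMAL.get(code_point, code_point)

def is_masked(stored_code, ocr_code):
    """Flag a character only if the stored and OCR codes still differ
    after both are normalized (an assumption of this sketch)."""
    return normalize(stored_code) != normalize(ocr_code)
```

With this lookup, a stored þ (0xfe) that OCR reads as p (0x70) no longer counts as a mismatch, while a genuine substitution still does.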
Font Training
[Figure: þ (Unicode 0xfe) is in the list, so its Unicode value is changed to 0x70]
Whitelist
Similar character   Normal letter
ã 0xe3              a 0x61
ɧ 0x267             h 0x68
Ѡ 0x460             W 0x57
…                   …
þ 0xfe              p 0x70
…                   …
Font Verification Performance
• Experiment 1
– To analyze the accuracy of our Font Verification method and the Whole Document OCR method
– Generated 10 PDF files with masked characters varying from 5-20% in frequency of appearance
Performance – Experiment 1
Font Verification Performance
• Experiment 2
– To analyze the effects of document length on the detection rate for each method.
– Generated 10 PDF files ranging from 1-10 pages in length and having an even 30% distribution of masked characters
Performance – Experiment 2
Font Verification Performance
• Experiment 3
– To analyze the effect of document length on the detection time for each method
– Generated 20 PDF files ranging from 1-20 pages in length and having a 30% distribution of masked characters
Performance – Experiment 3
Conclusion
• We describe a new content masking attack against the Adobe PDF standard
• We create and evaluate algorithms for effectively performing attacks against:
– Automatic reviewer assignment systems
– Plagiarism detection
– Document indexing
• We create and evaluate a font verification algorithm that is more accurate and lightweight than OCR
Thank you!
• Questions?
PDF file image from http://iconbug.com/detail/icon/5940/file-format-pdf/
TrueType font file image from https://typography.guru/journal/opentype-myths-explained-r24/