N-gram Overlap in Automatic Detection of Document Derivation

Upload: meira

Post on 09-Feb-2016

N-gram Overlap in Automatic Detection of Document Derivation

Siniša Bosanac, Vanja Štefanec
{sbosanac;vstefane}@ffzg.hr

Department of Information Sciences
Faculty of Humanities and Social Sciences
University of Zagreb

Introduction

• problems of originality and authenticity
• increased usage of ICT intensified the problem
• Derivation
  – relationship between two documents in which the source document was used in creating the target document
• Text reuse
  – process by which content from a source document is reused in the creation of a target document
    • word for word
    • paraphrase

Examples of content derivation

• desirable:
  1) quoting
  2) document updating
  3) relaying the sponsored content in news media
  4) automatic and manual summarization
• undesirable:
  1) plagiarism
  2) non-critical relaying of sponsored content in news media
  3) journalistic theft

N-gram overlap

• overlapping of n successive linguistic units

• overlapping in longer n-grams can be an indicator of derivation

• representative n-gram length is language-specific

• fast, robust, low-complexity method
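A minimal sketch of what "overlapping of n successive linguistic units" can look like in practice, assuming the units are lowercased, whitespace-separated words. The tokenization and the function name `ngram_types` are illustrative assumptions, not details taken from the slides.

```python
def ngram_types(text, n):
    """Return the set of distinct word n-grams (types) found in `text`."""
    # Assumption: naive lowercasing and whitespace tokenization.
    tokens = text.lower().split()
    # A text shorter than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


source = "the quick brown fox jumps over the lazy dog"
target = "a quick brown fox jumps over a sleeping dog"

# Trigrams that occur in both texts.
shared = ngram_types(source, 3) & ngram_types(target, 3)
for gram in sorted(shared):
    print(" ".join(gram))
```

The two sentences share many single words, but only three trigrams, which illustrates why overlap in longer n-grams is a stronger hint of derivation than overlap in single words.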

Measure

• resemblance
• Jaccard similarity coefficient
• n-gram types, not tokens

$$ r(A, B) = \frac{|F(A) \cap F(B)|}{|F(A) \cup F(B)|} $$
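The resemblance measure can be sketched as a Jaccard coefficient over sets of n-gram types: the number of types shared by both documents divided by the number of distinct types occurring in either. The whitespace tokenization and the function names below are assumptions made for illustration only.

```python
def ngram_types(tokens, n):
    """Set of distinct n-grams (types, not tokens) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(text_a, text_b, n):
    """Jaccard similarity of the n-gram type sets of two texts."""
    # Assumption: lowercased, whitespace-separated words as units.
    types_a = ngram_types(text_a.lower().split(), n)
    types_b = ngram_types(text_b.lower().split(), n)
    union = types_a | types_b
    if not union:  # both texts shorter than n tokens
        return 0.0
    return len(types_a & types_b) / len(union)


a = "the quick brown fox jumps over the lazy dog"
b = "a quick brown fox jumps over a sleeping dog"
print(resemblance(a, b, 3))  # 3 shared trigram types out of 11 distinct ones
```

Because sets are used, a trigram repeated many times in one document counts once, which is the "types, not tokens" distinction.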

Methodology

• building the text collection
• performing measurements
• manual classification of compared pairs into derived and non-derived
• ranking of pairs based on resemblance score

Text collection

• total number of documents: 236
• sources:
  • digital repository of the Library of the Faculty of Humanities and Social Sciences
  • Web news sites
  • other Web sources
• document size: 69 – 34,397 tokens

Text collection

• types of documents:
  • 39 diploma papers
  • 42 scientific articles
  • 61 news articles
  • 61 literary columns
  • 35 documents classified as “other”
• topic:
  • library science
  • psychology

Text classification

• topic determined according to:
  • source
  • document title
  • keywords
  • content
• functional style determined according to:
  • classification at source
  • type of document

Measurements

• using a purpose-built program module

• comparison of each document against every other

• comparison according to n-gram length from 1 to 10

• calculating the resemblance score

• No. of derivation pairs: 28,203
  • derived: 256
  • non-derived: 27,938
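The measurement step can be sketched as follows: every unordered pair of documents is compared once, for each n-gram length from 1 to 10. This is not the authors' program module; the function names and the toy documents are hypothetical. Comparing k documents this way gives k(k-1)/2 pairs, and 28,203 is the value of that formula for k = 238.

```python
from itertools import combinations
from math import comb


def ngram_types(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_all(docs, max_n=10):
    """docs maps a document id to its token list.

    Returns {(id_a, id_b): {n: resemblance}} for every unordered pair.
    """
    lengths = range(1, max_n + 1)
    # Build each document's n-gram type sets once, then reuse them.
    types = {doc_id: {n: ngram_types(tokens, n) for n in lengths}
             for doc_id, tokens in docs.items()}
    return {(a, b): {n: jaccard(types[a][n], types[b][n]) for n in lengths}
            for a, b in combinations(sorted(docs), 2)}


# Toy collection (hypothetical data).
docs = {"d1": "a b c d e".split(),
        "d2": "a b c x y".split(),
        "d3": "p q r s t".split()}
scores = compare_all(docs, max_n=3)
print(len(scores), comb(len(docs), 2))
print(scores[("d1", "d2")])
```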

Choosing the most representative n-gram

• finding the F1 maximum on the level of each n-gram

• comparing the F1 measure maxima across all n-gram levels

• determining the resemblance threshold
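The selection procedure above can be sketched like this: for each n-gram length, find the resemblance threshold that maximizes F1 against the manual labels, then pick the n whose maximum is highest. Treating a pair as "derived" when its score is at or above the threshold is an assumption, as are the threshold grid, the function names and the toy data.

```python
def prf(scores, labels, threshold):
    """Precision, recall and F1 when pairs scoring >= threshold count as derived."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(l and not p for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def most_representative_n(scores_by_n, labels, thresholds):
    """Return (n, (threshold, F1)) for the n-gram length with the highest F1 maximum."""
    best = {}
    for n, scores in scores_by_n.items():
        # F1 maximum at the level of this n-gram length.
        t = max(thresholds, key=lambda th: prf(scores, labels, th)[2])
        best[n] = (t, prf(scores, labels, t)[2])
    n_star = max(best, key=lambda n: best[n][1])
    return n_star, best[n_star]


# Toy data (hypothetical): three derived and three non-derived pairs.
labels = [True, True, True, False, False, False]
scores_by_n = {
    1: [0.9, 0.8, 0.7, 0.75, 0.6, 0.5],   # unigrams separate the classes poorly
    3: [0.6, 0.5, 0.4, 0.1, 0.05, 0.0],   # trigrams separate them cleanly
}
n, (threshold, f1) = most_representative_n(scores_by_n, labels, [0.2, 0.4, 0.6, 0.8])
print(n, threshold, f1)
```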

Derived pairs – trigrams

[Figure: resemblance of each derived pair, trigrams. x-axis: derivation pair No; y-axis: resemblance [%], scale 0 to 0.5]

Non-derived pairs – trigrams

[Figure: resemblance of each non-derived pair, trigrams. x-axis: derivation pair No; y-axis: resemblance [%], scale 0 to 0.5]

Trigrams – F1, precision, recall

resemblance   0.25   0.3    0.35   0.4    0.45   0.5
F1-measure    0.417  0.517  0.583  0.633  0.656  0.651
precision     0.272  0.370  0.457  0.535  0.602  0.634
recall        0.890  0.852  0.803  0.773  0.720  0.667

[Figure: F1-measure, precision and recall for trigrams as a function of the resemblance threshold. x-axis: resemblance [%], 0.2 to 0.55; y-axis scale: 0.2 to 1]
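As a quick consistency check, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall values in the trigram table. Since those inputs are rounded to three decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (resemblance threshold, precision, recall, reported F1) from the trigram table.
rows = [
    (0.25, 0.272, 0.890, 0.417),
    (0.30, 0.370, 0.852, 0.517),
    (0.35, 0.457, 0.803, 0.583),
    (0.40, 0.535, 0.773, 0.633),
    (0.45, 0.602, 0.720, 0.656),
    (0.50, 0.634, 0.667, 0.651),
]

for threshold, p, r, reported in rows:
    print(f"{threshold:.2f}  computed={f1_measure(p, r):.3f}  reported={reported:.3f}")
```

Precision rises and recall falls as the threshold increases, and F1 peaks where the two are best balanced, here at a threshold of 0.45.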

General results

n-gram       1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
F1-measure   0.465   0.524   0.656   0.735   0.771   0.820   0.814   0.807   0.777   0.718

[Figure: F1-measure by n-gram length, 1-gram to 10-gram. y-axis: F1-measure, scale 0.3 to 0.9]

Function words removed

n-gram       1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
F1-measure   0.487   0.611   0.704   0.754   0.801   0.803   0.748   0.712   0.661   0.63

[Figure: F1-measure by n-gram length, 1-gram to 10-gram, for two series: general and no function words. y-axis: F1-measure, scale 0.3 to 0.9]
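One way to run the comparison without function words is to filter them out of the token stream before building n-grams, so that the remaining content words become adjacent. Whether the study filtered before or after forming n-grams is not stated; this sketch assumes filtering first. The word list is a small illustrative sample of Croatian function words, not the list used by the authors.

```python
# Illustrative sample only; NOT the function word list used in the study.
FUNCTION_WORDS = {"i", "u", "na", "je", "se", "da", "za", "od"}


def content_tokens(text):
    """Lowercase, split on whitespace, and drop function words."""
    return [t for t in text.lower().split() if t not in FUNCTION_WORDS]


def ngram_types(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


tokens = content_tokens("Knjiga je na stolu u sobi")
print(tokens)
print(sorted(ngram_types(tokens, 2)))
```

With filtering, a given n spans more of the original text, which may explain why the scores in the table fall off faster for long n-grams.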

Functional styles

• five functional styles in Croatian
  • scientific
  • official
  • newspaper and publicistic
  • literary
  • colloquial
• need to differentiate between functional styles?

Differentiating functional styles

n-gram                      1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
scientific                  0.545   0.589   0.762   0.840   0.846   0.855   0.854   0.846   0.835   0.787
publicistic and newspaper   0.622   0.667   0.696   0.691   0.727   0.810   0.754   0.719   0.536   0.423

[Figure: F1-measure by n-gram length, 1-gram to 10-gram, for two series: scientific, and publicistic and newspaper. y-axis: F1-measure, scale 0.3 to 0.9]
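The reported F1 values can be compared across settings with a few lines of code. The numbers below are copied from the result tables in this presentation; the dictionary layout and the helper `best_n` are illustrative.

```python
# F1 per n-gram length (n = 1..10), copied from the result tables.
REPORTED_F1 = {
    "general": [0.465, 0.524, 0.656, 0.735, 0.771, 0.820, 0.814, 0.807, 0.777, 0.718],
    "no function words": [0.487, 0.611, 0.704, 0.754, 0.801, 0.803, 0.748, 0.712, 0.661, 0.63],
    "scientific": [0.545, 0.589, 0.762, 0.840, 0.846, 0.855, 0.854, 0.846, 0.835, 0.787],
    "publicistic and newspaper": [0.622, 0.667, 0.696, 0.691, 0.727, 0.810, 0.754, 0.719, 0.536, 0.423],
}


def best_n(f1_by_n):
    """Return (n, F1) for the n-gram length with the highest F1."""
    n = max(range(1, len(f1_by_n) + 1), key=lambda k: f1_by_n[k - 1])
    return n, f1_by_n[n - 1]


for setting, values in REPORTED_F1.items():
    n, f1 = best_n(values)
    print(f"{setting}: best n = {n}, F1 = {f1}")
```

In all four settings the maximum falls at n = 6, although for scientific texts the curve is nearly flat from 4-grams to 9-grams while for publicistic and newspaper texts it drops sharply after 8-grams.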

Conclusion

• the first research of this kind performed on texts in the Croatian language

• 6-grams were shown to be the most representative

• final parameters depend on intended application

Further research

• enlarge and refine the text collection
• experiment with different kinds of text editing
  • POS tagging
  • extracting hapax legomena, stop words, labels, direct quotes
• focus on a different level of text
  • characters
  • sentences
  • paragraphs

Possible applications

• determining document originality
  • protection of intellectual property
  • plagiarism detection
• informetric research
  • information dissemination analysis
  • citation analysis

Thank you for your attention!