authorship attribution approaches for the lithuanian ... · authorship attribution (aa) aa –a...

JURGITA KAPOČIŪTĖ-DZIKIENĖ

Authorship Attribution Approaches for the Lithuanian Language Using Internet Comments

Authorship Attribution (AA)

AA is the task of identifying which of a set of candidate authors is the actual author of a given anonymous text

Based on the existing notion of a “human stylome” (Van Halteren, 2005)

AA is one of the oldest problems and remains highly topical today

Applications

Forensics

Security

Electronic commerce

Literary applications

(shown on a timeline from PAST to NOW)

Related works (1/2)

The research is done with:

e-mails, web forum messages, online chats, Internet blogs, tweets /short texts, non-normative language/

hundreds (Luyckx and Daelemans, 2008) or thousands of authors (Koppel et al., 2011; Narayanan et al., 2012) /larger candidate author sets/

Related works (2/2)

The concept of idiolect was first discussed in 1971 (Pikčilingis, 1971)

Lots of descriptive linguistic works (e.g. Žalkauskaitė, 2012; Venčkauskas et al., 2015)

AA research with:

Reference                            Texts                                    Authors                 Methods
(Kapočiūtė-Dzikienė et al., 2015)    Parliamentary transcripts, forum posts   100                     Machine learning
(Kapočiūtė-Dzikienė et al., 2015a)   Fiction texts                            3; 5; 10; 20; 50; 100   Machine learning
(Kapočiūtė-Dzikienė et al., 2015b)   Internet comments                        1,000                   Similarity-based

The corpus*

Internet comments from www.delfi.lt (topics “In Lithuania” and “Abroad”), January 2015 – August 2015

To get a clean corpus, an author was excluded:

if the same pseudonym was used for writing from different IP addresses (dynamic IP problem)

if different pseudonyms were used under the same IP address

Replies, meta-information and non-Lithuanian alphabetic letters were filtered out (punctuation marks and digits were kept)

No texts shorter than 30 symbols (excluding whitespace)

Texts contain out-of-vocabulary words, missing diacritics, etc. /spoken non-normative Lithuanian language/
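The cleaning rules above could be sketched roughly as follows. This is only an illustration, not the author's pipeline: the record layout `(pseudonym, ip, text)`, the function names, and the exact set of punctuation marks kept are all assumptions.

```python
import re

# The 32 letters of the Lithuanian alphabet (lower case).
LT_LETTERS = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"
PUNCT = ".,;:!?()\"'-"   # assumed punctuation set; the slide does not list it
MIN_SYMBOLS = 30         # threshold stated on the slide

# Everything except Lithuanian letters, punctuation, digits and whitespace.
_NOT_ALLOWED = re.compile(
    "[^" + re.escape(LT_LETTERS + LT_LETTERS.upper() + PUNCT) + r"0-9\s]"
)

def clean_text(text):
    """Drop characters outside the Lithuanian alphabet, digits and punctuation."""
    return _NOT_ALLOWED.sub("", text)

def long_enough(text):
    """True if the text has at least MIN_SYMBOLS non-whitespace symbols."""
    return sum(1 for ch in text if not ch.isspace()) >= MIN_SYMBOLS

def consistent_authors(records):
    """Return pseudonyms tied to exactly one IP that no other pseudonym shares.

    records: iterable of (pseudonym, ip, text) tuples (hypothetical layout).
    """
    ips_by_name, names_by_ip = {}, {}
    for name, ip, _ in records:
        ips_by_name.setdefault(name, set()).add(ip)
        names_by_ip.setdefault(ip, set()).add(name)
    return {
        name
        for name, ips in ips_by_name.items()
        if len(ips) == 1 and len(names_by_ip[next(iter(ips))]) == 1
    }
```

A comment would then be kept only if its author is in `consistent_authors(...)` and `long_enough(clean_text(text))` holds.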

* At: http://dangus.vdu.lt/~jkd/wp-content/uploads/2015/04/INT_KOMENTARAI_INDV2.7z

http://www.delfi.lt/

Corpus statistics

Number of authors                 10        100         1,000
Number of texts                   14,443    63,131      155,078
Number of tokens (words/digits)   289,462   1,511,823   4,068,231
Avg. text length in tokens        20.042    23.947      26.233
Random baseline                   0.001     0.002       0.003
Majority baseline                 0.018     0.018       0.018

Main research directions

Methods:

Machine Learning (ML): Naïve Bayes (NB) and Support Vector Machine (SVM)

Similarity-Based (SB): cosine similarity measure

Feature types:

lex – lexical

lem – lemmas, obtained with Lemuoklis (Zinkevičius, 2000)

chr4 – word-level character tetra-grams
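As an illustration of the chr4 feature type, the sketch below extracts character tetra-grams that never cross a word boundary. Lower-casing, whitespace tokenisation, and keeping words shorter than four characters whole are assumptions; the slides do not specify these details.

```python
def word_char_ngrams(text, n=4):
    """Return character n-grams taken inside each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        if len(word) < n:
            # Assumption: a word shorter than n is kept as a single feature.
            grams.append(word)
        else:
            # Slide a window of width n over the word only.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

For example, `word_char_ngrams("Labas rytas")` yields `laba`, `abas`, `ryta`, `ytas`, and no n-gram spans the space between the two words.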

Dimensionality reduction techniques:

Entire feature set

Feature ranking (using chi-squared) and TopN (N = 30,000)

Random feature set (RFS) of N features in K iterations (K = 20), proposed by Koppel et al. (2011)
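The RFS idea combined with cosine similarity can be sketched as below, in the spirit of Koppel et al. (2011): in each of K iterations a random subset of N features is drawn, the most similar candidate author gets a vote, and the author with the most votes wins. Representing texts and author profiles as raw count dictionaries, the function names, and the tie-breaking behaviour are assumptions, not details from the slides.

```python
import math
import random
from collections import Counter

def cosine(u, v, features):
    """Cosine similarity of two count dictionaries restricted to `features`."""
    dot = sum(u.get(f, 0) * v.get(f, 0) for f in features)
    norm_u = math.sqrt(sum(u.get(f, 0) ** 2 for f in features))
    norm_v = math.sqrt(sum(v.get(f, 0) ** 2 for f in features))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rfs_attribute(text_vec, author_vecs, n_features, k_iterations=20, seed=0):
    """Vote for the closest author over K random feature subsets of size N."""
    rng = random.Random(seed)
    vocabulary = sorted({f for vec in author_vecs.values() for f in vec})
    votes = Counter()
    for _ in range(k_iterations):
        subset = rng.sample(vocabulary, min(n_features, len(vocabulary)))
        # Ties (e.g. all similarities 0) go to the first author in dict order.
        best = max(author_vecs,
                   key=lambda a: cosine(text_vec, author_vecs[a], subset))
        votes[best] += 1
    return votes.most_common(1)[0][0]
```

With K = 20 and N = 30,000, as on the slide, the call would be `rfs_attribute(text_vec, author_vecs, n_features=30000, k_iterations=20)`.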

Experimental setup

Stratified dataset

80%/20% split for training/testing, respectively

Accuracy and f-score values were calculated

McNemar’s test (McNemar, 1947) for statistical significance (α = 0.05)
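A comparison of two classifiers on the same test set with McNemar's test might look like the sketch below. It uses the chi-squared approximation with continuity correction. The slides do not say which variant of the test was applied, so this choice and the function name are assumptions.

```python
import math

def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for McNemar's test with continuity correction."""
    # Discordant pairs: only one of the two classifiers is correct.
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

The difference between the two classifiers would be called statistically significant when `p_value < 0.05`.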

10 candidate authors*

[Figure: bar chart of results for NBM, SVM, SB-TopN and SB-RFS, each with the lex, lem and chr4 feature types; y-axis from 0.0 to 0.8]

* Accuracy and f-score are shown in white and gray columns, respectively.

The first two columns show results on the entire feature set; the second two, on N = 30,000 features

The dashed line indicates the higher of the random and majority baselines

Differences are statistically significant

100 candidate authors

[Figure: bar chart of results; y-axis from 0.0 to 0.5]

Differences are statistically significant

1,000 candidate authors

[Figure: bar chart of results; y-axis from 0.0 to 0.3]

Differences are not statistically significant

Summary of results

Not all results exceed random and majority baselines:

Lemmatizer is not adjusted to cope with the non-normative language

ML > SB

chr4 > lex > lem (on the larger author sets)

Entire feature set > RFS (N = 30,000; K = 20) > TopN (N = 30,000)

Conclusions and future work

Conclusions:

The first comparative AA results (in terms of methods, feature types, dimensionality reduction techniques) for the Lithuanian language using:

Internet comments

a large author set (i.e., 1,000 candidate authors)

The best results (especially on the larger author sets) were achieved with ML + chr4

Future work:

Error analysis and improvements

Expanded number of candidate authors

Different types of non-normative texts (e.g., social network data)

Researchers who also worked on this topic

Ligita Šarkutė (KUT)

Andrius Utka (VMU)

Robertas Damaševičius (KUT)

Algimantas Venčkauskas (KUT)

Lithuanian Cybercrime Centre of Excellence for Training, Research and Education (HOME/2013/ISEC/AG/INT/400005176)

Automatic extraction of style applied to individual authors and groups of authors (ASTRA) (LIT-8-69)

http://dangus.vdu.lt/~jkd/eng/


Thank you for your attention!

jurgita.k.dz@gmail.com
