authorship attribution approaches for the lithuanian ... · authorship attribution (aa) aa –a...
TRANSCRIPT
J U R G I TA KAPOČIŪTĖ -DZIKIENĖ
Authorship Attribution Approaches for the
Lithuanian Language Using Internet
Comments
Authorship Attribution (AA)
AA – a task of identifying who from a set of candidate
authors is an actual author of a given anonymous text
Based on existing “human stylome” notion
(Van Halteren, 2005)
AA is one of the oldest problems, highly topical nowadays
Applications
Forensics Security
Electronic
commerceLiterary
applications
PAST NOW
Related works (1/2)
The research is done with:
e-mails, web forum messages, online chats, Internet blogs, tweets
/short texts, non-normative language/
hundreds (Luyckx and Daelemans, 2008) or thousands authors
(Koppel et al., 2011); (Narayanan et al., 2012)
/larger candidate author sets/
Related works (2/2)
Concept of idiolect for the first time discussed in 1971
(Pikčilingis, 1971)
Lots of descriptive linguistic works
(e.g. Žalkauskaitė, 2012; Venčkauskas et al., 2015)
AA research with:
Reference Texts Authors Methods
(Kapočiūtė-Dzikienė et al., 2015) Parliamentary transcripts
forum posts
100 Machine learning
(Kapočiūtė-Dzikienė et al., 2015a) Fiction texts 3; 5; 10; 20;50;100 Machine learning
(Kapočiūtė-Dzikienė et al., 2015b) Internet comments 1,000 Similarity-based
The corpus*
Internet comments from www.delfi.lt (topics “in Lithuania” and “Abroad”) in January 2015 – August 2015
To get clean corpus none of these authors were included: if the same pseudonym was used for writing from different IP addresses (dynamic IP
problem)
if different pseudonyms were used under the same IP address
Replies, meta-information and non-Lithuanian alphabetic letters (except punctuation marks and digits) were filtered out
No texts shorter than 30 symbols (excluding white-spaces)
Texts with out-of-vocabulary words, missing diacritics, etc.
/spoken non-normative Lithuanian language/
* At: http://dangus.vdu.lt/~jkd/wp-content/uploads/2015/04/INT_KOMENTARAI_INDV2.7z
Corpus statistics
Number of authors 10 100 1,000
Number of texts 14,443 63,131 155,078
Number of tokens (words/digits) 289,462 1,511,823 4,068,231
Avg. text length in tokens 20.042 23.947 26.233
Random baseline 0.001 0.002 0.003
Majority baseline 0.018 0.018 0.018
Main research directions
Methods: Machine Learning (ML): Naïve Bayes – NB; Support Vector Machine – SVM)
Similarity-Based – SB (cosine similarity measure)
Feature types: lex – lexical
lem – lemmas – using Lemuoklis (Zinkevičius, 20000)
chr4 – word-level character tetra-grams
Dimensionality reduction techniques: Entire feature set
Feature ranking (using chi-squared) and TopN (N=30 thousand)
Random feature set of N features in K iterations (K=20) – RFS (offered by Koppel et al., 2011)
Experimental setup
Stratified dataset
80%/20% for training/testing, respectively
Calculated accuracies/f-score values
McNemar’s (McNemar, 1947) test for statistical
significance ( = 0.05)
10 candidate authors*0
.61
1
0.6
08
0.6
92
0.5
82
0.6
98
0.7
10
0.7
26
0.6
91
0.7
58
0.6
75 0.7
44
0.7
44
0.2
26 0
.30
2
0.2
08
0.2
17
0.1
75 0
.24
9
0.3
66
0.2
17
0.1
84
0.5
90
0.5
90
0.5
61
0.5
66
0.6
81
0.6
94
0.7
09
0.6
72
0.6
77
0.6
58 0
.73
0
0.7
28
0.2
11 0
.27
4
0.1
86
0.1
92
0.1
64 0
.24
3
0.3
33
0.1
80
0.1
70
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
lex lem chr4 lex lem chr4 lex lem chr4 lex lem chr4
NBM SVM SB-TopN SB-RFS
* Accuracy and f-score in white and gray columns, respectively.
The first two columns – results on the entire feature set; the second two – on N=30,000 features
The dashed line indicates higher of random and majority baselines
differences are statistically significant
100 candidate authors0
.21
3
0.1
85
0.1
92
0.1
77
0.2
58
0.2
74
0.4
35
0.3
31
0.4
13
0.3
46
0.4
67
0.4
25
0.0
83
0.1
04
0.0
94
0.0
54 0
.09
8
0.0
42
0.0
60
0.0
58
0.0
42
0.1
47
0.1
21
0.1
28
0.1
14
0.1
84
0.1
99
0.4
14
0.3
13
0.3
97
0.3
29
0.4
40
0.3
95
0.0
70
0.0
95
0.0
80
0.0
39
0.0
98
0.0
34
0.0
44
0.0
44
0.0
34
0.0
0.1
0.2
0.3
0.4
0.5
lex lem chr4 lex lem chr4 lex lem chr4 lex lem chr4
NBM SVM SB-TopN SB-RFS
differences are statistically significant
1,000 candidate authors0
.10
3
0.0
92
0.0
75
0.0
61
0.0
97
0.0
84
0.2
39
0.1
91
0.2
35
0.1
94
0.2
67
0.2
28
0.0
34
0.0
24
0.0
25
0.0
19 0.0
39
0.0
26
0.0
28
0.0
17
0.0
25
0.0
25
0.0
24
0.0
15
0.0
22
0.0
23
0.0
20
0.1
54
0.1
37 0.1
56
0.1
43
0.1
80
0.1
85
0.0
29
0.0
20
0.0
21
0.0
16
0.0
32
0.0
22
0.0
24
0.0
15
0.0
21
0.0
0.1
0.2
0.3
lex lem chr4 lex lem chr4 lex lem chr4 lex lem chr4
NBM SVM SB-TopN SB-RFS
differences are not statistically significant
Summary of results
Not all results exceed random and majority baselines:
Lemmatizer is not adjusted to cope with the non-normative language
ML > SB
chr4 > lex > lem (on the larger author sets)
Entire feature set > RFS (N = 30,000; K = 20) > TopN (N = 30,000)
Conclusions and future work
Conclusions: The first comparative AA results (in terms of methods, feature
types, dimensionality reduction techniques) for the Lithuanian language using:
Internet comments
a big author set (i.e., 1,000 candidate authors)
The best results (especially on the larger author sets) with ML + chr4
Future work: Error analysis and improvements
Expanded number of candidate authors
Different types of non-normative texts (e.g., social networks data)
Researchers who also worked on this topic
Ligita Šarkutė
KUT
Andrius Utka
VMU
Robertas Damaševičius
KUT
Algimantas Venčkauskas
KUT
Lithuanian Cybercrime Centre of
Excellence for Training,
Research and Education
(HOME/2013/ISEC/AG/INT/400005176)
Automatic extraction of style applied to
individual authors and groups of authors
(ASTRA) (LIT-8-69)
http://dangus.vdu.lt/~jkd/eng/
Thank you for your attention!
j u rg i ta . k . dz@g mai l . com