bleaching text: abstract features for cross-lingual gender ...cross-lingual gender prediction train:...
TRANSCRIPT
![Page 1: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/1.jpg)
Bleaching Text: Abstract Features for Cross-lingual
Gender Prediction.
Rob van der Goot, Nikola Ljubesic, Ian Matroos,Malvina Nissim & Barbara Plank
![Page 2: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/2.jpg)
Bleaching Text: Abstract Features for Cross-lingual
Gender Prediction.
Rob van der Goot, Nikola Ljubesic, Ian Matroos,Malvina Nissim & Barbara Plank
![Page 3: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/3.jpg)
Gender Prediction
The task of predicting gender based only on text.
![Page 4: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/4.jpg)
Gender Prediction
2000 2018
Perf
orm
ance
Open Vocabulary
![Page 5: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/5.jpg)
Gender Prediction
2000 2018
Featu
res
Open Vocabulary
![Page 6: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/6.jpg)
Gender Prediction
2000 2018
Modelin
g D
ata
sets
Open Vocabulary
![Page 7: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/7.jpg)
Gender Prediction
SVM with word/char n-grams performs best!
![Page 8: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/8.jpg)
Gender Prediction
SVM with word/char n-grams performs best!
I Winner PAN 2017 shared task on author profiling:
I Words: 1-2 grams
I Characters: 3-6 grams
![Page 9: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/9.jpg)
https://www.brewbound.com/news/power-hour-craft-beer-growth-opportunity-lies-female-consumers
https://www.craftbrewingbusiness.com/news/survey-women-drinking-beer-men-drinking-less/
https://www.nzherald.co.nz/business/news/article.cfm?c_id=3&objectid=11802831
![Page 10: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/10.jpg)
However, how would this lexicalized approach work acrossdifferent:
I time-spans
I domains
I languages???
![Page 11: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/11.jpg)
Cross-lingual Gender Prediction
I Train a model on source language(s) and evaluate ontarget language.
![Page 12: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/12.jpg)
Cross-lingual Gender Prediction
I Dataset: TwiSty corpus (Verhoeven et al., 2016) +English
I 200 tweets per user, 850 - 8,112 users per language
![Page 13: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/13.jpg)
Cross-lingual Gender Prediction
Train:
FR EN NL PT ESTest Language
50
60
70
80
90
Acc
ura
cy
![Page 14: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/14.jpg)
Cross-lingual Gender Prediction
USER Jaaa moeten we zeker doen
![Page 15: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/15.jpg)
Bleaching Text
![Page 16: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/16.jpg)
Bleaching Text
![Page 17: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/17.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
![Page 18: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/18.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
Freq 0 5 2 5 0 5 1 0
![Page 19: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/19.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04
![Page 20: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/20.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!
![Page 21: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/21.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!PunctA w w w w w w wp jjjj
![Page 22: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/22.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!PunctA w w w w w w wp jjjjShape ull l ll ll ull ll llx xx
![Page 23: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/23.jpg)
Bleaching Text
Original Massacred a bag of Doritos for lunch!
Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!PunctA w w w w w w wp jjjjShape ull l ll ll ull ll llx xxVowels cvccvccvc v cvc vc cvcvcvc cvc cvccco oooo
![Page 24: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/24.jpg)
Bleaching Text
I No tokenization
I Replace usernames and URLs
I Use concatenation of the bleached representations
I Tuned in-language
I 5-grams perform best
![Page 25: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/25.jpg)
Bleaching Text
Train:
FR EN NL PT ESTest Language
50
60
70
80
90
Acc
ura
cy
Lexicalized
Bleached
![Page 26: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/26.jpg)
Bleaching Text
Trained on all other languages:
EN NL FR PT ESTest Language
50
60
70
80A
ccura
cy
Lexicalized
Bleached
![Page 27: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/27.jpg)
Bleaching Text
Most predictive features
Male Female
1 W W W W ”W” USER E W W W2 W W W W ? 3 5 1 5 23 2 5 0 5 2 W W W W4 5 4 4 5 4 E W W W W5 W W, W W W? LL LL LL LL LX6 4 4 2 1 4 LL LL LL LL LUU7 PP W W W W W W W W *-*8 5 5 2 2 5 W W W W JJJ9 02 02 05 02 06 W W W W &W;W
10 5 0 5 5 2 J W W W W
![Page 28: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/28.jpg)
Human Experiments
I Are humans able to predict gender based only on text forunknown languages?
![Page 29: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/29.jpg)
Human Experiments
I 20 tweets per user (instead of 200)
I 6 annotators per language pair
I Each annotating 100 users
I 200 users per language pair, so 3 predictions per user
![Page 30: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/30.jpg)
Human Experiments
I 20 tweets per user (instead of 200)
I 6 annotators per language pair
I Each annotating 100 users
I 200 users per language pair, so 3 predictions per user
![Page 31: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/31.jpg)
Human Experiments
![Page 32: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/32.jpg)
Human Experiments
NL NL NL PT FR NL
Test Language
50
60
70
80
90
Acc
ura
cyLexicalized
Bleached
Humans
(note that the classifier had acces to 200 tweets)
![Page 33: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/33.jpg)
Conclusions
I Lexical models break down when used cross-language
I Bleaching text improves cross-lingual performance
I Humans performance is on par with our bleachedapproach
![Page 34: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/34.jpg)
Thanks for your attention
![Page 35: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/35.jpg)
Cross-lingual Embeddings
EN NL FR PT ESTest Language
50
60
70
80
90A
ccura
cyLexicalized
Bleached
Embeddings
See: Plank (2017) & Smith et al. (2017)
![Page 36: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/36.jpg)
Lexicalized Cross-language
Test → EN NL FR PT ES
Train
EN 52.8 48.0 51.6 50.4NL 51.1 50.3 50.0 50.2FR 55.2 50.0 58.3 57.1PT 50.2 56.4 59.6 64.8ES 50.8 50.1 55.6 61.2
Avg 51.8 52.3 53.4 55.3 55.6
![Page 37: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/37.jpg)
In-language performance
EN NL FR PT ESTest Language
60
70
80
90A
ccura
cy
Lexicalized
Bleached
![Page 38: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/38.jpg)
Bleached + Lexicalized
EN NL FR PT ESTest Language
50
60
70
80A
ccura
cy
Bleached
Bleached+lex
![Page 39: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/39.jpg)
Unigrams vs fivegrams
EN NL FR PT ESTest Language
50
60
70
80A
ccura
cy
Unigram
Fivegram
![Page 40: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/40.jpg)
Number of unique unigrams for Dutch
Feature Size
Lexicalized 281011Bleached 54103
Frequency 8Length 79PunctAgr 107PunctCons 5192Shape 2535Vowels 46198
![Page 41: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction](https://reader034.vdocuments.site/reader034/viewer/2022050219/5f64cd7af900e330e1381aed/html5/thumbnails/41.jpg)
Language to language feature analysis
45
50
55
60
65
70
EN
EN NL FR PT ES
45
50
55
60
65
NL
45
50
55
60
65
FR
TR
AIN
45
50
55
60
65
PT
45
50
55
60
65
ES
LegendvowelsshapepunctCpunctAlengthfrequencyall
TEST