making sense of multivariate analyses of linguistic ... · §choice of features & texts...



www.linguistik.fau.de | www.stefan-evert.de

Making sense of multivariate analyses of linguistic variation
Stefan Evert

COMPUTATIONAL CORPUS LINGUISTICS GROUP | PROFESSUR FÜR KORPUSLINGUISTIK

Multidimensional analysis (Biber 1988)
§ 481 texts, 67 lexico-grammatical features
§ unsupervised factor analysis (FA)
§ validation: separation of “known” genre categories

Problems
§ choice of features & texts
§ interpretation of FA weights

Biber, Douglas (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.

Diwersy, Sascha; Evert, Stefan; Neumann, Stella (2014). A weakly supervised multivariate approach to the study of language variation. In: Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech, pages 174–204. De Gruyter, Berlin, Boston.

Evert, Stefan & Neumann, Stella (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In: Empirical Translation Studies. New Theoretical and Methodological Traditions, TiLSM 300, pages 47–80. Mouton de Gruyter, Berlin.

Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2015). Towards a better understanding of Burrows's Delta in literary authorship attribution. In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pages 79–88, Denver, CO.

Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities. Advance access: https://doi.org/10.1093/llc/fqw046.

Case study II: Evidence for shining-through in translations
minimally supervised PCA (linear discriminant analysis)
§ 298 texts from CroCo corpus (78× EN➞DE, 71× DE➞EN)
§ 27 features grounded in SFL
§ LDA for DE vs. EN originals
§ position of translations ➞ evidence for shining-through

Problems
§ interpretation of LDA weights
§ are weights stable or do they depend on choice of texts?
§ is our selection of features crucial to the results?
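A minimal sketch of the idea behind Case study II, assuming only two features so that the within-group scatter matrix can be inverted in closed form (the poster uses 27 features). The data, the function names `fisher_axis` and `score`, and the single "translation" row are hypothetical and only illustrate how a discriminant axis is fitted on originals and then used to position a translation.

```python
def mean_vector(rows):
    """Column means of a list of 2-feature rows."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def fisher_axis(group_a, group_b):
    """Two-class Fisher/LDA direction w = S_w^{-1} (mean_b - mean_a), 2 features only."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    scatter = [[0.0, 0.0], [0.0, 0.0]]  # pooled within-group scatter matrix
    for rows, m in ((group_a, mean_a), (group_b, mean_b)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for p in range(2):
                for q in range(2):
                    scatter[p][q] += d[p] * d[q]
    det = scatter[0][0] * scatter[1][1] - scatter[0][1] * scatter[1][0]
    diff = [mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]]
    # closed-form inverse of the 2x2 scatter matrix applied to the mean difference
    w0 = (scatter[1][1] * diff[0] - scatter[0][1] * diff[1]) / det
    w1 = (-scatter[1][0] * diff[0] + scatter[0][0] * diff[1]) / det
    midpoint = [(mean_a[0] + mean_b[0]) / 2, (mean_a[1] + mean_b[1]) / 2]
    return [w0, w1], midpoint

def score(row, weights, midpoint):
    """Position of one text on the discriminant axis (0 = midpoint of group means)."""
    return sum(w * (x - c) for w, x, c in zip(weights, row, midpoint))

# Hypothetical z-scores of two features for original texts.
de_orig = [[-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.4], [-0.9, -0.3]]
en_orig = [[1.0, 0.1], [0.9, -0.2], [1.1, 0.3], [0.8, -0.4]]
translation = [0.3, 0.0]  # hypothetical English translation of a German text

weights, midpoint = fisher_axis(de_orig, en_orig)
de_scores = [score(r, weights, midpoint) for r in de_orig]
en_scores = [score(r, weights, midpoint) for r in en_orig]
trans_score = score(translation, weights, midpoint)
```

In this toy setup the translation lands on the EN side but closer to the DE originals than the average EN original, which is the kind of displacement the poster reads as shining-through.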

Case study I: Authorship attribution with Burrows’s Delta
unsupervised clustering
§ 25 authors × 3 novels for EN, DE, FR
§ 200 – 5000 features
§ Ward clustering / PAM

Problems
§ only 75 texts
§ how & why does Δ_B work so well?

$$\Delta_B(D_1, D_2) = \sum_{i=1}^{n_w} \left| z_i(D_1) - z_i(D_2) \right|$$
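The Delta formula can be sketched in a few lines: standardize each word's relative frequency across all texts, then sum the absolute z-score differences between two texts. The frequencies below are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption of this sketch, not something the poster specifies.

```python
from statistics import mean, pstdev

def z_scores(rel_freqs):
    """Standardize each word's relative frequency across all texts (column-wise)."""
    n_words = len(rel_freqs[0])
    columns = [[row[i] for row in rel_freqs] for i in range(n_words)]
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # assumes no word has constant frequency
    return [[(row[i] - mus[i]) / sds[i] for i in range(n_words)] for row in rel_freqs]

def burrows_delta(z1, z2):
    """Sum of absolute z-score differences (Manhattan / L1 distance)."""
    return sum(abs(a - b) for a, b in zip(z1, z2))

# Hypothetical relative frequencies of three words in three texts.
rel_freqs = [
    [0.050, 0.020, 0.010],  # text A (author 1)
    [0.048, 0.022, 0.011],  # text B (author 1)
    [0.030, 0.035, 0.004],  # text C (author 2)
]
z = z_scores(rel_freqs)
d_ab = burrows_delta(z[0], z[1])  # same author: small distance
d_ac = burrows_delta(z[0], z[2])  # different authors: larger distance
```

The resulting pairwise distances are what a clustering method such as Ward or PAM would take as input.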

[Figure: Ward clustering (English, z-scores, BD, n=1000). Dendrogram of the 75 English novels, leaves labelled "author: title"; height axis from 0 to 2500.]

[Figure: English Corpus | L2 normalization | PAM clustering. x-axis: number of mfw (10 to 10000); y-axis: adjusted Rand index (%), 0 to 100. Legend: Cosine Delta, L1 2−Delta, Burrows (L1) Delta, Quadratic (L2) Delta, L4−Delta.]

Evert et al. (2015, 2017)
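The adjusted Rand index used as the evaluation measure compares a clustering with the known author labels and corrects for chance agreement. A minimal sketch using the standard pair-counting definition follows; the labels are hypothetical, and the function assumes at least two texts.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index for two labelings of the same items (n >= 2)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical example: 3 authors x 3 novels.
authors = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
perfect = [0, 0, 0, 1, 1, 1, 2, 2, 2]
one_error = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # one novel of author "a" misplaced

ari_perfect = adjusted_rand_index(authors, perfect)
ari_one_error = adjusted_rand_index(authors, one_error)
```

Multiplying the result by 100 gives the percentage scale used on the plot's y-axis.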

[Figure: values of the 27 features (nn / T, adja / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, subordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, text theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, prep / T, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S). y-axis: z-score = standardized relative frequency, −5 to 5.]

[Figure: density of discriminant scores. x-axis: discriminant score (−4 to 4); y-axis: density (0.0 to 0.8); curves: DE: orig, DE: trans, EN: orig, EN: trans.]

www.stefan-evert.de/PUB/EvertNeumann2017/
Diwersy et al. (2014); Evert & Neumann (2017)

[Legend: DE, EN; orig, trans]

[Figure: standardized z-scores | L2 normalization (axis from −0.3 to 0.3). Series: C. Brontë: Jane Eyre; C. Brontë: Shirley.]

Interpretation of dimension weights
§ standard approach based on magnitude and sign of weights (EN on positive side of axis)
§ interprets features as correlated rather than complementary
§ better approach: what does each feature contribute to the LDA positions of texts?
§ reveals entirely different patterns
§ correlated features help LDA to reduce within-group variance
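One plausible reading of "what does each feature contribute to the LDA positions of texts?" is to split each text's axis score into per-feature terms w_j · z_ij and compare their group means. The sketch below does that for hypothetical weights and z-scores; it is not necessarily the exact computation behind the poster's figure.

```python
from statistics import mean

def contributions(weights, z_row):
    """Per-feature contributions w_j * z_ij; their sum is the text's axis score."""
    return [w * z for w, z in zip(weights, z_row)]

# Hypothetical LDA weights and z-scores for two features.
weights = [0.8, -0.6]
groups = {
    "DE": [[-1.0, 0.5], [-0.8, 0.7]],
    "EN": [[1.0, 0.6], [0.8, 0.4]],
}

# Mean contribution of each feature within each group.
mean_contrib = {
    g: [mean(col) for col in zip(*(contributions(weights, row) for row in rows))]
    for g, rows in groups.items()
}
# How far each feature pushes the two groups apart on the axis.
gap = [en - de for de, en in zip(mean_contrib["DE"], mean_contrib["EN"])]
```

In this toy example the second feature has a sizeable negative weight, yet it shifts both groups by almost the same amount, so it contributes very little to the separation. Reading the weight alone would suggest otherwise.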

[Figure: EN / DE discriminant. Normalized feature weights (−0.2 to 0.2) for the 27 features (nn_T through token_S).]

[Figure: per-feature weight (−0.2 to 0.2) and contribution to axis scores (−1 to 2) by group (DE, EN) for the 27 features; some feature labels carry a (−) prefix.]

DE / EN discriminant (original texts)

What are the characteristic words?
§ supervised recursive feature elimination ➞ 233 words as features
§ not just mfw, but none unique to one author
§ with, so, t, But, And, upon, don, head, Then, looking, almost, indeed, nor, …, XXXVII (df=34), XLI (df=29), XLIII (df=26), hereabout (df=11), vilest (df=15), contours (df=9), Ecod (df=4), … (df = document frequency, # novels)
§ validation for DE: new novels from same authors: 97% accuracy

Work in progress
§ contribution of features to silhouette width of clustering
§ assess relevance to each author
§ identify features responsible for mis-classifications

[Figure: Silhouette widths (English, z-scores, BD, n=1000, Ward). Bars labelled by novel ("author: title"), grouped by cluster; x-axis: silhouette width s_i, 0.0 to 1.0. n = 75, 25 clusters C_j.]

Cluster summary (j : n_j | ave_{i∈C_j} s_i):
1 : 3 | 0.33; 2 : 3 | 0.11; 3 : 3 | 0.51; 4 : 3 | 0.33; 5 : 3 | 0.24;
6 : 3 | 0.24; 7 : 3 | 0.06; 8 : 3 | 0.16; 9 : 6 | 0.07; 10 : 3 | 0.17;
11 : 3 | 0.28; 12 : 4 | 0.04; 13 : 3 | 0.06; 14 : 3 | 0.40; 15 : 3 | 0.10;
16 : 2 | 0.08; 17 : 3 | 0.16; 18 : 3 | 0.11; 19 : 3 | 0.19; 20 : 3 | 0.18;
21 : 3 | 0.22; 22 : 3 | 0.18; 23 : 3 | 0.18; 24 : 2 | 0.16; 25 : 1 | 0.00
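For reference, the silhouette width s_i of a text compares its mean distance a_i to the other texts in its own cluster with the smallest mean distance b_i to any other cluster: s_i = (b_i − a_i) / max(a_i, b_i). A minimal sketch on a hypothetical distance matrix follows. It assumes at least two clusters and assigns 0 to singleton clusters, a common convention.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of every item, given a full distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab_i in enumerate(labels):
        own = [j for j in clusters[lab_i] if j != i]
        if not own:
            widths.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own) / len(own)  # cohesion within own cluster
        b = min(  # separation from the nearest other cluster
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        widths.append((b - a) / max(a, b))
    return widths

# Hypothetical distances between four texts forming two tight clusters.
dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 5],
    [4, 4, 0, 1],
    [5, 5, 1, 0],
]
s = silhouette_widths(dist, [0, 0, 1, 1])
s_singletons = silhouette_widths(dist, [0, 0, 1, 2])
```

Averaging s_i over the members of a cluster gives per-cluster values of the kind reported in the figure. The feature-level decomposition mentioned under "Work in progress" is not attempted here.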

Reliability of the clustering
§ bootstrapping texts not applicable to clustering & high-dimensional feature space
§ bootstrapping features ➞ unclear
§ biggest factor: choice of authors (empirical study on Gutenberg archive)

Bootstrapping latent dimensions
§ bootstrapping / cross-validation can be used to assess stability of LDA & PCA dimensions (applicable because of small # of features)
§ LDA axis “wobbles” by approx. 10° across folds
§ moderate variability of feature weights: σ < 0.05
§ but positions of texts on LDA axis are stable (r = .987)
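The two stability measures quoted above can be sketched as follows: the angle between the axis fitted on all texts and the axis fitted on one fold, and the Pearson correlation between the text positions obtained from the two axes. Weight vectors and z-scores below are hypothetical. Fitting the axes themselves (the resampling loop) is left out of this sketch.

```python
from math import acos, degrees, sqrt

def angle_degrees(u, v):
    """Angle between two weight vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return degrees(acos(cos))

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def project(weights, rows):
    """Positions of texts on the axis defined by `weights`."""
    return [sum(w * z for w, z in zip(weights, row)) for row in rows]

# Hypothetical axis estimates (full data vs. one cross-validation fold).
w_full = [0.8, 0.6]
w_fold = [0.9, 0.45]
# Hypothetical z-scores of five texts on the same two features.
texts = [[-1.2, -0.4], [-0.7, -0.9], [0.2, 0.1], [0.9, 0.6], [1.1, 0.8]]

wobble = angle_degrees(w_full, w_fold)
r = pearson_r(project(w_full, texts), project(w_fold, texts))
```

Even with the axis rotated by roughly ten degrees, the projected positions in this toy example remain almost perfectly correlated. This matches the pattern the last two bullets describe.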