contextual*featuresfor* intertext*discovery* · beyondverbal’ identy ’ contextual*featuresfor*...

1
Beyond Verbal Iden.ty Contextual Features for Intertext Discovery C. W. Forstall 1 , L. Galli Milić 1 , N. Coffee 2 and D. Nelis 1 1. Université de Genève 2. University at Buffalo, the State University of New York MoJvaJon We use the textreuse detecJon tool Tesserae to locate potenJally interesJng allusions in Flavian epic poetry. While themaJc resemblance at the scene level is oQen important to establishing the connecJon between two passages and thus the significance of an allusion, Tesserae presently focuses on localized reuse of specific phrases and may miss higherlevel contextual cues. We are tesJng the viability of largerscale, “themaJc” features targeted at the scene or paragraph level. Our goal is to modify the rankings of verbal correspondences idenJfied by Tesserae according to the similarity of the respecJve contents of the phrases. For example, the pair of phrases below was ranked 379th of 912 results by Tesserae, but in the context of systemaJc structural similarity (see right), otherwise lowerranking textreuse becomes more interesJng. Valerius Flaccus, Argonau(ca 5 5.170a 5.70b176 5.177216 5.217277 5.278295 5.296328 [BOOK DIVISION] Mariandyni; death and burial of Idmon and Tiphys; Erginus chosen as helmsman. Departure, voyage along southern coast of Black Sea; Argonauts pass the Chalybes, Carambis and Prometheus. Evening and arrival in the Phasis. Prayer of Jason. InvocaJon of a Muse (dea) and the situaJon in Colchis. Divine intervenJon: Juno and Minerva. War. Argonauts make their way to the city and palace of Aietes. Methods Corpus and text preparaJon Our corpus was primarily epic, enlarged to include Ovid’s Heroides and Seneca’s Medea, which we felt might show affiniJes of style and content to our text of interest, Valerius Flaccus’ Argonau9ca. Each sample was 30 lines of text. Iinflected forms were reduced to lemmata, using methods comparable to those in Tesserae. All preprocessing and subsequent analysis was done using R, with the help of the cluster, mclust, tm and topicmodels packages. Unsupervised classificaJon We used kmeans clustering to search for stable clusters of passages that shared similar language across works. Clustering was performed on two different feature sets: 1) TFIDF weighted scores for all the words in the corpus common to two or more 30line samples. Each sample was represented by a vector of approximately 8,000 frequencies. 2) A set of 50 topics generated using Latent Derichlet allocaJon (LDA). Each sample was represented by 50 values, represenJng its scores for each of the topics. CorrelaJon between clusterings We tested correlaJon between the clusters generated by kmeans using the adjusted rand index. This gives, for two classificaJons, a measure of their correlaJon above what is expected by chance. The box plot at right shows correlaJon between kmeans clustering and true authorship, over 10 repeJJons of the clustering for each treatment: midf scores on the leQ, and LDA topic scores on the right. We chose k = 11, the number of authors in the corpus. LDA was effecJve at reducing the otherwise significant impact of authorship on the classificaJon. Cluster stability We varied k, the number of classes, from 2 to 12, and for each value of k we generated 15 clusterings. Adjusted rand indices were calculated for each of 105 possible pairs of clusters for a given value of k. The distribuJons of these (right) provide an indicaJon of the stability of each configuraJon of classes: small numbers of classes are highly stable; among larger values of k, divisions into 6 and 7 classes are most stable. 2 3 4 5 6 7 8 9 10 11 12 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 lda 50 topics classes adjusted rand score Vergil, Aeneid 7 7.17 7.824 7.2536 7.37106 7.107147 7.148285 7.286640 [BOOK DIVISION] Death and burial of Caieta; departure. Voyage along the coast; Trojans pass Circe’s land. Dawn and arrival in Tiber. InvocaJon of the Muse Erato and the situaJon in LaJum. Meal. Prayer of Aeneas; sacrifice. Trojans make their way to the city and palace of LaJnus. Divine intervenJon: Juno and Allecto. War. ThemaJc similarity We see similar themaJc elements in the openings of Aeneid 7 and Valerius Flaccus’ ArgonauJca 5, in both cases at (what was likely) the midpoint of the narraJve. Pairwise adjusted rand index randscores.giant.test Frequency 0.2 0.4 0.6 0.8 0 500 1000 1500 2000 Below: a closeup showing only Vergil’s Aeneid and Valerius Flaccus’ Argonau9ca. This is the type of result that we are looking for: samples fall into mulJple classes and are not segregated by author. The figures above show the author effect graphically: for the TFIDF features the disJnctness of authors such as Ovid, Seneca, Lucan, Silius Italicus and Corippus from the central cloud is apparent. Under the LDA treatment, only Ovid maintained the same degree of separaJon. The effects of authorship Topic Stability To test the stability of LDA, we generated 100 different LDA models of 50 topics, performing kmeans clustering on each one with k = 7. The figure at right shows the distribuJon of adjusted rand index values for 4950 pairwise comparisons between the 100 classificaJons produced. CorrelaJon is consistent but low, at around 0.25, with one or two outlier cases having high agreement. Sample results Above: book 7 of the Aeneid. The first half of the book, which features more peaceful content, alternates between classes 6 and 7, the most general of the epic classes. The preparaJons for war in the book’s second half group with class 2. Two passages affiliate with more authorspecific groups: Juno’s speech at 286 falls in the group dominated by Ovid’s Metamorphoses, while the single brief baple scene groups with Lucan’s Civil War in class 3. 2 3 4 5 6 7 Vergil Aeneid 7 first verse of sample class 6.887 7.16 7.46 7.76 7.106 7.136 7.166 7.196 7.226 7.256 7.286 7.316 7.346 7.376 7.406 7.436 7.466 7.496 7.526 7.556 7.586 7.616 7.646 7.676 7.706 7.736 7.766 7.796 tfidf lda 0.2 0.3 0.4 0.5 0.6 0.7 correlation with authorship adjusted rand index Below we show one example of kmeans clustering into 7 classes, taken from the topic stability experiments described above. Point size shows how oQen, in 100 different tests, each sample fell into the class shown here. F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V VV V V V V V V V 5 0 5 10 5 0 5 10 Closeup: Vergil vs. Valerius Flaccus PC1 PC2 classification class 1 class 2 class 3 class 4 class 5 class 6 class 7 F V authorship valerius_flaccus vergil 15 10 5 0 5 10 15 10 5 0 5 10 15 Effects of authorship TFIDF by author PC1 PC2 baebius_italicus catullus corippus ennius lucan ovid seneca silius_italicus statius valerius_flaccus vergil 15 10 5 0 5 4 2 0 2 4 Effects of authorship LDA by author PC1 PC2 baebius_italicus catullus corippus ennius lucan ovid seneca silius_italicus statius valerius_flaccus vergil 15 10 5 0 5 10 15 10 5 0 5 10 15 kmeans classification of lda PC1 PC2 classification class 1 class 2 class 3 class 4 class 5 class 6 class 7 authorship baebius_italicus catullus corippus ennius lucan ovid seneca silius_italicus statius valerius_flaccus vergil … etenim dat candida certam nox Helicen. (Val. Flac. 5.70) adspirant aurae in noctem nec candida cursus luna negat, splendet tremulo sub lumine pontus. (Verg. Aen. 7.8)

Upload: others

Post on 26-Jun-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Contextual*Featuresfor* Intertext*Discovery* · BeyondVerbal’ Identy ’ Contextual*Featuresfor* Intertext*Discovery* C.*W.*Forstall1,*L.*Galli*Milić1,N.*Coffee 2*and*D.*Nelis1

Beyond  Verbal  Iden.ty  Contextual  Features  for  Intertext  Discovery  

C.  W.  Forstall1,  L.  Galli  Milić1,  N.  Coffee2  and  D.  Nelis1  1.  Universite  de  Genève          2.  University  at  Buffalo,  the  State  University  of  New  York  

MoJvaJon    We  use  the  text-­‐reuse  detecJon  tool  Tesserae  to  locate  potenJally  interesJng  allusions  in  Flavian  epic  poetry.  While  themaJc  resemblance  at  the  scene  level  is  oQen  important  to  establishing  the  connecJon  between  two  passages  and  thus  the  significance  of  an  allusion,  Tesserae  presently  focuses  on  localized  re-­‐use  of  specific  phrases  and  may  miss  higher-­‐level  contextual  cues.    We  are  tesJng  the  viability  of  larger-­‐scale,  “themaJc”  features  targeted  at  the  scene  or  paragraph  level.  Our  goal  is  to  modify  the  rankings  of  verbal  correspondences  idenJfied  by  Tesserae  according  to  the  similarity  of  the  respecJve  contents  of  the  phrases.    For  example,  the  pair  of  phrases  below  was  ranked  379th  of  912  results  by  Tesserae,  but  in  the  context  of  systemaJc  structural  similarity  (see  right),  otherwise  lower-­‐ranking  text-­‐reuse  becomes  more  interesJng.    

Valerius  Flaccus,  Argonau(ca  5  

   5.1-­‐70a        5.70b-­‐176          5.177-­‐216      5.217-­‐277      5.278-­‐295      5.296-­‐328  

[BOOK  DIVISION]    Mariandyni;  death  and  burial  of  Idmon  and  Tiphys;  Erginus  chosen  as  helmsman.    Departure,  voyage  along  southern  coast  of  Black  Sea;  Argonauts  pass  the  Chalybes,  Carambis  and  Prometheus.    Evening  and  arrival  in  the  Phasis.  Prayer  of  Jason.    InvocaJon  of  a  Muse  (dea)  and  the  situaJon  in  Colchis.    Divine  intervenJon:  Juno  and  Minerva.  War.    Argonauts  make  their  way  to  the  city  and  palace  of  Aietes.      

Methods    Corpus  and  text  preparaJon    Our  corpus  was  primarily  epic,  enlarged  to  include  Ovid’s  Heroides  and  Seneca’s  Medea,  which  we  felt  might  show  affiniJes  of  style  and  content  to  our  text  of  interest,  Valerius  Flaccus’  Argonau9ca.      Each  sample  was  30  lines  of  text.  Iinflected  forms  were  reduced  to  lemmata,  using  methods  comparable  to  those  in  Tesserae.    All  preprocessing  and  subsequent  analysis  was  done  using  R,  with  the  help  of  the  cluster,  mclust,  tm  and  topicmodels  packages.    Unsupervised  classificaJon    We  used  k-­‐means  clustering  to  search  for  stable  clusters  of  passages  that  shared  similar  language  across  works.  Clustering  was  performed  on  two  different  feature  sets:  1)  TF-­‐IDF  weighted  scores  for  all  the  words  in  the  corpus  

common  to  two  or  more  30-­‐line  samples.  Each  sample  was  represented  by  a  vector  of  approximately  8,000  frequencies.  

2)  A  set  of  50  topics  generated  using  Latent  Derichlet  allocaJon  (LDA).  Each  sample  was  represented  by  50  values,  represenJng  its  scores  for  each  of  the  topics.  

CorrelaJon  between  clusterings    We  tested  correlaJon  between  the  clusters  generated  by  k-­‐means  using  the  adjusted  rand  index.  This  gives,  for  two  classificaJons,  a  measure  of  their  correlaJon  above  what  is  expected  by  chance.      

 The  box  plot  at  right  shows  correlaJon  between  k-­‐means  clustering  and  true  authorship,  over  10  repeJJons  of  the  clustering  for  each  treatment:  m-­‐idf  scores  on  the  leQ,  and  LDA  topic  scores  on  the  right.    We  chose  k  =  11,  the  number  of  authors  in  the  corpus.      LDA  was  effecJve  at  reducing  the  otherwise  significant  impact  of  authorship  on  the  classificaJon.    

Cluster  stability    We  varied  k,  the  number  of  classes,  from  2  to  12,  and  for  each  value  of  k  we  generated  15  clusterings.  Adjusted  rand  indices  were  calculated  for  each  of  105  possible  pairs  of  clusters  for  a  given  value  of  k.    The  distribuJons  of  these  (right)  provide  an  indicaJon  of  the  stability  of  each  configuraJon  of  classes:  small  numbers  of  classes  are  highly  stable;  among  larger  values  of  k,  divisions  into  6  and  7  classes  are  most  stable.  

●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●●●

●●●●

●●

●●●●

2 3 4 5 6 7 8 9 10 11 12

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

lda 50 topics

classes

adju

sted

rand

sco

re

Vergil,  Aeneid  7  

   7.1-­‐7      7.8-­‐24      7.25-­‐36    7.37-­‐106      7.107-­‐147    7.148-­‐285      7.286-­‐640      

[BOOK  DIVISION]    Death  and  burial  of  Caieta;  departure.    Voyage  along  the  coast;  Trojans  pass  Circe’s  land.    Dawn  and  arrival  in  Tiber.      InvocaJon  of  the  Muse  Erato  and  the  situaJon  in  LaJum.    Meal.  Prayer  of  Aeneas;  sacrifice.    Trojans  make  their  way  to  the  city  and  palace  of  LaJnus.    Divine  intervenJon:  Juno  and  Allecto.  War.  

ThemaJc  similarity    We  see  similar  themaJc  elements  in  the  openings  of  Aeneid  7  and  Valerius  Flaccus’  ArgonauJca  5,  in  both  cases  at  (what  was  likely)  the  mid-­‐point  of  the  narraJve.    

Pairwise adjusted rand index

randscores.giant.test

Frequency

0.2 0.4 0.6 0.8

0500

1000

1500

2000

Below:  a  close-­‐up  showing  only  Vergil’s  Aeneid  and  Valerius  Flaccus’  Argonau9ca.  This  is  the  type  of  result  that  we  are  looking  for:  samples  fall  into  mulJple  classes  and  are  not  segregated  by  author.  

The  figures  above  show  the  author  effect  graphically:  for  the  TF-­‐IDF  features  the  disJnctness  of  authors  such  as  Ovid,  Seneca,  Lucan,  Silius  Italicus  and  Corippus  from  the  central  cloud  is  apparent.  Under  the  LDA  treatment,  only  Ovid  maintained  the  same  degree  of  separaJon.  

The  effects  of  authorship  

Topic  Stability    To  test  the  stability  of  LDA,  we  generated  100  different  LDA  models  of  50  topics,  performing  k-­‐means  clustering  on  each  one  with  k  =  7.    The  figure  at  right  shows  the  distribuJon  of  adjusted  rand  index  values  for  4950  pairwise  comparisons  between  the  100  classificaJons  produced.  CorrelaJon  is  consistent  but  low,  at  around  0.25,  with  one  or  two  outlier  cases  having  high  agreement.  

Sample  results  

Above:  book  7  of  the  Aeneid.  The  first  half  of  the  book,  which  features  more  peaceful  content,  alternates  between  classes  6  and  7,  the  most  general  of  the  epic  classes.  The  preparaJons  for  war  in  the  book’s  second  half  group  with  class  2.  Two  passages  affiliate  with  more  author-­‐specific  groups:  Juno’s  speech  at  286  falls  in  the  group  dominated  by  Ovid’s  Metamorphoses,  while  the  single  brief  baple  scene  groups  with  Lucan’s  Civil  War  in  class  3.  

23

45

67

Vergil Aeneid 7

first verse of sample

clas

s

6.88

77.

167.

467.

767.

106

7.13

67.

166

7.19

67.

226

7.25

67.

286

7.31

67.

346

7.37

67.

406

7.43

67.

466

7.49

67.

526

7.55

67.

586

7.61

67.

646

7.67

67.

706

7.73

67.

766

7.79

6

tf−idf lda

0.2

0.3

0.4

0.5

0.6

0.7

correlation with authorship

adju

sted

rand

inde

x

Below  we  show  one  example  of  k-­‐means  clustering  into  7  classes,  taken  from  the  topic  stability  experiments  described  above.  Point  size  shows  how  oQen,  in  100  different  tests,  each  sample  fell  into  the  class  shown  here.  

FF

FF

F

F

FF

F

F

FF

FFF

F

FF F

FFFF

F

FF

F

F

F FF F

F

F

F

FF

F

F

FFF

FFF

F

F

FF FF F

FF

F

F

FF

FF

F

FF F

F

F FF

F F

FF

F F

FF F F

F

FF

FF F

F

F

FF

F

FF

F F

F

FF

F

F

FF

FFF

F

F

FF

FF

F

F

FF

F

FF

F

F

F

FF

F

F

FF

FF

F

F

FF

F

F

FF

F

F

FF

F

FF

F

F

FF

F

F

FFF F

FF

FFF

F

F

F

F

FF

F

FF

FF F FF

F

F

F FF

F

FF

FF

FF

FFF V

VV

VV

V

V

VVV

V

V V

VV

V

V

VV

V

V

V

V

VV

V V

V

VVV V

V

V

V

V

V

V VV

VV

VVV

VV

VV V

VV

V

V

VV

VVV

V

V

V

V

V V

V

V

VV

V

V V

V

VV

V

V VV

V

V

VV

V

V

V

VV

VVV

VV

V

V

V

VV

VVVV

VV

V

V

VV VV

VVV

VV

V

V

VV

V

V

VV

V

V

V

VVV

V

V V

VV

V

VV

V

V

VV V

V

VV

V

VV

VVVV

V VV

VVV

V VV V

V VV

V

VV

VV

V

V

V

VV

V

V

VV

V

V

V V

V

VV

VV

V

VV

V V

V

VV

V

V

V

V

V

V

V

VV

VVV

VVV

V

VV

V VVVV

V

VV VV

V VVV

V

V

V

V

V

V

V V

V

V

VVV

V

V

V

V

V

V V V

V

V

VV

V

V

V V

V

V

V

V VV

VVV V VVV

V

V

VV

V

VV

VV

V

VV V

VV

V

V

V

V

V

V

V

V

V

V

V

V

V

V

V

VV

V

V

V

V V

VV

VV

V

VV

V

V

V

V

V

V

V VV

V

V

V

VV

VV

−5 0 5 10

−50

510

Close−up: Vergil vs. Valerius Flaccus

PC1

PC2

classificationclass 1class 2class 3class 4class 5class 6class 7

FV

authorshipvalerius_flaccusvergil

●●

●● ●

●●

●●●●

●●● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●

● ●

●● ● ●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●●

●●●

●●

●●

●●

●●

●●

−15 −10 −5 0 5 10 15

−10

−50

510

15

Effects of authorshipTF−IDF by author

PC1

PC2

baebius_italicuscatulluscorippusenniuslucanovidsenecasilius_italicusstatiusvalerius_flaccusvergil

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●●●

●●

●●

● ●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

−15 −10 −5 0 5

−4−2

02

4

Effects of authorshipLDA by author

PC1

PC2

baebius_italicuscatulluscorippusenniuslucanovidsenecasilius_italicusstatiusvalerius_flaccusvergil

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●

● ●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

−15 −10 −5 0 5 10 15

−10

−50

510

15

k−means classification of lda

PC1

PC2

classificationclass 1class 2class 3class 4class 5class 6class 7

authorshipbaebius_italicuscatulluscorippusenniuslucanovidsenecasilius_italicusstatiusvalerius_flaccusvergil

…  etenim  dat  candida  certam    nox  Helicen.  

 (Val.  Flac.  5.70)      adspirant  aurae  in  noctem  nec  candida  cursus  luna  negat,  splendet  tremulo  sub  lumine  pontus.  

 (Verg.  Aen.  7.8)