comparing reconstruction methods for linguistic phylogenies
TRANSCRIPT
Comparing reconstruction methods for
linguistic phylogenies
Collaborators: Francois Barbancon, Don Ringe, Luay Nakhleh, Tandy Warnow
Steven N. EvansU.C. Berkeley Statistics, Mathematics, and Graduate Group in Computational
and Genomic Biology
Biological phylogeny
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website,University of Arizona
IndoEuropean Tree(Ringe, Warnow and Taylor
2000)
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
3 mil yrs
2 mil yrs
1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y
Some useful terminology: homoplasy
0 0 1 1
0
0
0
0 0 0 0 1
0
0
0
0 1 0 1 1
0
1
0
1
1 1 1
no homoplasy backmutation parallel evolution
What plays the role of DNA nucleotides
in linguistic phylogeny?
HANDOUT
N.B. This is NOT the correct thing to do for all reconstruction methods. It is correct for the parsimony and compatibility methods considered in the handout – see later.
How do we infer phylogenetic trees once we
have the data?
Some useful terminology:lexical clock
A B C
A
B
DD
C
lexical clock no lexical clock
edge lengths represent expected numbers of substitutions
Distancebased Phylogenetic Methods
Locating the root requires additional information
UPGMA (unweighted pair grouping method of agglomeration) finds the pair of languages (say L1 and L2) which have the smallest distance, and makes them siblings.
It treats L1 and L2 as a single language, leaving a problem with one fewer language.
It keeps pairing off siblings until there is only one language and then “unpacks” – recursion.
UPGMA works well when the evolutionary process obeys the lexical clock assumption – often used in glottochronology.
Neighbor joining (NJ) can reconstruct accurate phylogenies in the absence of a lexical clock (again through successive pairing).
Given a distance matrix which is exactly correct, NJ will return the correct tree.
A difference between NJ and UPGMA is that without a lexical clock, the pair of languages which have the smallest distance may not be siblings in the true tree. In this case, UPGMA will incorrectly join the nearest pair, but NJ will not.
NJ is believed to be one of the best distancebased methods.
Maximum Compatibility (MC) seeks a tree on which the maximum number of characters are compatible = evolve without homoplasy.
C1, C2 compatibleC1, C2, C3 incompatible
Weighted Maximum Compatibility (WMC) gives weights for each character, so that characters that are considered to be more resistant to homoplasy are given higher weight.
We seek a tree which has the smallest total weight of incompatible characters
Maximum Parsimony (MP) seeks a tree on which a minimum number of character state changes occurs.
MP is one of the most important and frequently used methods in molecular phylogenetics.
Maximum Parsimony
ACT
GTT
GTT GTA
ACA
GTA
12
2
MP score = 5
ACA ACT
GTAGTT
ACA ACT
3 1 3
MP score = 7
ACT
ACA
GTT
GTAACA GTA
1 2 1
MP score = 4
Optimal MP tree
Maximum Likelihood (ML)
• Assumes a stochastic model of evolution (including numeric parameters for substitution rates, substitution probabilities, etc.)
• Finds the tree and parameters that maximize the probability of the data.
Bayesian methods
• Assumes a stochastic model of evolution.• Combines a prior probability distribution on
the correct tree and numeric parameters with data using Bayes’ rule to get a posterior probability distribution.
• The posterior is usually approximated using Markov chain MonteCarlo (MCMC).
• Models in linguistics due to Gray & Atkinson, Nicholls.
The performance of methods on an IE data set (Nakhleh et al. 2005)
Screened dataset of 259 lexical, 13 morphological, 22 phonological chars
Unscreened dataset of 297 lexical, 17 morphological, 22 phonological chars (remove homoplasious or borrowed chars)
Some observations• UPGMA does the worst (e.g. splits Italic (Latin, Oscan,
Umbrian) and Iranian (Avestan, Persian) groups). • Other than UPGMA, all methods reconstruct the ten
major subgroups, Anatolian (Hittite, Luvian, Lycian) + Tocharian, and Greek + Armenian.
• The Satem Core (IndoIranian, BaltoSlavic) is not always reconstructed.
• The only analyses that do not put Italic and Celtic with Germanic are weighted maximum compatibility on the full datasets with high weights on morphological characters.
• When using lexical data only, all methods group Italic, Celtic, and Germanic together.
• Methods differ significantly on the datasets.
GA = Gray+Atkinson Bayesian MCMC method
WMC = weighted maximum compatibility
MC = maximum compatibility
NJ = neighbor joining
UPGMA = UPGMA
*
Different methods/datagive different answers.
We don’t know which answer is correct.Which method(s)/data
should we use?
Simulation study• Simulated evolution down networks (“genetic
trees” plus contact edges along which borrowing can occur) with 30 languages for 300 lexical characters and 60 morphological characters.
• Methods compared: UPGMA, NJ, weighted and unweighted MP, weighted MC and G+A.
• Lexical weight=1, morphological weight=50.• Compare trees constructed by various methods
to the “genetic tree” contained in the network, for topological accuracy (= distance from correct tree – see later).
Some features of the data can make reconstruction
easier or harder.
Our simulation varies these in a
CONTROLLABLE manner.
Homoplasy
0 0 1 1
0
0
0
0 0 0 0 1
0
0
0
0 1 0 1 1
0
1
0
1
1 1 1
no homoplasy backmutation parallel evolution
Lexical clock
A B C
A
B
DD
C
lexical clock no lexical clock
edge lengths represent expected numbers of substitutions
Ratesacrosssites
AB
C
AB
If a character evolves twice as fast as another on one edge, it evolves twice as fast on every edge.
D
CD
Heterotachy = departure from ratesacrosssites
The underlying tree is fixed, but there are no constraints on edge length variations between characters.
AB
C
D
A
B
C
D
A B C D E
A B C D E A B C D E A B C D E
Trees in networks borrowing
Quantifying Topological Error
FN: false negative (missing edge)FP: false positive (incorrect edge)
FN
FP
Impact of homoplasy for characters evolved down a tree under a moderate deviation from a lexical clock and moderate heterotachy. Our weighting is inappropriate for “unscreened” data.
Impact of homoplasy for characters evolved down a network with three contact edges under a moderate deviation from the lexical clock and moderate heterotachy.
Impact of the deviation from a lexical clock for characters evolved down a network with three contact edges under low levels of homoplasy and with moderate heterotachy. We vary the deviation from a lexical clock from low to moderate.
Impact of heterotachy for characters evolved down a network with three contact edges, with low homoplasy, and with moderate deviation from a lexical clock. Heterotachy increases with the parameter.
Impact of data selection for characters evolved down a network with three contact edges, under low homoplasy (``screened data"), moderate deviation from a lexical clock, and moderate heterotachy.
Impact of the number of contact edges for characters evolved under low homoplasy, moderate deviation from a lexical clock, and moderate heterotachy.
Impact of homoplasy & number of contact edges
Impact of homoplasy & heterotachy
Impact of homoplasy & deviation from a lexical clock
Conclusions and comments
• Choice of reconstruction method does matter.
• Relative performance between methods is quite stable.
• Moderate homoplasy is not a problem.• Presence of some borrowing is not a
problem.• Some amount of heterotachy helps!
Future research
• We need more investigation of methods based on stochastic models (Bayesian beyond G+A, maximum likelihood, NJ with better distance corrections), as these are now the methods of choice in biology. This requires better models of linguistic evolution and hence input from linguists!
Future research (continued)
• Should we screen? The simulation uses low homoplasy as a proxy for screening, but real screening throws away data and may introduce bias.
• How do we detect/reconstruct borrowing?• How do we handle missing data in
methods based on stochastic models?• How do we handle polymorphism?
“What song the Sirens sang, or what name Achilles assumed when he hid himself among women, though puzzling questions, are not beyond all conjecture.” Thomas Browne