comparing reconstruction methods for linguistic phylogenies

Comparing reconstruction methods for

linguistic phylogenies

Collaborators: Francois Barbancon, Don Ringe, Luay Nakhleh, Tandy Warnow

Steven N. EvansU.C. Berkeley Statistics, Mathematics, and Graduate Group in Computational

and Genomic Biology

Biological phylogeny

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website,University of Arizona

IndoEuropean Tree(Ringe, Warnow and Taylor

2000)

DNA Sequence Evolution

AAGACTT

TGGACTTAAGGCCT

3 mil yrs

2 mil yrs

1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT


AAGACTT

TGGACTTAAGGCCT


AAGGCCT TGGACTT

AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT

U V W X Y

U

V W

X

Y

Some useful terminology: homoplasy

0 0 1 1

0

0

0

0 0 0 0 1

0

0

0

0 1 0 1 1

0

1

0

1

1 1 1

no homoplasy backmutation parallel evolution

What plays the role of DNA nucleotides

in linguistic phylogeny?

HANDOUT

N.B. This is NOT the correct thing to do for all reconstruction methods. It is correct for the parsimony and compatibility methods considered in the handout – see later.

How do we infer phylogenetic trees once we

have the data?

Some useful terminology:lexical clock

A B C

A

B

DD

C

lexical clock no lexical clock

edge lengths represent expected numbers of substitutions

Distancebased Phylogenetic Methods

Locating the root requires additional information

UPGMA (unweighted pair grouping method of agglomeration) finds the pair of languages (say L1 and L2) which have the smallest distance, and makes them siblings.

It treats L1 and L2 as a single language, leaving a problem with one fewer language.

It keeps pairing off siblings until there is only one language and then “unpacks” – recursion.

UPGMA works well when the evolutionary process obeys the lexical clock assumption – often used in glottochronology.

Neighbor joining (NJ) can reconstruct accurate phylogenies in the absence of a lexical clock (again through successive pairing).

Given a distance matrix which is exactly correct, NJ will return the correct tree.

A difference between NJ and UPGMA is that without a lexical clock, the pair of languages which have the smallest distance may not be siblings in the true tree. In this case, UPGMA will incorrectly join the nearest pair, but NJ will not.

NJ is believed to be one of the best distancebased methods.

Maximum Compatibility (MC) seeks a tree on which the maximum number of characters are compatible = evolve without homoplasy.

C1, C2 compatibleC1, C2, C3 incompatible

Weighted Maximum Compatibility (WMC) gives weights for each character, so that characters that are considered to be more resistant to homoplasy are given higher weight.

We seek a tree which has the smallest total weight of incompatible characters

Maximum Parsimony (MP) seeks a tree on which a minimum number of character state changes occurs.

MP is one of the most important and frequently used methods in molecular phylogenetics.

Maximum Parsimony

ACT

GTT

GTT GTA

ACA

GTA

12

2

MP score = 5

ACA ACT

GTAGTT

ACA ACT

3 1 3

MP score = 7

ACT

ACA

GTT

GTAACA GTA

1 2 1

MP score = 4

Optimal MP tree

Maximum Likelihood (ML)

• Assumes a stochastic model of evolution (including numeric parameters for substitution rates, substitution probabilities, etc.)

• Finds the tree and parameters that maximize the probability of the data.

Bayesian methods

• Assumes a stochastic model of evolution.• Combines a prior probability distribution on

the correct tree and numeric parameters with data using Bayes’ rule to get a posterior probability distribution.

• The posterior is usually approximated using Markov chain MonteCarlo (MCMC).

• Models in linguistics due to Gray & Atkinson, Nicholls.

The performance of methods on an IE data set (Nakhleh et al. 2005)

Screened dataset of 259 lexical, 13 morphological, 22 phonological chars

Unscreened dataset of 297 lexical, 17 morphological, 22 phonological chars (remove homoplasious or borrowed chars)

Some observations• UPGMA does the worst (e.g. splits Italic (Latin, Oscan,

Umbrian) and Iranian (Avestan, Persian) groups). • Other than UPGMA, all methods reconstruct the ten

major subgroups, Anatolian (Hittite, Luvian, Lycian) + Tocharian, and Greek + Armenian.

• The Satem Core (IndoIranian, BaltoSlavic) is not always reconstructed.

• The only analyses that do not put Italic and Celtic with Germanic are weighted maximum compatibility on the full datasets with high weights on morphological characters.

• When using lexical data only, all methods group Italic, Celtic, and Germanic together.

• Methods differ significantly on the datasets.

GA = Gray+Atkinson Bayesian MCMC method

WMC = weighted maximum compatibility

MC = maximum compatibility

NJ = neighbor joining

UPGMA = UPGMA

*

Different methods/datagive different answers.

We don’t know which answer is correct.Which method(s)/data

should we use?

Simulation study• Simulated evolution down networks (“genetic

trees” plus contact edges along which borrowing can occur) with 30 languages for 300 lexical characters and 60 morphological characters.

• Methods compared: UPGMA, NJ, weighted and unweighted MP, weighted MC and G+A.

• Lexical weight=1, morphological weight=50.• Compare trees constructed by various methods

to the “genetic tree” contained in the network, for topological accuracy (= distance from correct tree – see later).

Some features of the data can make reconstruction

easier or harder.

Our simulation varies these in a

CONTROLLABLE manner.

Homoplasy

0 0 1 1

0

0

0

0 0 0 0 1

0

0

0

0 1 0 1 1

0

1

0

1

1 1 1

no homoplasy backmutation parallel evolution

Lexical clock

A B C

A

B

DD

C

lexical clock no lexical clock

edge lengths represent expected numbers of substitutions

Ratesacrosssites

AB

C

AB

If a character evolves twice as fast as another on one edge, it evolves twice as fast on every edge.

D

CD

Heterotachy = departure from ratesacrosssites

The underlying tree is fixed, but there are no constraints on edge length variations between characters.

AB

C

D

A

B

C

D

A B C D E

A B C D E A B C D E A B C D E

Trees in networks borrowing

Quantifying Topological Error

FN: false negative (missing edge)FP: false positive (incorrect edge)

FN

FP

Impact of homoplasy for characters evolved down a tree under a moderate deviation from a lexical clock and moderate heterotachy. Our weighting is inappropriate for “unscreened” data.

Impact of homoplasy for characters evolved down a network with three contact edges under a moderate deviation from the lexical clock and moderate heterotachy.

Impact of the deviation from a lexical clock for characters evolved down a network with three contact edges under low levels of homoplasy and with moderate heterotachy. We vary the deviation from a lexical clock from low to moderate.

Impact of heterotachy for characters evolved down a network with three contact edges, with low homoplasy, and with moderate deviation from a lexical clock. Heterotachy increases with the parameter.

Impact of data selection for characters evolved down a network with three contact edges, under low homoplasy (``screened data"), moderate deviation from a lexical clock, and moderate heterotachy.

Impact of the number of contact edges for characters evolved under low homoplasy, moderate deviation from a lexical clock, and moderate heterotachy.

Impact of homoplasy & number of contact edges

Impact of homoplasy & heterotachy

Impact of homoplasy & deviation from a lexical clock

Conclusions and comments

• Choice of reconstruction method does matter.

• Relative performance between methods is quite stable.

• Moderate homoplasy is not a problem.• Presence of some borrowing is not a

problem.• Some amount of heterotachy helps!

Future research

• We need more investigation of methods based on stochastic models (Bayesian beyond G+A, maximum likelihood, NJ with better distance corrections), as these are now the methods of choice in biology. This requires better models of linguistic evolution and hence input from linguists!

Future research (continued)

• Should we screen? The simulation uses low homoplasy as a proxy for screening, but real screening throws away data and may introduce bias.

• How do we detect/reconstruct borrowing?• How do we handle missing data in

methods based on stochastic models?• How do we handle polymorphism?

“What song the Sirens sang, or what name Achilles assumed when he hid himself among women, though puzzling questions, are not beyond all conjecture.” Thomas Browne

comparing reconstruction methods for linguistic phylogenies

Documents