Genome Rearrangement PhylogenyGenome Rearrangement Phylogeny
Robert K. JansenSchool of Biology
University of Texas at Austin
Bernard M.E. MoretDepartment of Computer Science
University of New Mexico
Li-San Wang Tandy WarnowDepartment of Computer Sciences
University of Texas at Austin
2
Outline
• Introduction• Genome rearrangement phylogeny
reconstruction• Application• Other methods• Future research
3
New Phylogenetic Signals
• Large-throughput sequencing efforts lead to larger datasets− Challenge: inferring deep evolutionary events
• Biologists turning to “rare genomic changes”− Rare− Large state space− High signal-to-noise ratio− Potential for clarifying early evolution− Best studied: gene order evolution
(genome rearrangement)
5
Gene Order Data
• Rare changes on the genomic scale• Large state space
− DNA: 4 states/character
− Protein (amino acid sequence): 20 states/character
− Circular gene order with 120 genes:
• High signal-to-noise ratio
232119 107049.3!1192 states/character
6
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
1 2 –6 –5 -4 -3 7 8 9 10
1 2 7 8 3 4 5 6 9 10
1 2 7 8 –6 -5 -4 -3 9 10
Inversion:
Transposition:
Inverted Transposition:
7
Edit Distances Between Genomes
• (INV) Inversion distance [Hannenhalli & Pevzner 1995]
− Computable in linear time [Moret et al 2001]• (BP) Breakpoint distance [Watterson et al. 1982]
− Computable in linear time− NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999]
1 2 3 4 5 6 7 8 9 10
1 2 3 -8 -7 -6 4 5 9 10
A =
B =
BP(A,B)=3
8
Our Model: the Generalized Nadeau-Taylor Model [STOC’01]
• Three types of events: − Inversions (INV)− Transpositions (TRP)− Inverted Transpositions (ITP)
• Events of the same type are equiprobable• Probabilities of the three types have fixed ratio
• We focus on signed circular genomes in this talk.
9
Simulation Study Protocol
Synthetic InputSynthetic Input
PhylogeneticMethod
PhylogeneticMethod
DB
A
CE
FInferred Tree
A B
CD
E F
True TreeEvolutionary Process
Known in simulation
A 1 -4 2 -3 5 6
B 1 3 2 5 4 6
C 1 3 -2 5 6 4
D 1 3 4 5 -2 6
E 1 4 5 -3 -2 6
F 1 2 3 4 5 6
10
Quantifying Error
FN: false negative (missing edge)
1/3=33.3% error rate
A B
CD
E F
True Tree
A B
CD
E F
True Tree
A B
CD
E F
True Tree DB
A
CE
F
Inferred Tree
DB
A
CE
F
DB
A
CE
F
Inferred Tree
11
Outline
• Genome rearrangement evolution• Genome rearrangement phylogeny
reconstruction• Application• Other methods• Future research
12
Gene Order Parsimony
Length (T) = min AX+BX+XY+CY+YZ+DZ+YW+EW+FW
X,Y,Z,W
A
B
C
D
E
F
X
Y
Z W
A
B
C
D
E
F
X
Y
Z W
A
B
C
D
E
F
13
Breakpoint Phylogeny[Sankoff & Blanchette 1998]
• “Maximum Parsimony”-style problem:− Find tree(s), leaf-labeled by genomes, with
shortest breakpoint length
• NP-hard problem on two levels:− Find the shortest tree (the space of trees has
exponential size)− Given a tree, find its breakpoint length
(Even for a tree with 3 leaves, but can be reduced to TSP)
• BPAnalysis [Sankoff & Blanchette 1998] − Takes 200 years to compute our 13-taxon
dataset on a Sun workstation
14
X
Y
Z W
A
B
C
D
E
F
X
Y
Z W
A
B
C
D
E
F
BPAnalysis
• Tree length evaluation for EVERY tree• Given a fixed tree topology, evaluate the tree
length:− Iteratively evaluate the median problem (tree
length for a 3-leaf tree)
A
BY
X’C
Z
Y’X’
15
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
http://www.cs.unm.edu/~moret/GRAPPA/• Uses lowerbound techniques to speed up• Used on real datasets, producing thousand-fold
speedups over BPAnalysis [ISMB’01]• Contributors: (led by Bernard Moret at UNM)
U. New MexicoU. Texas at AustinUniversitá di Bologna, Italy
16
The Circular Lowerbound of the Length of a Tree
• Given a tree, we can lowerbound its length very quickly:
1 2 3 4
1,44,33,22,1)()(2 ddddTlbTw
)(2 Tw
)(Tlb
17
The Lowerbound Technique
• Avoid any tree X without potential:− tree X whose lowerbound lb(X) is higher than
twice the length c(T) of the best tree T
• Finding a good starting tree quickly is of utmost importance
• We turn to distance-based methods− Neighbor joining (NJ) [Saitou and Nei 1987]− Weighbor [Bruno et al. 2000]
18
Additive Distance Matrix and True Evolutionary Distance (T.E.D.)
S2 S3 S4 S5
S1 0 9 15 14 17S2 0 14 13 16S3 0 13 16S4 0 13 13
75
4
5
8
S1
S2
S3
S4
S5
S1
S5 0
Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distance in O(m2) time.
19
Error Tolerance of Neighbor Joining
Theorem [Atteson 1999]Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have
then neighbor joining returns T.
2
1|| ijij dD
20
BP and INV
INV vs K(120 genes)
(K: Actual number of inversions) (Inversion-only evolution)
BP/2 vs K
21
NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and NJ(INV)
120 genes, 160 leavesUniformly Random Tree
Transpositions/inverted transpositions only
Inversion only
22
Estimate True Evolutionary DistancesUsing BP
BP/2 vs K (120 genes)
(K: Actual number of inversions) (Inversion-only evolution)
To use the scatter plot to estimate the actual number of events (K):
1. Compute BP/2
2. From the curve, look up the corresponding valueof K
(1)
(2)
23
True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data
T.E.D. Estimator
Exact-IEBP [WABI’01]
Approx-IEBP [STOC’01]
EDE [ISMB’01]
Based on the Expectation of
Breakpoint distance (Exact)
Breakpoint distance (Approx.)
Inversion distance (Approx.)
Derivation Analytical Analytical Empirical
Model knowledge
Required Required Inversion-only
IEBP: Inverting the Expected BreakPoint distanceEDE: Empirically Derived Estimator
24
True Evolutionary Distance Estimators
Exact-IEBP vs K(120 genes)
(K: Actual number of inversions) (Inversion-only evolution)
BP vs K
25
Variance of True Evolutionary Distance Estimators
• There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences) − Weighbor [Bruno et al. 2000]
These methods use the variance of good t.e.d.’s, and yield more accurate trees than NJ.
• Variance estimates for the t.e.d.s [Wang WABI’02]− Weighbor(IEBP),
Weighbor(EDE) K vs Exact-IEBP (120 genes)
26
Using T.E.D. Helps
120 genes160 leavesUniformly random treeTranspositions/invertedtranspositions only(180 runs per figure)
5%
27
Observations
• EDE is the best distance estimator when used with NJ and Weighbor.
• True evolutionary distance estimators are reliable even when we do not know the GNT model parameters (the probability ratios of the three types of events).
28
Outline
• Genome rearrangement evolution• Genome rearrangement phylogeny reconstruction• Application• Other methods• Future research
29
Percentage of Trees Eliminated Through Bounding [ISMB’01]
edge length=2
# taxa 10 20 40 80 160
10 0 0 0 1% 1%
20 0 80% 91% 1% 1%
40 91% 100% 100% 100% 100%
80 99% 100% 100% 100% 100%
160 100% 100% 100% 100% 100%
320 100% 100% 100% 100% 100%
#genes Uses NJ(EDE) as starting tree
30
Campanulaceae cpDNA
• 13 taxa (tobacco as outlier)• 105 gene segments• GRAPPA finds 216 trees with shortest breakpoint
length (out of 654,729,075 trees)
• Running Time:− BPAnalysis takes 2 centuries on a Sun
workstation− GRAPPA takes 1.5 hours on a 512-node
supercluster− About 2300-fold speedup on a single node
31
Campanulaceae [Moret et al. ISMB 2001]
Strict consensus of 216 optimal trees found by GRAPPA
Tob
acco
Pla
tyco
don
Cya
nant
hus
Cod
onop
sis M
erci
era
Wah
lenb
ergi
a
Tri
odan
is
Asy
neum
a Leg
ousi
a
Sym
phan
dra
Ade
noph
ora
Cam
panu
la Tra
chel
ium
6 out of 10 max. edges found
32
Outline
• Genome rearrangement evolution• Genome rearrangement phylogeny reconstruction• Application• Other methods• Future research
33
“Fast” Approaches for Genome Rearrangement Phylogeny
• Basic technique: encode data as strings and apply maximum parsimony
• Running time exponential in the number of genomes, but polynomial in the number of genes (faster than GRAPPA)
• MPBE [ISMB’00]Maximum Parsimony using Binary Encodings
• MPME [Boore et al. Nature ’95, PSB’02]Maximum Parsimony using Multi-state Encodings
• The length of a tree using these two methods is a lowerbound of the true breakpoint length [Bryant ’01]
34
Maximum Parsimony using Binary Encoding (MPBE)
A: 1 2 3 4 = -4 –3 –2 –1B: 1 -4 -3 –2 = 2 3 4 -1C: 1 2 -3 –4 = 4 3 –2 -1
MPBE Strings
Input genome (circular)(1,2)
(2,3)
(3,4)
(4,1)
(1,-4)
(-2,1)
(2,-3)
(-3,-4)
(-4,1)
A: 1 1 1 1 0 0 0 0 0B: 0 1 1 0 1 1 0 0 0C: 1 0 0 0 0 0 1 1 1
35
Maximum Parsimony using Multistate Encoding (MPME)
MPME Strings
Input genome (circular)
A: 2 3 4 1 –4 –1 –2 -3B: -4 3 4 –1 2 1 –2 -3C: 2 –3 -2 3 4 –1 -4 1
1 2 3 4 -1 –2 –3 -4
A: 1 2 3 4 = -4 –3 –2 –1B: 1 -4 -3 –2 = 2 3 4 -1C: 1 2 -3 –4 = 4 3 –2 -1
We use PAUP to solve Maximum Parsimony
=> Constraint: number of states per site cannot exceed 32
36
NJ vs MP (120 genes, 160 genomes)
All three event types equiprobable(datasets that exceed 32-state limit for MPME are dropped)
37
Inversion Phylogeny
• Inversion median has higher running time than breakpoint median
• Inversion phylogeny overall has shorter running time than breakpoint phylogeny, and returns more accurate trees [Moret et al. WABI ’02]
38
• Disk-Covering Method: divide the original problem into subproblems [Huson, Nettles, Parida, Warnow and Yooseph, 1998]
• Uses inversion distance• DCM-GRAPPA: can now process thousands of
genomes, each having hundreds of genes
DCM-GRAPPA [Moret & Tang 2003]
39
Ongoing and Future Research
• Genome rearrangement phylogeny with unequal gene content (duplications, deletions, etc.)
• Non-uniform genome rearrangement models(Segment-length dependent model, hotspots)
40
Acknowledgements
• University of Texas Tandy Warnow (Advisor) Robert K. Jansen Stacia Wyman Luay Nakhleh Usman Roshan Cara Stockham Jerry Sun
• University of New Mexico Bernard M.E. Moret David Bader Jijun Tang Mi Yan
• Central Washington University Linda Raubeson