16. lecture ws 2004/05bioinformatics iii1 v16 – genome rearrangement important information –...

16. Lecture WS 2004/05

Bioinformatics III 1

V16 – genome rearrangement

Important information – contained in the order in which genes occur on the

genomes of different species – allows inferring phylogenetic relationships.

Together with phylogenetic information, ancestral gene order reconstructions

give some clues about the conservation of the functional organisation of genomes

towards a global knowledge of life evolution.

Often, phylogeny reconstruction techniques using gene order data rely on the

definition of an evolutionary distance between two gene orders.

These distances are usually computed as the minimal number of

rearrangement operations needed to transform one genome into another one.

Bergeron et al. WABI 2004, 14-25 (2004)




Most choices of rearrangements quickly lead to hard algorithmic problems.

Therefore, the set of operations is usually restricted to reversals, translocations,

fusions or fissions where linear-time algorithms were developed in the last

years.

However, this choice of rearrangement operations is more dictated by algorithm

necessity than by biological reality. E.g., in some genomes, transpositions and

inverted transpositions can be quite common.

A family of phylogenetic approaches labelled „distance-based“ methods relies on

pair-wise evolutionary distances which are then fed into an algorithm such as

neighbor-joining to infer tree topology and branch lengths.

These methods do not provide information about the putative ancestral gene

order.





Parsimony-based approaches attempt to identify the rearrangement scenario

(including tree topology and gene orders at the internal nodes) that minimizes the

number of evolutionary events required.

problem is computationally much more difficult than just computing distances.

Heuristic algorithms exist that use either breakpoint or reversal distances.

However, these methods only provide us with one (or a small number of) possible

hypothesis about ancestral gene orders, with no information about alternate

optimal or near-optimal solutions.

Today:

- quick look at the reversal distance problem again

- new method „sets of conserved intervals“ (Bergeron & Jens Stoye)




Breakpoint Graph

The breakpoint graph of a permutation is an edge-colored graph G() with

n + 2 vertices {0, 1 ... n, n+1} {0, 1, ..., n, n+1}. We join vertices i and i+1 by

a black edge for 0 i n. We join vertices i and j by a gray edge if i j.

Black path

0 2 3 1 4 6 5 7

Grey path

0 2 3 1 4 6 5 7

Superposition of black and grey paths formsthe breakpoint graph:

A breakpoint graph is obtained by a super-position of a black pathtraversing the vertices0, 1, ..., n, n+1 in the order given by the permutation and a graypath traversing the verticesin the order given by theidentity permutation.



Cycle decomposition

A cycle in an edge-colored graph G is called alternating if the colors of every two

consecutive edges of this cycle are distinct. In the following, cycles will mean

alternating cycles.

Cycle decomposition ofthe breakpoint graph:

0 2 3 1 4 6 5 7

0 2 3 1 4 6 5 7

0 2 3 1 4 6 5 7

0 2 3 1 4 6 5 7

A vertex v in a graph G is called balanced if the

number of black edges incident to v equals the

number of grey edges incident to v.

A balanced graph is a graph in which every

vertex is balanced. G() is a balanced graph.

Therefore, there exists a cycle decomposition

of G() into edge-disjoint alternating cycles

(every edge in the graph belongs to exactly one

cycle in the decomposition). Cycles in an edge

decomposition may be self-intersecting. The

previous breakpoint graph can be decomposed

into 4 cycles, one of which is self-intersecting.



Effects of reversals on cycles

(A) For reversals acting on two

cycles, (b – c) = 1.

(B) For reversals acting on an

unoriented cycle, (b – c) = 0.

(C) For reversals acting on an

oriented cycle, (b – c) = -1

Hannenvalli, Pevzner, Journal of the ACM 46, 1 (1999)



Cycle decomposition

What is the decomposition of the breakpoint graph into a maximum number c()

of edge-disjoint alternating cycles? Here, c() = 4.

Cycle decompositions play an important role in estimating reversal distances.

When a reversal is applied to a permutation, the number of cycles in a maximum

decomposition can change by at most one (while the number of breakpoints

can change by two).

Bafna&Pevzner (1996) proved the bound for the reversal distance d():

d() n + 1 - c()

which is much tighter than the bound in terms of breakpoints d() b() / 2.

For many biological problems, d() = n + 1 - c().

Therefore, the reversal distance problem reduces to the problem of finding

the maximal cycle decomposition.

Hurdles, Super-hurdles, fortresses ...



Alternative concept: conserved intervals

Bergeron & Stoye, Report 2003-01 Uni Bielefeld

Distrance matrices can be used as data for phylogenetic reconstruction, or to

reconstruct ancestral genomes.

However, all distances (except for the breakpoint distance) are closely tied to

initial choices of allowable rearrangement operations.

They are pure distances because similarities between genomes are ignored.

breakpoint distance is based on the notion of conserved adjacencies. These are

easy to compute, but breakpoint distance often fails to capture more global

relations between genomes.

A first generalization of adjacencies: common intervals that identify subsets of

genes that appear consecutively in two or more genomes.

Jens Stoye



Permutations, Gene Order, and Rearrangements


Assume that the genes of an organism are ordered and oriented along linear or

circular DNA molecules. E.g. mitochondrial genes in insects

Collapse 38 genes into set of 17 blocks. Genes in one block do not change order

between these species.

Distance approaches: focus on the difference between 2 particular genomes.

E.g. Fruit Fly differs from Mosquito by the reversal of gene 10, and the

transposition of genes 7 and 8.

count minimal number of reversals and/or transpositions

distance matrix for the set of species





breakpoint distance: counts the lost adjacencies between genomes.

E.g. given the circularity of the genomes, Fruit Fly and Mosquito have 12

conserved adjacencies and a breakpoint distance of 5.

E.g. the first 4 species of table 1 share 6 adjacencies:

[1,2], [2,3], [11,12], [15,16], [16,17], and [17,1].

When comparing all 6 species, [17,1] is the only left adjacency.





Observation: the 6 permutations are very „similar“.

E.g. the genes in the interval [1,12] are all the same, with small variations in their

ordering.

This is also true for the genes in the intervals [3,6], [6,9], [9,11], and [12,17].

Such intervals, together with conserved adjacencies play a fundamental role in

rearrangement and distance theories, ancestral genome reconstructions, and

phylogeny.

Family portrait of the conserved intervals of the permutations of table 1

Here, the elements that can be glued together to form larger objects are boxed in

rectangles.



Which arrangements are preferable?


All permutations of table 1 fit the representation with the following conventions

(1) free objects within a rectangle can be reordered, or can change sign

(2) connections between rectangles are fixed.

Consider 2 rearrangement scenarios that transform silkworm into Locust using a

minimal number of reversals

The two scenarios are fundamentally different, although both use 6 reversals.

The right one uses much longer reversals than the left one, and the right one

breaks conserved intervals between Silkworm and Locust in intermediate

permutations, namely [3,6], [1,12], and [12,17].

The right scenario looks highly suspicious.



Conserved intervalsDefinition 1 Let G be a set of signed permutations of n elements. An interval

[a,b] is a conserved interval of the set G if:

(1) either a precedes b, or –b precedes –a, in each permutation, and

(2) the sets of unsigned elements that appear between a and b is the same for

all permutations in G.

If [a,b] is a conserved interval, so is [-b,-a].

Consider 2 permutations

P = 1 2 3 7 5 6 -4 8

Q = 1 7 -3 -2 5 -6 -4 8

Here, [1,5] and [2,3] are conserved intervals, but not [1,6].

The other conserved intervals of P and Q are [1,-4], [1,8], [5,-4], [5,8], and [-4,8].

The diagram representation of these intervals is

1 2 3 7 5 6 -4 8



Conserved intervalsWhen the identity permutation is not in G, it is always possible to rename the

elements of G such that conserved intervals will be intervals of consecutive

elements.

E.g. if one composes the permutations P and Q of the example with the inverse

permutation P-1,

P‘ = P-1 o P = 1 2 3 4 5 6 7 8

Q‘ = P-1 o Q = 1 4 -3 -2 5 -6 7 8

or 1 2 3 4 5 6 7 8

Proposition 1 Let R be a permutation and G a set of permutations, denote by

R o G the set of permutations obtained by composing each permutation in G with

R. The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is

conserved in R o G.



Conserved intervalsProposition 1 Let R be a permutation and G a set of permutations, denote by

R o G the set of permutations obtained by composing each permutation in G with

R. The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is

conserved in R o G.Proof: if a permutation P is written as P = p1 p2 ... pn

then R o P is: R o P = R(p1) R(p2) ... R(pn)

If [a,b] is conserved in G, then each permutation in G has a consecutive block of elements

beginning with a and ending with b, or beginning with –b and ending with –a. These

properties hold also for the set R o G, if one replaces a by R(a) and b by R(b).

Some intervals, such as [1,7] for the set {P‘,Q‘} in the above example, are the

union of smaller intervals: [1,7 ] = [1,5] [5,7]. Intervals that are not unions are

specially useful.

Definition 2 Conserved intervals that are not the union of shorter conserved

intervals are called irreducible.

Sets of conserved intervals can be characterized by the set of irreducible intervals.



Irreducible conserved intervals

Proposition 2 Two different irreducible conserved intervals [a,b] and [c,d] of a

set G of permutations are either

1) disjoint

2) nested with different endpoints, or

3) overlapping on one element.

Proof. Wlog we can assume that G contains the identity permutation and that conserved

intervals are intervals of consecutive elements.

Suppose that [a,b] and [c,d] are nested with a = c and d < b. Since [c,d] is a conserved

interval, it contains all integers between c and d the interval [d,b] contains all integers

between d and b, and [a,b] is not irreducible.

If [a,b] and [c,d] overlap with more than one element, we can suppose

a < c < b < d. Since all elements between c and d are greater than c, then the interval

between a and c must contain all elements between a and c, thus [a,b] is not irreducible.



Conserved intervals

Overlapping irreducible intervals form chains linked by their successive common

elements. A chain of k-1 intervals [a1,a2] [a2,a3] ... [ak-1,ak] will be denoted simply

by its k links [a1,a2,a3 ... ak].

E.g. [1,5,7,8] is a chain of the set of conserved intervals of P‘ and Q‘.

A maximal chain is a chain that cannot be extended.

Proposition 3. Every irreducible conserved interval belongs to a unique maximal

chain.

Proof: By Prop. 2: if [a,b] is an irreducible conserved interval, then no other can

begin by a or end by b.

Maximal chains, as sets of links, together with isolated genes, form a partition of

the set of genes.



Conserved intervals

A set of permutations on n elements can have as many as n(n-1)/2 conserved

intervals, but at most n-1 irreducible intervals.

These bounds are achieved with sets containing only one permutation.

Proposition 4. Each maximal chain of k links contributes k(k-1)/2 to the total

number of conserved intervals.

Proof. Conserved intervals [a,b] are in bijection with chains of the form

[a, x1, ..., kx, b]

of irreducible intervals. Each maximal chain of k links has k(k-1)/2 such sub-chains.



Conserved intervals

Proposition 5 Let P be a permutation that is contained in both sets G1 and G2.

The interval [a,b] is a conserved interval of G = G1 G2 if and only if there exist

two chains of irreducible conserved intervals, with respect to P, with k 0, l 0:

[a, x1, ..., kx, b] in G1

[a, y1, ..., yl, b] in G2.

The interval [a,b] is irreducible if and only if {x1, ..., xk} and {y1, ..., yl} are disjoint.

Proof. The interval [a,b] is a conserved interval of G if and only if it is a conserved interval

in both G1 and G2, therefore there must exist chains beginning by a and ending by b for

both sets G1 and G2. If [a,b] is irreducible in G, and if [a,x] and [x,b] are conserved intervals

of G1, say, then x cannot belong to the set {y1, ..., yl}. If there is a common element x to

both sets {x1, ..., xk} and {y1, ..., yl}, then [a,b] = [a,x] [x,b] and both [a,x] and [x,b] are

conserved intervals of G.



Variable Geometry Genomes

Multi-chromosomal genomes can also be represented by permutations, with

special marks that identify different chromosomes.

E.g.

where each chromosome is on a separate line.

Even if the adjacency [5,6] is conserved between the 2 permutations, the first

genome does not even have those genes on the same chromosome.

In the case of multi-chromosomal genomes, conserved intervals [a,b] should

have the added requirement that a and b belong to the same chromosome, in

each genome.

The definition of conserved intervals can be adapted to other types of genomes

than single linear chromosomes. For circular genomes, one can always align all

permutations of the set beginning with gene +1.



Algorithms

Bergeron & Stoye present 3 algorithms:

(1) compute the conserved intervals of two permutations

(2) compute the conserved intervals of a set of permutations

(3) compute conserved intervals of two sets of permutations, directly from their

two individual sets of conserved intervals.

Conserved Intervals of 2 permutations are strongly related to the notion of

connected components of the overlap graph of a signed permutation.

Here: linear algorithm that identifies all irreducible intervals [a,b] of a permutation

with the identity permutation such that a > 0 and b > 0 in .

The case of negative endpoints is treated by reversing .

E.g. for the permutation P = 0 -4 -3 -2 5 8 6 7 9 -1 10

algorithm 1 identifies the positive irreducible intervals [6,7], [5,9], and [0,10].

It will identify [2,3] and [3,4] on the reversed permutation.



Algorithms

The algorithm assumes that the input permutation is in the form

= (0, 1, ..., n-1, n)

Mi: nearest unsigned element of the permutation that precedes i and is greater

than |i|.

Lemma 1 If [s,e] is a positive conserved interval of and the identity

permutation, then Ms = Me.

Algorithm uses two stacks: S contains the possible start positions of conserved

intervals, M contains possible candidates for Mi.

The top of S is always denoted by s. The top of M is always denoted by m.

Proposition 6 Algorithm 1 outputs the positive irreducible conserved intervals of

a permutation with the identity permutation in O(n) time.



Conserved intervals

Algorithm runs in linear time.



Similarity and distance

The number of conserved intervals of a set of permutations is a measure of

similarity, but can easily be transformed into a distance between two

permutations, or two sets of two permutations.

Definition 3 Let G1 and G2 be two permutations on n elements, with N1 and N2

conserved intervals. Let N be the number of conserved intervals in G1 G2.

The interval distance between G1 and G2 is then defined by:

d(G1,G2) = N1 + N2 – 2N

The interval distance satisfies the fundamental properties of a mathematical

distance, e.g. it fulfils the triangle inequality:

d(P,Q) + d(Q,R) d(P,R)




When comparing two permutations, the interval distance counts the total number

of intervals that are unique to one of them. E.g. the distance between

P = 0 1 2 3 4 5 6 7 8 9 10

Q = 0 5 -7 -6 8 9 1 2 3 4 10

is given by d(P,Q) = (1110)/2 +(1110)/2 – 2 11 = 88

The 2 measures sometimes disagree. The behavior of the interval distance

reflects that the length (number of genes) involved in a rearrangement operation

matters: short reversals are less disturbing than long ones.



Comparison with other distance measures

Breakpoint distance also gives different results than interval distances.

while the same results are obtained by transposition + reversal distances.




Proposition 7 Suppose that P and Q have n elements, then

(1) if P is obtained from Q by reversing k elements, then the interval distance

between P and Q is k (n – k);

(2) if P is obtained from Q by transposing two consecutive blocks of a and b

elements, then the interval distance between P and Q is (a+b)(n – (a+b)) + ab.

Because the interval distance is affected by length, one should question the

practice of collapsing identical strips of genes.

Why not use all available information?



Link with rearrangement theoriesCharacterize the rearrangement operations that preserve conserved intervals.

Definition 4. Let P and Q be two permutations, and a rearrangement operation

applied to P yielding P‘. We say that preserves the conserved intervals of P and

Q if the conserved intervals of {P,Q} are contained in those of {P‘,Q}.

Only rearrangements within blocks are preserving. Note that all operations, except

fusions, destroy some adjacencies that existed in the original permutation: the

number and nature of these adjacencies is a key concept.

Definition 5. Let be a rearrangement operation that transforms P into P‘.

A breakpoint of is a pair of elements that are adjacent in P but not in P‘.

Breakpoints are where one has to cut P in order to apply .

Reversals and translocations have 2 breakpoints, transpositions have 3, and

fissions have 1.



Link with rearrangement theories

Consider the irreducible intervals of P and P‘ with respect to P.

Adjacencies in P either belong to a (smallest) irreducible interval, or are free.

E.g. in this diagram

the adjacency (3,4) belongs to the interval [1,5], (2,3) belongs to [2,3], and (8,9)

is free.

When 2 adjacencies belong to the same irreducible interval, then none of these

adjacencies is conserved between P and P‘.




Theorem 3. Reversals, transpositions, and reverse transpositions are preserving

if and only if all their breakpoints belong to the same irreducible interval, or are

free. Translocations and fissions are preserving if and only if all their breakpoints

are free.

Proof. If the breakpoints of any operation are free, then no conserved interval is cut.

If the breakpoints of a reversal, transposition, or reverse transposition belong to the same

irreducible interval, then the operation reorders, or reverses, some blocks within that

interval, thus preserving conserved intervals.

If a reversal has its two breakpoints in different intervals, it will break those two intervals. If

it has only one free breakpoint, it will break the interval containing the other breakpoint.

The same kind of arguments hold for transpositions and reverse transpositions.

If a breakpoint of a translocation or fission is not free, then it belongs to an irreducible

interval whose extremities will end up in two different chromosomes.

It turns out that most rearrangement operations used in optimal scenarios are

indeed preserving.




E.g. (without proof)

Theorem 4. All the breakpoints of a cycle belong to the same irreducible interval.

In the sorting by reversals theory, a sorting reversal is defined as a reversal that

decreases the reversal distance by 1. The breakpoints of sorting reversals,

except one type called hurdle merging, belong to a single cycle.

Corollary 4. All sorting reversals, except hurdle merging, are preserving

Corollary 5. All transpositions that create two adjacencies are preserving.



Apply conserved intervals to reconstruct ancestor




Summary

Linear-time algorithms could be developed to minimize reversal distance

rearrangement scenarios.

Open question which distance measures (breakpoint distance, reversal distance,

interval distance ...) are most appropriate to compare genome architectures.

Experimental evidence provides new insights which types of rearrangements

have likely occurred in the past need to adopt algorithms to the biological

reality.

Concept of „conserved intervals“ sounds very promising – can account for

arbitrary types of rearrangements.

16. lecture ws 2004/05bioinformatics iii1 v16 – genome rearrangement important information –...

Documents

gene order data

edgecolored graph g

ancestral gene orders

putative ancestral gene

phylogenetic information

rearrangement scenario

number of black edges

black pathtraversing