genome rearrangements csci 7000-005: computational genomics debra goldberg [email protected]

62
Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg [email protected]

Post on 21-Dec-2015

235 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Genome Rearrangements

CSCI 7000-005: Computational Genomics

Debra Goldberg

[email protected]

Page 2: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage

• Share a recent common ancestor

• Look and taste different

Page 3: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage

• Comparing mtDNA gene sequences yields no evolutionary information

• 99% similarity between genes

• These surprisingly identical gene sequences differed in gene order

• This study helped pave the way to analyzing genome rearrangements in molecular evolution

Page 4: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage: Different mtDNA Gene Order

• Gene order comparison:

Page 5: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage: Different mtDNA Gene Order

• Gene order comparison:

Page 6: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage: Different mtDNA Gene Order

• Gene order comparison:

Page 7: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage: Different mtDNA Gene Order

• Gene order comparison:

Page 8: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Turnip vs Cabbage: Gene Order Comparison

Before

After

• Evolution is manifested as the divergence in gene order

Page 9: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Transforming Cabbage into Turnip

Page 10: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reversals1 32

4

10

56

8

9

7

1, 2, 3, -8, -7, -6, -5, -4, 9, 10

Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks

1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Blocks represent conserved genes.

Page 11: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Types of Mutations

Page 12: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Types of Rearrangements

Reversal1 2 3 4 5 6 1 2 -5 -4 -3 6

Page 13: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Types of Rearrangements

Translocation1 2 3 44 5 6

1 2 6 4 5 3

Page 14: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Types of Rearrangements

1 2 3 4 5 6

1 2 3 4 5 6

Fusion

Fission

Page 15: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

• What are the similarity blocks and how to find them?

• What is the architecture of the ancestral genome?

• What is the evolutionary scenario for transforming one genome into the other?

Unknown ancestor~ 75 million years ago

Mouse (X chrom.)

Human (X chrom.)

Genome rearrangements

Page 16: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Why do we care?

Page 17: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

SKY (spectral karyotyping)

Page 18: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Robertsonian Translocation

13 14

Page 19: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Robertsonian Translocation

• Translocation of chromosomes 13 and 14

• No net gain or loss of genetic material: normal phenotype.

• Increased risk for an abnormal child or spontaneous pregnancy loss

13 14

Page 20: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Philadelphia Chromosome

9

22

Page 21: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Philadelphia Chromosome

• Seen in about 90% of patients with Chronic myelogenous leukemia (CML)

• A translocation between chromosomes 9 and 22 (part of 22 is attached to 9)

9 22

Page 22: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Inversion

16

Page 23: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Inversion

• Chromosome on the right is inverted near the centromere

16

Page 24: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Colon cancer

Page 25: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Colon Cancer

Page 26: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Comparative Mapping: Mouse vs Human Genome

• Humans and mice have similar genomes, but their genes are ordered differently

• ~245 rearrangements– Reversals

– Fusions

– Fissions

– Translocations

Page 27: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Waardenburg’s Syndrome: Mouse Provides Insight into Human

Genetic Disorder

• Characterized by pigmentary dysphasia• Gene implicated linked to human

chromosome 2 • It was not clear

where exactly on chromosome 2

Page 28: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Waardenburg’s syndrome and splotch mice

• A breed of mice (with splotch gene) had similar symptoms caused by the same type of gene as in humans

• Scientists identified location of gene responsible for disorder in mice

• Finding the gene in mice gives clues to where same gene is located in humans

Page 29: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reversals: Example

= 1 2 3 4 5 6 7 8

1 2 5 4 3 6 7 8

Page 30: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reversals: Example

= 1 2 3 4 5 6 7 8

1 2 5 4 3 6 7 8

1 2 5 4 6 3 7 8

Page 31: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reversals and Gene Orders

• Gene order represented by permutation 1------ i-1 i i+1 ------j-1 j j+1 -----n

1------ i-1 j j-1 ------i+1 i j+1 -----n

Reversal ( i, j ) reverses (flips) the elements from i to j in

,j)

Page 32: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reversal Distance Problem

• Goal: Given two permutations, find shortest series of reversals to transform one into another

• Input: Permutations and

• Output: A series of reversals 1,…t

transforming into such that t is minimum

• t - reversal distance between and

• d(, ) = smallest possible value of t, given

Page 33: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Sorting By Reversals Problem

• Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n )

• Input: Permutation

• Output: A series of reversals 1, … t transforming into the identity permutation such that t is minimum

• min t =d( ) = reversal distance of

Page 34: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Sorting by reversalsExample: 5 steps

Step 0: 2 -4 -3 5 -8 -7 -6 1Step 1: 2 3 4 5 -8 -7 -6 1Step 2: 2 3 4 5 6 7 8 1Step 3: 2 3 4 5 6 7 8 -1Step 4: -8 -7 -6 -5 -4 -3 -2 -1Step 5: g 1 2 3 4 5 6 7 8

Page 35: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Sorting by reversalsExample: 4 steps

Step 0: 2 -4 -3 5 -8 -7 -6 1Step 1: 2 3 4 5 -8 -7 -6 1Step 2: -5 -4 -3 -2 -8 -7 -6 1Step 3: -5 -4 -3 -2 -1 6 7 8Step 4: g 1 2 3 4 5 6 7 8

What is the reversal distance for this permutation? Can it be sorted in 3 steps?

Page 36: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Pancake Flipping Problem• Chef prepares unordered

stack of pancakes of different sizes

• The waiter wants to sort (rearrange) them, smallest on top, largest at bottom

• He does it by flipping over several from the top, repeating this as many times as necessary

Christos Papadimitrou and Bill Gates flip pancakes

Page 37: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Sorting By Reversals: A Greedy Algorithm

• Example: permutation = 1 2 3 6 4 5

• First three elements are already in order

• prefix() = length of already sorted prefix

– prefix() = 3

• Idea: increase prefix() at every step

Page 38: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

• Doing so, can be sorted

1 2 3 6 4 5

1 2 3 4 6 5

1 2 3 4 5 6

• Number of steps to sort permutation of length n is at most (n – 1)

Greedy Algorithm: An Example

Page 39: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Greedy Algorithm: Pseudocode

SimpleReversalSort()1 for i 1 to n – 12 j position of element i in (i.e., j = i)

3 if j ≠i4 * (i, j)5 output 6 if is the identity permutation 7 return

Page 40: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Analyzing SimpleReversalSort

• SimpleReversalSort does not guarantee the smallest number of reversals and takes five steps on = 6 1 2 3 4 5 :

•Step 1: 1 6 2 3 4 5•Step 2: 1 2 6 3 4 5 •Step 3: 1 2 3 6 4 5•Step 4: 1 2 3 4 6 5•Step 5: 1 2 3 4 5 6

Page 41: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

• But it can be sorted in two steps:

= 6 1 2 3 4 5 – Step 1: 5 4 3 2 1 6 – Step 2: 1 2 3 4 5 6

• So, SimpleReversalSort() is not optimal

• Optimal algorithms are unknown for many problems; approximation algorithms used

Analyzing SimpleReversalSort (cont’d)

Page 42: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Approximation Algorithms

• These algorithms find approximate solutions rather than optimal solutions

• The approximation ratio of an algorithm A on input is: A() / OPT()where

A() -solution produced by algorithm A OPT() - optimal solution of the problem

Page 43: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Approximation Ratio / Performance Guarantee

• Approximation ratio (performance guarantee) of algorithm A: max approximation ratio of all inputs of size n

– For algorithm A that minimizes objective function (minimization algorithm):

•max|| = n A() / OPT()

– For maximization algorithm:

•min|| = n A() / OPT()

Page 44: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

= 23…n-1n

• A pair of elements i and i + 1 are adjacent if

i+1 = i + 1

• For example:

= 1 9 3 4 7 8 2 6 5

• (3, 4) or (7, 8) and (6,5) are adjacent pairs

Adjacencies and Breakpoints

Page 45: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

There is a breakpoint between any adjacent element that are non-consecutive:

= 1 9 3 4 7 8 2 6 5

• Pairs (1,9), (9,3), (4,7), (8,2) and (2,5) form breakpoints of permutation

• b() - # breakpoints in permutation

Breakpoints: An Example

Page 46: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Adjacency & Breakpoints

•An adjacency – consecutive

•A breakpoint – not consecutive

π = 5 6 2 1 3 4 0 5 6 2 1 3 4 7

adjacencies

breakpoints

Extend π with π0 = 0 and πn+1 = n+1

Page 47: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

• Add 0 =0 and n + 1=n+1 at ends of

Example:

Extending with 0 and 10

Note: A new breakpoint was created after extending

Extending Permutations

= 1 9 3 4 7 8 2 6 5

= 0 1 9 3 4 7 8 2 6 5 10

Page 48: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Each reversal eliminates at most 2 breakpoints.

= 2 3 1 4 6 5

0 2 3 1 4 6 5 7 b() = 5

0 1 3 2 4 6 5 7 b() = 4

0 1 2 3 4 6 5 7 b() = 2

0 1 2 3 4 5 6 7 b() = 0

Reversal Distance and Breakpoints

This implies: reversal distance ≥ #breakpoints / 2

Page 49: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Sorting By Reversals: A Better Greedy Algorithm

BreakPointReversalSort()

1 while b() > 02 Among all possible reversals,

choose reversal minimizing b( • )3 • (i, j)4 output 5 return

Problem: this algorithm may work forever

Page 50: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Strips

• Strip: an interval between two consecutive breakpoints in a permutation

– Decreasing strip: elements in decreasing order (e.g. 6 5)

– Increasing strip: elements in increasing order (e.g. 7 8)

0 1 9 4 3 7 8 2 5 6 10

– Consider single-element strips decreasing except strips 0 and n+1 are increasing

Page 51: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reducing the Number of Breakpoints

Theorem 1:

If permutation contains at least one decreasing strip, then there exists a reversal which decreases the number of breakpoints (i.e. b(• ) < b() )

Page 52: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Things To Consider

• For = 1 4 6 5 7 8 3 2

0 1 4 6 5 7 8 3 2 9 b() = 5

– Choose decreasing strip with the smallest element k in (k = 2 in this case)

Page 53: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Things To Consider

• For = 1 4 6 5 7 8 3 2

0 1 4 6 5 7 8 3 2 9 b() = 5

– Choose decreasing strip with the smallest element k in (k = 2 in this case)

Page 54: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Things To Consider

• For = 1 4 6 5 7 8 3 2

0 1 4 6 5 7 8 3 2 9 b() = 5

– Choose decreasing strip with the smallest element k in (k = 2 in this case)

– Find k – 1 in the permutation

Page 55: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Things To Consider

• For = 1 4 6 5 7 8 3 2

0 1 4 6 5 7 8 3 2 9 b() = 5

– Choose decreasing strip with the smallest element k in (k = 2 in this case)

– Find k – 1 in the permutation

– Reverse segment between k and k-1: 0 1 2 3 8 7 5 6 4 9 b() = 4

Page 56: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Reducing the Number of Breakpoints Again

– If there is no decreasing strip, there may be no reversal that reduces the number of breakpoints (i.e. b(•) ≥ b() for any reversal ).

– By reversing an increasing strip ( # of breakpoints stay unchanged ), we will create a decreasing strip at the next step. Then the number of breakpoints will be reduced in the next step (theorem 1).

Page 57: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Things To Consider (cont’d)

• There are no decreasing strips in , for:

= 0 1 2 5 6 7 3 4 8 b() = 3

•(3,4) = 0 1 2 5 6 7 4 3 8 b() = 3

(3,4) does not change the # of breakpoints(3,4) creates a decreasing strip thus

guaranteeing that the next step will decrease the # of breakpoints.

Page 58: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

ImprovedBreakpointReversalSortImprovedBreakpointReversalSort()1 while b() > 02 if has a decreasing strip3 Among all possible reversals, choose reversal

that minimizes b( • )4 else5 Choose a reversal that flips an increasing

strip in 6 • 7 output 8 return

Page 59: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

• ImprovedBreakPointReversalSort is an approximation algorithm with a performance guarantee of at most 4– It eliminates at least one breakpoint in every

two steps; at most 2b() steps– Approximation ratio: 2b() / d()– Optimal algorithm eliminates at most 2

breakpoints in every step: d() b() / 2– Performance guarantee:

( 2b() / d() ) [ 2b() / (b() / 2) ] = 4

Performance Guarantee

Page 60: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Signed Permutations

• Up to this point, reversal sort algorithms sorted unsigned permutations

• But genes have directions… so we should consider signed permutations

5’ 3’

= 1 -2 - 3 4 -5

Page 61: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Signed Permutation

• Genes are directed fragments of DNA

• Genes in the same position but different orientations do not have same gene order

• These two permutations are not equivalent gene sequences

1 2 3 4 5

-1 2 -3 -4 -5

Page 62: Genome Rearrangements CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Signed permutations are easier!

Polynomial time (optimal) algorithm is known