Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants Yufeng Wu and Dan Gusfield UC Davis CPM 2007


Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants

Yufeng Wu and Dan Gusfield

UC Davis

CPM 2007


Recombination

• Recombination: one of the principal genetic forces shaping sequence variation within species.

• Two equal length sequences generate a new equal length sequence.

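Parent 1: 110001111111001 (contributes the prefix 11000)

Parent 2: 000110000001111 (contributes the suffix 0000001111)

Recombinant: 11000 0000001111 (breakpoint after position 5)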
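A minimal sketch of the single-crossover operation on this slide, assuming sequences are Python strings of 0s and 1s. The function name `recombine` and the parameter `cut` are illustrative, not from the talk.

```python
def recombine(left_parent: str, right_parent: str, cut: int) -> str:
    """Return the prefix of left_parent (first `cut` sites) joined to the
    suffix of right_parent (remaining sites). `cut` is the breakpoint."""
    if len(left_parent) != len(right_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(left_parent):
        raise ValueError("cut must lie within the sequence")
    return left_parent[:cut] + right_parent[cut:]


# The two sequences from the slide, with the breakpoint after position 5.
child = recombine("110001111111001", "000110000001111", 5)
print(child)
```

The child has the same length as its parents, as the slide states.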


Founders and Mosaic

• Current sequences are descendants of a small number of founders.

– A current sequence is composed of blocks from the founders, due to recombination.

– No mutations have occurred since the formation of the founders.

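[Figure: two founders, 000000 and 111111, recombine step by step (breakpoints marked), first yielding 001111 and then 111100. Sampled sequences in the current population: 000000, 001111, 111100, 011100. The mosaic shows each sampled sequence divided into blocks according to the founder each block comes from.]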


The Minimum Mosaic Problem

• Given a set of aligned binary sequences from the current population, and assuming the number of founders is known to be Kf, find a set of founders and a mosaic with the minimum number of breakpoints.

Input sequences: 1101101, 1010001, 0111111, 0110100, 1100011

Assume Kf = 3

Three founders: 1101111, 1010001, 0110100

Four breakpoints: the minimum over all possible sets of three founders


Status of the Minimum Mosaic Problem

• First studied by E. Ukkonen (WABI 2002).

– Dynamic programming method. Not practical when the number of rows is more than 20 and Kf > 2.

• No polynomial-time algorithm was known, even when Kf is small. No NP-completeness result is known.

• Our results:

– A simple polynomial-time algorithm for the Kf = 2 case.

– An exact and practical method for medium-sized data when Kf ≥ 3.


The Two-Founder Case

Input sequences: 110111101, 100100101, 010111111, 010101100, 110000111

Remove uniform columns: 1111101, 1010001, 0111111, 0110100, 1100011

Study pairs of neighboring columns.

Key: at columns 1 and 2, the founders are either 00/11 or 01/10.

There are two rows with 00/11 and three rows with 01/10. So at least two breakpoints are needed between columns 1 and 2, with founders 01/10.

Likewise, founders 00/11 give 2 breakpoints between c2 and c3.


The Two-Founder Case (Cont.)

No matter which founder states are chosen for the previous column, we can always choose the needed founders for the current column.

[Figure: local founders for columns c1 to c7; the number of breakpoints between neighboring columns is 2, 2, 2, 1, 2, 2.]

At least 2 + 2 + 2 + 1 + 2 + 2 = 11 breakpoints are needed.

On the other hand, we can construct two founders that use the same locally optimal founders, so 11 breakpoints is the global optimum.
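The counting argument for two founders can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `two_founder_minimum` is hypothetical, rows are strings of 0s and 1s, and founder A is arbitrarily started at 0. For each pair of neighboring non-uniform columns it counts the rows that read 00/11 and the rows that read 01/10, charges the smaller count, and extends the founders with the majority pattern.

```python
def two_founder_minimum(rows):
    """Per-gap breakpoint counts and a matching founder pair (Kf = 2).

    The founders are returned over the non-uniform columns only.
    """
    # Uniform columns are dropped first, as on the slide.
    cols = [c for c in zip(*rows) if len(set(c)) > 1]
    if not cols:
        return [], ("", "")
    flip = {"0": "1", "1": "0"}
    founder_a = ["0"]  # arbitrary choice: founder A starts with 0
    per_gap = []
    for left, right in zip(cols, cols[1:]):
        same = sum(a == b for a, b in zip(left, right))  # rows reading 00 or 11
        diff = len(rows) - same                          # rows reading 01 or 10
        per_gap.append(min(same, diff))
        # Follow the majority pattern: keep the bit for 00/11, flip it for 01/10.
        prev = founder_a[-1]
        founder_a.append(prev if same >= diff else flip[prev])
    founder_b = [flip[b] for b in founder_a]  # the second founder is the complement
    return per_gap, ("".join(founder_a), "".join(founder_b))


rows = ["110111101", "100100101", "010111111", "010101100", "110000111"]
per_gap, founders = two_founder_minimum(rows)
print(per_gap, sum(per_gap), founders)
```

On the slide's five sequences this reproduces the per-gap counts 2, 2, 2, 1, 2, 2 and the total of 11.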


Three or More Founders: Assuming Known Founders

Input sequences: 1101101, 1010001, 0111111, 0110100, 1100011

Three founders: 1101111, 1010001, 0110100

With known founders, we can minimize the breakpoints for each sequence separately, and thus also minimize the total number of breakpoints.

For each input sequence, starting from the left, insert a breakpoint at the end of the longest segment that matches one founder, then repeat from that position.

Founder mapping: for each position c in an input sequence s, the founder from which s[c] takes its value.

[Figure: founder mapping of the input sequence 1101101, which follows Founder 1 and then Founder 2, with one breakpoint.]
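A minimal sketch of the founder-known method described on this slide, assuming sequences and founders are equal-length strings of 0s and 1s. The name `greedy_mapping` is illustrative. It returns both the breakpoint count and the founder mapping (one founder index per position).

```python
def greedy_mapping(seq, founders):
    """Greedy longest-match scan of one input sequence.

    Returns (number of breakpoints, founder index used at each position).
    """
    pos, segments, mapping = 0, 0, []
    while pos < len(seq):
        best_len, best_f = 0, None
        for f, founder in enumerate(founders):
            n = 0
            while pos + n < len(seq) and founder[pos + n] == seq[pos + n]:
                n += 1
            if n > best_len:
                best_len, best_f = n, f
        if best_f is None:
            raise ValueError(f"no founder matches position {pos}")
        mapping.extend([best_f] * best_len)  # this segment is copied from best_f
        pos += best_len
        segments += 1
    return max(segments - 1, 0), mapping


inputs = ["1101101", "1010001", "0111111", "0110100", "1100011"]
founders = ["1101111", "1010001", "0110100"]
results = [greedy_mapping(s, founders) for s in inputs]
per_row = [count for count, _ in results]
print(per_row, sum(per_row))
```

With the three founders shown on the slides, the per-row counts sum to the four breakpoints quoted earlier.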


Enumerating Founders for Founder-Unknown Case

In reality, the founders are not known. A straightforward approach is to enumerate all possible sets of founders and run the previous method on each to find the minimum mosaic.

At each column, there are 2^Kf - 2 founder settings. For Kf = 3, these are 100, 001, 011, 101, 110 and 010.

Let m be the number of columns. Fully enumerating all possible sets of founders takes O(2^(m*Kf)) time, which is infeasible when m or Kf is large.

More ideas are needed to develop a practical method. First, we do the enumeration in the form of search paths in a search tree.
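The straightforward enumeration can be sketched as below, purely to show why it does not scale. The names `column_settings`, `count_breakpoints` and `brute_force_minimum` are illustrative, and the sketch assumes the input has no uniform columns. It tries every combination of non-uniform column settings and scores each founder set with the greedy founder-known scan.

```python
from itertools import product


def column_settings(kf):
    """All 0/1 assignments to kf founders at one column, minus all-0 and all-1."""
    return [s for s in product("01", repeat=kf) if len(set(s)) > 1]


def count_breakpoints(seq, founders):
    """Greedy longest-match breakpoint count for one sequence."""
    pos, segments = 0, 0
    while pos < len(seq):
        longest = 0
        for founder in founders:
            n = 0
            while pos + n < len(seq) and founder[pos + n] == seq[pos + n]:
                n += 1
            longest = max(longest, n)
        if longest == 0:
            raise ValueError(f"no founder matches position {pos}")
        pos += longest
        segments += 1
    return max(segments - 1, 0)


def brute_force_minimum(rows, kf):
    """Try every founder set; exponential in (columns * kf), toy inputs only."""
    best_cost, best_founders = None, None
    for columns in product(column_settings(kf), repeat=len(rows[0])):
        founders = ["".join(col[f] for col in columns) for f in range(kf)]
        cost = sum(count_breakpoints(r, founders) for r in rows)
        if best_cost is None or cost < best_cost:
            best_cost, best_founders = cost, founders
    return best_cost, best_founders


print(len(column_settings(3)))
toy_cost, toy_founders = brute_force_minimum(["00", "11", "01"], 2)
print(toy_cost, toy_founders)
```

Even this toy loop visits (2^Kf - 2)^m founder sets, which is why the later slides turn to a search tree with pruning.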


Search Paths and Search Tree

It works, but the number of search paths blows up exponentially!

An obvious idea to reduce the search space: branch and bound (compute a lower bound and …).

But we found that a different idea is more useful.

[Figure: search tree for three founders over columns c1, c2 and c3. Each node shows the founder setting at its column and the total number of breakpoints up to that column; a search path gives the founder settings up to the current column, for example up to column 3.]

On-line computation: compute the partial solution up to the current column for speedup.

The founder-known method can be run with partially known founders!
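One way to picture the on-line computation is a per-row table that is updated each time a search path is extended by one column. This is a sketch under my own assumptions rather than the authors' data structure: `state[r][f]` holds the fewest breakpoints for row r over the columns seen so far, given that the row ends on founder f. The names `start_state`, `extend` and `total_breakpoints` are hypothetical.

```python
INF = float("inf")


def start_state(n_rows, kf):
    """Before any column is fixed, every row may start on any founder for free."""
    return [[0] * kf for _ in range(n_rows)]


def extend(state, column_bits, setting):
    """Advance all rows by one column.

    column_bits[r] is row r's bit at the new column; setting[f] is founder f's bit.
    """
    new_state = []
    for costs, bit in zip(state, column_bits):
        cheapest = min(costs)
        new_state.append([
            # Stay on founder f for free, or switch to it for one breakpoint.
            min(costs[f], cheapest + 1) if setting[f] == bit else INF
            for f in range(len(setting))
        ])
    return new_state


def total_breakpoints(state):
    return sum(min(costs) for costs in state)


inputs = ["1101101", "1010001", "0111111", "0110100", "1100011"]
founders = ["1101111", "1010001", "0110100"]
state = start_state(len(inputs), len(founders))
running = []
for c in range(len(inputs[0])):
    state = extend(state, [s[c] for s in inputs], [f[c] for f in founders])
    running.append(total_breakpoints(state))
print(running)
```

Each extension only touches the new column, so a search path can report its total so far without rescanning the earlier columns.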


Dropping Search Paths that are Beaten by Another Search Path

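Assume three founders.

[Figure: two founder configurations, P1 and P2, up to column 2. P1 has 0 breakpoints so far and P2 has 6.]

P1 and P2 are two search paths up to column 2.

Can we say P1 is better than P2? Not really, because P2 may lead to fewer breakpoints later on.

But suppose the number of input sequences is 5. We can then say P1 beats P2 (and so drop P2). Why?

Suppose an optimal search path following P2 has 40 breakpoints in total. Following P1 instead and then continuing on the same path adds at most one breakpoint per input sequence (<= 5 bkpts) while saving the 6 breakpoints of P2, so the result has <= 39 breakpoints.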
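The basic beaten rule reduces to one comparison. The sketch below assumes each search path is summarized only by its breakpoint total so far; the name `beats_simple` is illustrative. Switching from P1 onto any continuation of P2 costs at most one extra breakpoint per input row, so P1 beats P2 whenever its total plus the number of rows does not exceed P2's total.

```python
def beats_simple(cost_p1: int, cost_p2: int, n_rows: int) -> bool:
    """True if P2 can be dropped: P1 plus one extra breakpoint per row
    is still no worse than P2, whatever columns follow."""
    return cost_p1 + n_rows <= cost_p2


# Reading the slide's figure as P1 = 0 and P2 = 6 breakpoints, with 5 input rows.
print(beats_simple(0, 6, 5))
# With P2 = 4 instead, this rule alone is not enough.
print(beats_simple(0, 4, 5))
```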


A More Powerful Beaten Rule

[Figure: search paths P1, with 0 breakpoints, and P2, with 4 breakpoints, each with its founder mapping (MatchRows) for rows 1 to 5.]

Still five input rows. Now we cannot say P1 beats P2 by the previous rule. But remember that we have the founder mapping…

Rows 2 and 4 have the same founder mappings in P1 and P2, so no extra breakpoints are needed at rows 2 and 4: if there is no breakpoint following P2, there is none following P1 either.

So P1 beats P2, since at most 3 rows need extra breakpoints to get onto a path from P2, and P2 uses 4 more breakpoints than P1.
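A simplified sketch of the stronger rule follows. It rests on an assumption of mine about how a founder mapping is summarized: for each row, the set of founder indices on which a minimum-breakpoint mapping of the columns so far can end. Rows whose sets are identical in P1 and P2 need no extra breakpoint; every other row is charged one. The name `beats_with_mappings` and the example sets are hypothetical, and the rule in the paper may be sharper than this.

```python
def beats_with_mappings(cost_p1, cost_p2, ends_p1, ends_p2):
    """True if P2 can be dropped.

    ends_pX[r] is the set of founders on which row r may end under path X
    (an assumed summary of the founder mapping at the current column).
    """
    rows_needing_extra = sum(1 for e1, e2 in zip(ends_p1, ends_p2) if e1 != e2)
    return cost_p1 + rows_needing_extra <= cost_p2


# Hypothetical end sets for five rows; rows 2 and 4 agree between P1 and P2.
ends_p1 = [{0}, {1}, {2}, {0, 2}, {1}]
ends_p2 = [{1}, {1}, {0}, {0, 2}, {2}]
print(beats_with_mappings(0, 4, ends_p1, ends_p2))
```

With P1 at 0 breakpoints and P2 at 4, only three rows are charged, so P1 beats P2 even though the basic rule (0 + 5 > 4) could not decide.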


How Practical Is Our Method?

Source of data and image: UNC Chapel Hill

Five founders

20 rows, 36 columns

UNC’s heuristic solution: 54 breakpoints

Enumerating 2^180 founder states is impossible!

Our method takes 5 minutes to find an optimal solution: 53 breakpoints. It is also practical for a 50x50 matrix with four founders.


Open Problems and Software

• Is the minimum mosaic problem NP-complete?

• Is there a polynomial-time algorithm for the minimum mosaic problem for a small number of founders (say, three to ten)?

• Software available at: http://wwwcsif.cs.ucdavis.edu/~wuyu

• Thank you.