A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Zhihong Ding, Vladimir Filkov, Dan Gusfield. Department of Computer Science, University of California, Davis. RECOMB 2005.


TRANSCRIPT

1

A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Zhihong Ding, Vladimir Filkov, Dan Gusfield

Department of Computer Science, University of California, Davis

RECOMB 2005

2

Haplotypes to Genotypes

Each individual has two “copies” of each chromosome.

At each site, each chromosome has one of two states, denoted by 0 and 1.

From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, then the genotype has state 0. The same rule holds for state 1. If the two haplotypes have states 0 and 1, or 1 and 0, then the state of the genotype is 2.

3

Haplotypes to Genotypes

Sites:                          1 2 3 4 5 6 7 8 9

Two haplotypes per individual:  0 1 1 1 0 0 1 1 0
                                1 1 0 1 0 0 1 0 0

Merge the haplotypes.

Genotype for the individual:    2 1 2 1 0 0 1 2 0
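The merge rule can be checked mechanically. The sketch below is illustrative only; the function name `merge_haplotypes` is an assumption and does not come from the talk. It applies the rule to the two haplotypes shown on this slide.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes of equal length into a 0/1/2 genotype."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states carry over unchanged; differing states become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide (sites 1 to 9).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The result matches the genotype row on the slide.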

4

Genotypes to Haplotypes

Two haplotypes per individual:  0 1 1 1 0 0 1 1 0
                                1 1 0 1 0 0 1 0 0

Genotype for the individual:    2 1 2 1 0 0 1 2 0

For each site, if the genotype has state 0, both haplotypes must have state 0; if it has state 1, both must have state 1. If the genotype has state 2, the two haplotypes have states 0, 1 or 1, 0.

5

Haplotype Inference Problem

For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is harder and more expensive to collect than genotype data.

Haplotype Inference Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes.

The NIH leads the HapMap project to find common haplotypes in the human population.

6

Haplotype Inference Problem

If the genotype has state 2 at k sites, there are 2^(k-1) possible explaining haplotype pairs.

How do we determine which haplotype pair is the original one that generated the genotype?

We need a model of haplotype evolution to help solve the haplotype inference problem.
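The count of explaining pairs can be confirmed by enumeration. This is a minimal sketch, assuming unordered pairs and k >= 1; the name `explaining_pairs` is hypothetical. Each of the k heterozygous sites can be resolved two ways, and swapping the two haplotypes gives the same pair, which is where the halving comes from.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return every unordered haplotype pair that merges into `genotype`."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site] = bit
            h2[site] = 1 - bit
        # A frozenset makes (h1, h2) and (h2, h1) the same pair.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]  # k = 3 sites with state 2
pairs = explaining_pairs(genotype)
print(len(pairs))  # 4, i.e. 2^(3-1)
```

Without a model of evolution, nothing distinguishes the four candidates; that is the gap the perfect phylogeny model fills.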

7

The Perfect Phylogeny Model of Haplotype Evolution

[Figure: a perfect phylogeny over sites 1-5. The ancestral haplotype 00000 is at the root, site mutations 1-5 label the edges, and the extant haplotypes at the leaves are 10100, 10000, 01011, 00010 and 01010.]

8

Assumptions of the Perfect Phylogeny Model

No recombination, only mutation.

Infinite-sites assumption: one mutation per site.
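Under these two assumptions, and with an all-zero ancestral haplotype, a standard criterion (not stated on the slides) says that a set of binary haplotypes fits a perfect phylogeny exactly when, for every pair of sites, the sets of haplotypes carrying the mutation are either disjoint or nested. The sketch below applies that check to the five leaf haplotypes of the tree on the preceding slide; the function name is an assumption.

```python
from itertools import combinations


def fits_rooted_perfect_phylogeny(haplotypes):
    """Pairwise check, assuming the ancestral haplotype is all zeros."""
    n_sites = len(haplotypes[0])
    # For each site, the set of haplotypes (by index) that carry state 1.
    carriers = [
        {row for row, h in enumerate(haplotypes) if h[site] == "1"}
        for site in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        if a & b and not (a <= b or b <= a):
            return False  # overlapping but not nested: no single tree fits
    return True


leaves = ["10100", "10000", "01011", "00010", "01010"]
print(fits_rooted_perfect_phylogeny(leaves))              # True
print(fits_rooted_perfect_phylogeny(leaves + ["11000"]))  # False
```

The extra haplotype 11000 carries mutations 1 and 2 together, which would require one of them to have occurred twice or a recombination to have taken place.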

9

The Perfect Phylogeny Haplotyping (PPH) Problem

Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix

Site  1  2
a     2  2
b     0  2
c     1  0

Haplotype matrix

Site  1  2
a     1  0
a     0  1
b     0  0
b     0  1
c     1  0
c     1  0

[Figure: the perfect phylogeny for the haplotype matrix. The root is 00; the edge labeled 1 leads to haplotype 10 (leaves c, c, a); the edge labeled 2 leads to haplotype 01 (leaves a, b); the remaining leaf b has haplotype 00.]
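For an instance this small, the PPH problem can be solved by exhaustive search, which makes the problem statement concrete. The sketch below is a brute-force illustration, not the linear-time algorithm of the talk: it tries every way of resolving each genotype and keeps the combinations that pass the pairwise disjoint-or-nested check. It assumes an all-zero root, as in the slide's figure, and all function names are hypothetical.

```python
from itertools import combinations, product


def resolutions(genotype):
    """All unordered haplotype pairs explaining one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    out = []
    # Pin the first heterozygous site to 0 in h1 so each pair appears once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        assign = ((0,) + bits) if het else ()
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, assign):
            h1[site], h2[site] = bit, 1 - bit
        out.append((tuple(h1), tuple(h2)))
    return out


def rooted_perfect_phylogeny(haplotypes):
    """Pairwise disjoint-or-nested check, assuming an all-zero root."""
    carriers = [
        {row for row, h in enumerate(haplotypes) if h[site] == 1}
        for site in range(len(haplotypes[0]))
    ]
    return all(not (a & b) or a <= b or b <= a
               for a, b in combinations(carriers, 2))


def brute_force_pph(genotypes):
    """Exponential-time search; only sensible for toy inputs."""
    solutions = []
    for choice in product(*(resolutions(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if rooted_perfect_phylogeny(haplotypes):
            solutions.append(choice)
    return solutions


genotypes = [(2, 2), (0, 2), (1, 0)]  # rows a, b, c of the genotype matrix
solutions = brute_force_pph(genotypes)
for pair in solutions[0]:
    print(pair)
```

On the slide's genotype matrix the search finds a single solution, in which genotype a is resolved as 01 and 10, agreeing with the haplotype matrix shown. The other resolution of a (00 and 11) puts mutations 1 and 2 on the same haplotype, which conflicts with b and c.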

10

Prior Work

Several existing algorithms solve the PPH problem, but none of them runs in linear time.

Our contribution: a linear-time algorithm.

Our implementation is about 250 times faster than the fastest previous algorithm on large data sets.

11

A P-Class of PPH Solutions

[Figure: a genotype matrix with genotypes a, b, c, d over sites 1-5, next to one PPH solution: a tree rooted at "root" with edges labeled 1-5 and leaves labeled a,d; a,c; b,d; b,c.]

P-Class: maximum common subgraph in all PPH solutions.

Each P-Class consists of two subtrees.

12

P-Class Property of PPH Solutions

All PPH solutions can be obtained by choosing how to flip each P-Class.

[Figure: "One PPH Solution" and "Second PPH Solution", two trees rooted at "root" with edges labeled 1-5 and leaves labeled a,d; a,c; b,c; b,d, with the switching points marked in each.]

13

The Key Theorem

Every PPH solution can be obtained by choosing a flip for each P-Class.

Conversely, after fixing one P-Class, every distinct choice of flips of P-Classes leads to a distinct PPH solution.

If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
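The count follows directly from the theorem: one P-Class is held fixed and each of the remaining k - 1 classes is either flipped or not. A small sketch that enumerates those choices, with a hypothetical function name and booleans standing in for "flipped":

```python
from itertools import product


def flip_choices(k):
    """All flip assignments for k P-Classes with the first class held fixed."""
    if k < 1:
        raise ValueError("need at least one P-Class")
    # The first class is never flipped; the others vary freely.
    return [(False,) + rest for rest in product((False, True), repeat=k - 1)]


choices = flip_choices(3)
print(len(choices))  # 4, i.e. 2^(3-1)
```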

14

Shadow Tree

Contains classes.

Each class in the shadow tree is a subgraph of a P-Class.

Merging classes results in larger classes; classes are never split.

Contains tree edges and shadow edges.

15

The Algorithm

Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.

The genotype matrix contains only entries of value 0 and 2.

16

Overview of the Algorithm for One Row

Procedure FirstPath

Procedure SecondPath

Procedure FixTree

Procedure NewEntries

17

OldEntryList

OldEntryList: the column indices that have an entry of value 2 in this row and also have an entry of value 2 in some previous row.

[Figure: the genotype matrix with row 3 marked.]

OldEntryList for row 3: 1, 2, 3, 5
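The definition translates into a single pass over the rows that remembers which columns have already contained a 2. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the matrix is NOT the one from the slide (its values did not survive extraction) but a made-up 0/2 matrix chosen so that row 3 yields the list 1, 2, 3, 5 quoted above.

```python
def old_entry_lists(matrix):
    """For each row, list the 1-based columns holding a 2 here and in an earlier row."""
    seen = set()  # columns that held a 2 in some previous row
    result = []
    for row in matrix:
        twos = [j + 1 for j, value in enumerate(row) if value == 2]
        result.append([j for j in twos if j in seen])
        seen.update(twos)
    return result


# Hypothetical example matrix (rows are genotypes, columns are sites 1 to 5).
matrix = [
    [2, 0, 2, 0, 2],
    [0, 2, 0, 2, 0],
    [2, 2, 2, 0, 2],
    [0, 0, 2, 2, 0],
]
for row_number, columns in enumerate(old_entry_lists(matrix), start=1):
    print(row_number, columns)
```

Columns with a 2 in the current row that are not in OldEntryList are the row's new entries.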

18

Procedures FirstPath and SecondPath

FirstPath: construct a first path towards the root of the shadow tree that passes through the tree edges of as many columns in OldEntryList as possible.

SecondPath: construct a second path towards the root of the shadow tree that passes through the tree edges of the columns in OldEntryList that are not on the first path.

19

Shadow Tree After Processing the First Two Rows

[Figure: the genotype matrix with row 3 marked, and the shadow tree rooted at "root" with edges labeled 1-5.]

OldEntryList for row 3: 1, 2, 3, 5

20

Algorithm – FirstPath

[Figure: the shadow tree rooted at "root" with edges labeled 1-5, annotated with the OldEntryList and CheckList for row 3.]

Edges 4 and 5 cannot be on the same path to the root in any PPH solution.

21

Algorithm – SecondPath

[Figure: the shadow tree rooted at "root" with edges labeled 1-5, annotated with the CheckList for row 3.]

OldEntryList: 1, 2, 3, 5

22

Shadow Tree to PPH Solutions

[Figure: the genotype matrix (genotypes a, b, c, d; sites 1-5), the final shadow tree, and one PPH solution, a tree with edges labeled 1-5.]

23

Shadow Tree to PPH Solutions

[Figure: the final shadow tree and a second PPH solution, a tree with edges labeled 1-5 and leaves labeled a,d; b,c; b,d; a,c.]

24

Implementation – Leaf Count

Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.

L[i] is the number of leaves below mutation i in every perfect phylogeny for the genotype matrix.

Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.

             1  2  3  4
a            1  1  0  0
b            0  2  2  0
c            2  0  2  0
d            2  0  0  2
Leaf count:  4  3  2  1
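The leaf counts for the example matrix can be reproduced in a few lines. This is a minimal sketch with a hypothetical function name: a 1 means both haplotypes of that genotype carry the mutation (two leaves), a 2 means exactly one does (one leaf).

```python
def leaf_counts(matrix):
    """Leaf count per column: number of 2's plus twice the number of 1's."""
    weight = {0: 0, 1: 2, 2: 1}
    # zip(*matrix) iterates over the columns of the row-major matrix.
    return [sum(weight[value] for value in column) for column in zip(*matrix)]


# Rows a, b, c, d of the genotype matrix on this slide.
matrix = [
    [1, 1, 0, 0],
    [0, 2, 2, 0],
    [2, 0, 2, 0],
    [2, 0, 0, 2],
]
print(leaf_counts(matrix))  # [4, 3, 2, 1]
```

Because the counts are the same in every solution, they can be computed once up front and used to order the edges along any root path.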

25

Time Complexity

A constant number of simple operations on each edge per row.

Each traversal of the shadow tree goes through O(m) edges.

The algorithm does a constant number of traversals of the shadow tree for each row.

Total time: O(nm), where n and m are the number of rows and columns in the genotype matrix.

26

Results

Average running times (seconds)

Sites (m)   Individuals (n)   Datasets   DPPH, O(nm^2)   Our alg., O(nm)
300         150               30         1.07            0.05
500         250               30         5.72            0.13
1000        500               30         45.85           0.48
2000        1000              10         467.18          1.89
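Dividing the two timing columns shows how the gap widens with problem size, which is what the O(nm^2) versus O(nm) bounds predict. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# (sites m, individuals n, DPPH seconds, linear-time algorithm seconds)
timings = [
    (300, 150, 1.07, 0.05),
    (500, 250, 5.72, 0.13),
    (1000, 500, 45.85, 0.48),
    (2000, 1000, 467.18, 1.89),
]

speedups = [round(dpph / ours, 1) for _, _, dpph, ours in timings]
for (m, n, _, _), speedup in zip(timings, speedups):
    print(f"m={m}, n={n}: {speedup}x faster")
```

The largest instance gives a ratio of roughly 247, consistent with the "about 250 times faster" figure quoted earlier in the talk.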

27

Thank you!

Paper and program can be downloaded at:

http://wwwcsif.cs.ucdavis.edu/~gusfield/lpph/