some evolutionary tree reconstruction problems in computational biology

Some evolutionary tree reconstruction problems in computational biology Chen Yen Hung Taipei Municipal University of Education http://www.gfxtra.com/dl/texture+tree+

Upload: sutton

Post on 23-Feb-2016


Some evolutionary tree reconstruction problems in computational biology
Chen Yen Hung
Taipei Municipal University of Education
http://www.gfxtra.com/dl/texture+tree+pines

Outline
Introduction to Bioinformatics and Computational Biology
Introduction to the evolutionary tree reconstruction problems
Tree Alignment Problems
Steiner Tree Problems (STP)
Full Sibling Reconstruction problems
Our algorithms for solving these problems
Conclusions

Algorithms (Program), Parallel algorithms, Database, Files, Memory
Networks, Communication, Mobile phone

Computational Biology and Bioinformatics
http://commons.wikimedia.org/wiki/File:Rat_eating_or_praying%3F.jpg
http://commons.wikimedia.org/wiki/File:Alien.png

Computational Biology and Bioinformatics
Bioinformatics is the application of statistics, applied mathematics and computer science to solve biological problems.
Computational biology is an interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems. The main focus lies on developing mathematical modeling, computational simulation techniques, and algorithm design and analysis.
http://en.wikipedia.org/wiki/Bioinformatics
http://en.wikipedia.org/wiki/Computational_biology

DNA (bases A, G, C, T)
Crick and Watson, Nature, 1953
Ref: http://web1.nsc.gov.tw/ct.aspx?xItem=8270&ctNode=40&mp=1
http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/molecularmachine.jpg


Celera 1 6,500 2000 3 8 67
Sequence/Structure Similarity
Sequence/Structure Homology
Functional Conservation
Phylogenetic Analysis

Protein Structure
Red: helix; yellow: sheet; other: coil

: Viagra

http://commons.wikimedia.org/wiki/File:Pancreatic_lipase%E2%80%93colipase_complex_with_inhibitor_1LPB.png

Protein Structure
Proteomics Center: Amazon EC2, Amazon S3 (proteomics)
ViPDAC (http://proteomics.mcw.edu/vipdac): ViPDAC web interface, spectra in .mgf, Amazon S3, Amazon EC2 cluster
http://www.bcrc.firdi.org.tw/detail_news.do?newsid=213931386

Drug Design
http://commons.wikimedia.org/wiki/File:Aldose_reductase_1us0.png
http://commons.wikimedia.org/wiki/File:Quinacrine_mustard_in_Trypanothione_reductase_active_site.png
http://www.iidmm.uct.ac.za/sturrock/research.htm

What do Computer Scientists do?model(/)

(NP-hard)

What do Computer Scientists do?Ref: / http://web1.nsc.gov.tw/ct.aspx?xItem=8273&ctNode=40&mp=1

Sequencing ProblemInput :

Output : AGACTAGTCTGTATAGACTAGCCT

Reduction: Maximum independent set in the interval graph

Intractable problems
We will learn how to mathematically characterize the difficulty of computational problems.
There is a class of problems that can be solved in a reasonable amount of time and another class that cannot. (What good is it for a problem to be solvable if it cannot be solved in the lifetime of the universe?)
The field of cryptography, for example, relies on the fact that the computational problem of breaking a code is intractable.

CMI Millennium Prize
On May 24, 2000, in Paris, the Clay Mathematics Institute (Cambridge, MA, USA) offered US$1,000,000 for each of seven open problems:
Birch and Swinnerton-Dyer Conjecture | Hodge Conjecture | Navier-Stokes Equations | P vs NP | Poincaré Conjecture | Riemann Hypothesis | Yang-Mills Theory
http://www.csie.ntu.edu.tw/~hil/teach.html

Grigory Perelman (born 1966), Russian mathematician
Solved the Poincaré Conjecture in 2003
Fields medalist, 2006; declined to accept the award
http://www.csie.ntu.edu.tw/~hil/teach.html
http://elsecretodezara.blogspot.com/2008/09/el-hombre-mas-inteligente-del-mundo.html

How hard is NP-complete?
So far, scientists only know how to solve an NP-complete problem in O(c^n) time for some constant c.
http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt
http://www.iis.sinica.edu.tw/~hil/wput/approx-mcu-2004.ppt

On the bright side
If you can come up with an algorithm for such a problem that runs in O(n^1000000) time, then you will win the Turing Award for sure, plus US$1,000,000 from CMI.

http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt

NP-completeness = Dead end ?http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt

http://activerain.com/blogsview/108625/realtor-killed-in-the-basement
NP-completeness = Dead end?
Or maybe we can settle for good algorithms?
1. Heuristic algorithms
2. Approximation algorithms
3. Randomized algorithms
http://www2.ee.ntu.edu.tw/~yen/courses/al-01/approximation.ppt

Approximation
An algorithm that returns an answer C (with cost C) which is close to the optimal solution C* (with cost C*) is called an approximation algorithm.
Closeness is usually measured by the ratio bound ρ(n) the algorithm produces, a function that satisfies, for any input size n, max{C/C*, C*/C} ≤ ρ(n).
Minimization problem: C* ≤ C ≤ ρ(n)·C*, ρ(n) > 1
Maximization problem: C ≤ C* ≤ ρ(n)·C, ρ(n) > 1
http://www2.ee.ntu.edu.tw/~yen/courses/al-01/approximation.ppt

Approximation Algorithm
Criterion 1: feasibility. Always output a feasible solution.
Criterion 2: tractability. Always run in polynomial time.
Criterion 3: quality. The solution's quality is always provably not too far from that of an optimal solution.
http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt

Evolutionary Tree Reconstruction in Biology
http://commons.wikimedia.org/wiki/File:Human-evolution.jpg
http://commons.wikimedia.org/wiki/File:Phylogenetic_tree_of_Theropods_respiratory_system_01.JPG

Evolutionary Tree Reconstruction in Biology
Goal: reconstruct the evolutionary tree of extinct species from present-day species.
The tree structure can be given (inferred or taken from previously known data) or unknown.
The tree structure can be rooted or unrooted.

http://insystemicthinking.wordpress.com/2007/12/11/funny-you-dont-look-different/
http://darwinshealthclass.blogspot.com/2011/02/my-evolution.html
http://www.niu.edu/pubaffairs/releases/2000/mar/primate/tree.html

Phylogenetic (Evolutionary) Tree Reconstruction Problem
Input : present-day species (DNA sequences) and, optionally, a tree structure

Output : an evolutionary tree

Distance of two species (sequences): evolutionary time or some distance metrics such as Hamming distance, Levenshtein (Edit) distance.

Goal : Depends on the distance metrics (Ex: MiniMax (Bottleneck), MiniSum, MiniSize)

Tree Alignment Problem (TAP)
Input : a set W of n sequences (strings) and a tree structure T with n leaves, each of which is labeled with a unique sequence in W

Output : a sequence label for each internal vertex of T.

The distance of an edge of the tree is defined as the edit distance between the two sequences that label its two endpoints.
Goal : find a tree alignment such that the sum of the edit distances of all its edges is minimized.

Levenshtein (Edit) distance
An alphabet Σ is a non-empty set of symbols. Given an alphabet Σ, Σ* denotes the set of all finite-length sequences of symbols from Σ. Given two sequences w and w′, we say w rewrites into w′ in one step if one of the following correction rules holds:

(1) w = axb, w′ = ab, with a, b ∈ Σ*, x ∈ Σ (single-symbol deletion of x from w)
(2) w = ab, w′ = axb, with a, b ∈ Σ*, x ∈ Σ (single-symbol insertion of x into w)
(3) w = axb, w′ = ayb, with a, b ∈ Σ*, x, y ∈ Σ, x ≠ y (single-symbol substitution)

The Levenshtein (edit) distance between w and w′, denoted d(w, w′), is the smallest number of steps in which w rewrites into w′.

Ex: d(xyxx, xxy)=2. A deletion of y from xyxx plus a substitution of last x by y.
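The rewriting definition above is usually evaluated with a dynamic program rather than by searching over rewrite sequences. The following is a minimal sketch of that standard dynamic program; the function name `edit_distance` is an illustrative choice, not something taken from the slides.

```python
def edit_distance(w: str, v: str) -> int:
    """Levenshtein distance between w and v (unit-cost deletion,
    insertion and substitution), computed row by row."""
    # prev[j] = distance between the processed prefix of w and v[:j]
    prev = list(range(len(v) + 1))
    for i, a in enumerate(w, start=1):
        curr = [i]  # distance from w[:i] to the empty string
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from w
                curr[j - 1] + 1,         # insert b into w
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = curr
    return prev[-1]


# The example from the slide: one deletion plus one substitution.
print(edit_distance("xyxx", "xxy"))
```

The table has (|w|+1)(|v|+1) entries, which is where the L^2 factors in running times stated per pair of sequences of length at most L typically come from.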

Bottleneck Tree Alignment Problem (BTAP)
Input : a set W of n sequences (strings) and a tree structure T with n leaves, each of which is labeled with a unique sequence in W

Output : a sequence label for each internal vertex of T.

Goal : find a tree alignment such that the largest edit distance over its edges is minimized.
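To make the two objectives concrete, the sketch below evaluates a given labeling of a tree under both the min-sum (TAP) and the min-max (BTAP) goal. It only scores a labeling; it does not search for an optimal one. The tree topology and the assignment of sequences to the leaves w1..w6 are hypothetical, chosen for illustration; the sequences and the internal labels wa, wb, wc are borrowed from the example in the talk.

```python
def edit_distance(w, v):
    """Standard Levenshtein dynamic program (unit costs)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(w, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def alignment_costs(edges, label):
    """Return (sum of edge distances, largest edge distance)."""
    dists = [edit_distance(label[u], label[v]) for u, v in edges]
    return sum(dists), max(dists)


# Hypothetical leaf assignment and topology (assumptions, not from the slides).
label = {
    "w1": "TGC", "w2": "ATGC", "w3": "A",
    "w4": "TGCG", "w5": "AATT", "w6": "TTATT",
    "wa": "TGC", "wb": "TATT", "wc": "TGT",   # internal labels
}
edges = [("w1", "wa"), ("w2", "wa"), ("w5", "wb"), ("w6", "wb"),
         ("w3", "wc"), ("w4", "wc"), ("wa", "wc"), ("wb", "wc")]

total, bottleneck = alignment_costs(edges, label)
print("TAP objective (sum):", total)
print("BTAP objective (max):", bottleneck)
```

The same labeling can be good for one objective and poor for the other, which is why the two problems are treated separately.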

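An Example of the Tree Alignment
W = {TGC, ATGC, A, TGCG, AATT, TTATT}
Distance : Levenshtein (edit) distance
[Figure: tree T with leaves w1, ..., w6 and internal vertices wa, wb, wc; each edge is annotated with its edit distance, together with the total distance and the bottleneck distance of each solution.]
TAP solution: wa = TGC, wb = TATT, wc = TGT
BTAP solution: wa = AGC, wb = TATT, wc = AGT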

Our results
The bottleneck tree alignment problem is NP-complete.

An O(n^3 + n^2 L^2)-time algorithm for the bottleneck lifted tree alignment problem, where L is the maximum length of the sequences in W.

An O(n L^2)-time algorithm for the bottleneck tree alignment problem when the distance function is an ultrametric.

Steiner Tree Problem in Graphs
Input: a graph G = (V, E) with a length function d: E → R+ and a set of terminals R ⊆ V
Output: a tree T of G spanning all vertices in R
Objective: minimize the length of T (min-sum)

An Example of STP
G = (V, E), R = {a, b, g, f}
[Figure, shown on two slides: a graph on the vertices a, b, c, d, e, f, g, h, i with edge lengths; the legend distinguishes Steiner vertices from terminal vertices.]

Other Purpose
Multicasting
http://www.cisco.com/en/US/tech/tk828/tech_brief09186a00800a4415.html
Finding an optimal solution of the STP is NP-complete.

Previous Results for STP
Constant-ratio approximation algorithms:
2 [Hwang, Richards & Winter, 1992]
11/6 by Zelikovsky [Algorithmica '93]
16/9 by Berman & Ramaiyer [Journal of Algorithms '94]
1.73 by Borchers & Du [SIAM Journal on Computing '97]
5/3 by Prömel & Steger [Journal of Algorithms '00]
1.64 by Karpinski & Zelikovsky [Journal of Combinatorial Optimization '97]
1.59 by Hougardy & Prömel [SODA '99]
1 + (ln 3)/2 ≈ 1.55 by Robins & Zelikovsky [SIAM DM '05]
ln 4 + ε < 1.39 by Byrka, Grandoni, Rothvoß & Sanità [STOC '10]
Cannot be approximated better than 1.006, by Thimm [TCS '03]
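As an illustration of the simplest entry in the list (ratio 2), the sketch below computes the length of a minimum spanning tree of the terminals in the metric closure of G. This is the classic shortest-path/MST heuristic, shown here only as a sketch: it returns the length bound, not the tree itself, and it assumes a connected graph with non-negative lengths. The function names and the adjacency-dictionary format are illustrative assumptions.

```python
import heapq


def shortest_paths(adj, src):
    """Dijkstra from src; adj maps vertex -> {neighbor: length}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def steiner_mst_bound(adj, terminals):
    """Length of an MST of the terminals under shortest-path distances.
    This is at most twice the length of an optimal Steiner tree."""
    terminals = list(terminals)
    closure = {t: shortest_paths(adj, t) for t in terminals}
    in_tree = {terminals[0]}
    total = 0.0
    while len(in_tree) < len(terminals):  # Prim's algorithm on the closure
        length, nxt = min((closure[u][v], v)
                          for u in in_tree
                          for v in terminals if v not in in_tree)
        in_tree.add(nxt)
        total += length
    return total


# A star: Steiner vertex s in the middle, terminals a, b, c around it.
star = {"s": {"a": 1, "b": 1, "c": 1},
        "a": {"s": 1}, "b": {"s": 1}, "c": {"s": 1}}
print(steiner_mst_bound(star, ["a", "b", "c"]))
```

On the star, the optimal Steiner tree uses s and has length 3, while the heuristic pays 4 because it only sees terminal-to-terminal distances; the better ratios in the list come from using Steiner vertices more cleverly.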

COCOON 2003 @ Big Sky MT
Terminal Steiner Tree Problem
Terminal Steiner tree: a Steiner tree with all terminals as its leaves
Terminal Steiner tree problem (TSTP):
Input: a complete graph G = (V, E) with a length function d: E → R+ and a set of terminals R ⊆ V
Output: a terminal Steiner tree T for R in G
Objective: minimize the length of T (min-sum)

An Example of TSTP
G = (V, E): a complete graph, R = {a, b, g, f}
[Figure, shown on two slides: a graph on the vertices a, b, c, d, e, f, g, h, i with edge lengths; the legend distinguishes Steiner vertices from terminal vertices.]
Finding an optimal solution is NP-complete.

Previous Results for TSTP
Performance ratios of approximation algorithms:
ρ + 2 by Lin & Xue [IPL '02] for the case in which the length function is metric (i.e., satisfies the triangle inequality), where ρ is the best-known performance ratio for the STP
8/5 by Lu, Tang & Lee [TCS '03] for the special case in which the edge lengths are either 1 or 2
Our result for TSTP (with metric length function): 2ρ-approximation algorithm [COCOON '03]
2ρ − ρ/(3ρ − 2) ≈ 2.515-approximation algorithm by Martinez, Pina & Soares [TCS '07]

Bottleneck Steiner Tree Problem
Input: a graph G = (V, E) with a length function d: E → R and a set of terminals R ⊆ V
Output: a tree T of G spanning all vertices in R
Objective: minimize the length of the largest edge in T (min-max)
Previous time complexity:
O(|N|^2 + |E| log log |E|) time by Chiang, Sarrafzadeh & Wong [IEEE on CAD '90]
O(|E|) time by Duin & Volgenant [EJOR '97]
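A simple way to see why the min-max objective is much easier than the min-sum one: the optimal bottleneck value is the smallest threshold at which all terminals fall into one connected component when only edges of length at most the threshold are kept. The sketch below finds it with a Kruskal-style scan and a union-find structure. It is not the O(|E|) algorithm of Duin & Volgenant cited above; sorting alone costs O(|E| log |E|), and the connectivity check per edge is kept naive for readability. All names are illustrative.

```python
def bottleneck_steiner_value(edges, terminals):
    """edges: iterable of (u, v, length). Returns the smallest possible
    length of the largest edge in a tree spanning all terminals,
    or None if the terminals cannot be connected."""
    terminals = list(terminals)
    if len(terminals) <= 1:
        return 0
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, w in sorted(edges, key=lambda e: e[2]):
        parent[find(u)] = find(v)          # merge the two components
        root = find(terminals[0])
        if all(find(t) == root for t in terminals):
            return w                       # last edge added is the bottleneck
    return None


# Going through the Steiner vertex s lowers the largest edge from 6 to 5,
# although the direct edge a-b is the shorter tree in total length.
edges = [("a", "s", 5), ("s", "b", 2), ("a", "b", 6)]
print(bottleneck_steiner_value(edges, ["a", "b"]))
```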

Bottleneck Terminal Steiner Tree Problem (BTSTP)
Input: a complete graph G = (V, E) with a length function d: E → R+ and a set of terminals R ⊆ V
Output: a full Steiner tree T for R in G (a Steiner tree in which every terminal is a leaf)
Objective: minimize the length of the largest edge in T (min-max)
Our result for BTSTP: O(|E| log |E|) time

An Example of BTSTP
[Figure: an input graph with edge lengths, followed by the optimal solution for TSTP and the optimal solution for BTSTP.]

Approximation Algorithm for TSTP
We assume that G contains no edge between any two terminals.
We apply the best-known approximation algorithm for the STP and obtain a Steiner tree SAPX for R in G.
If all vertices of R are leaves in SAPX, then SAPX is a terminal Steiner tree of G; otherwise, we use the following Algorithm 1 to transform it into a terminal Steiner tree.

Some Notation for Algorithm 1
NG(r): the set of the neighbors of r ∈ R in G
Note that the members of NG(r) are all Steiner vertices, because we assume that G contains no edge between any two terminals.
NGS(r): the nearest neighbor of r in G, that is, d(r, NGS(r)) = min{d(r, v) | v ∈ NG(r)}.
D(NGS(R)): the sum of the lengths d(r, NGS(r)) over all r ∈ R
[Figure: the example graph on the vertices a to i with edge lengths.]

Some Notation for Algorithm 1
N(r): the set of the neighbors of r ∈ R in SAPX
Note that the members of N(r) are all Steiner vertices, because we assume that G contains no edge between any two terminals.
N1(r): the nearest neighbor of r in SAPX, that is, d(r, N1(r)) = min{d(r, v) | v ∈ N(r)}.
N2(r): the second nearest neighbor of r in SAPX
[Figure: example of N(r) and N1(r) for a terminal r with incident edge lengths 1, 2, 3.]

Algorithm 1
/* To transform SAPX into a terminal Steiner tree */
For each r ∈ R with |N(r)| ≥ 2 in SAPX do
  Remove all the edges in star(r) \ {(r, N1(r))} from SAPX
    star(r): the subtree of SAPX induced by {(r, v) | v ∈ N(r)}
  Find a minimum spanning tree MST(N(r)) of G[N(r)] and add all the edges of MST(N(r)) into SAPX
    G[N(r)]: the subgraph of G induced by N(r)
End For
[Figure: two terminals r1 and r2 with star(r1), star(r2), N1(r1) and N1(r2).]
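The transformation in Algorithm 1 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the tree is an adjacency dictionary of sets, `d` is a metric on the vertices of the complete graph G, no two terminals are adjacent, and the minimum spanning tree of N(r) is built with a simple quadratic Prim step. The function name and the line-position metric in the example are hypothetical.

```python
def to_terminal_steiner_tree(tree, terminals, d):
    """Make every terminal a leaf. tree: vertex -> set of neighbors
    (modified in place and returned); d(u, v): metric length in G."""
    for r in terminals:
        nbrs = sorted(tree[r], key=lambda v: d(r, v))
        if len(nbrs) < 2:
            continue                      # r is already a leaf
        # Keep only the edge to the nearest neighbor N1(r).
        for v in nbrs[1:]:
            tree[r].discard(v)
            tree[v].discard(r)
        # Reconnect the former neighbors by an MST of G[N(r)] (Prim).
        connected = [nbrs[0]]
        remaining = set(nbrs[1:])
        while remaining:
            u, v = min(((u, v) for u in connected for v in remaining),
                       key=lambda e: d(*e))
            tree[u].add(v)
            tree[v].add(u)
            connected.append(v)
            remaining.remove(v)
    return tree


# Hypothetical instance: vertices on a line, d = absolute difference.
pos = {"a": 0, "s1": 1, "s2": -2, "b": -3}
d = lambda u, v: abs(pos[u] - pos[v])
steiner_tree = {"a": {"s1", "s2"}, "s1": {"a"},
                "s2": {"a", "b"}, "b": {"s2"}}
result = to_terminal_steiner_tree(steiner_tree, ["a", "b"], d)
print({v: sorted(n) for v, n in result.items()})
```

In the example, terminal a had two neighbors; after the transformation it hangs off its nearest neighbor s1, and the edge s1-s2 replaces the removed edge a-s2.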

Approximation Ratio 1
Let Topt and Sopt be the optimal terminal Steiner tree and the optimal Steiner tree in G, respectively.
len(Sopt) ≤ len(Topt), since Topt is also a Steiner tree.
len(SAPX) ≤ ρ·len(Sopt), since SAPX is obtained by the currently best-known approximation algorithm for the STP, whose performance ratio is ρ.
len(MST(N(r))) ≤ 2·len(star(r)) − d(r, N1(r)) − d(r, N2(r)), by the triangle inequality.
len(TAPX1) = len(SAPX) + Σ_{r ∈ R} (len(MST(N(r))) − len(star(r)) + d(r, N1(r))) ≤ 2ρ·len(Sopt) − D(NGS(R))
[Figure: two terminals r1 and r2 with star(r1), N1(r1) and N1(r2).]

Approximation Ratio 2
[Figure: the example graph on the vertices a to i, with the edges incident to terminals annotated +3*1 or +3*2.]
New length function:
d′(r, v) = d(r, v) + 3·d(r, NGS(r)), if r ∈ R and v ∈ NG(r)
d′(u, v) = d(u, v), otherwise

Algorithm 2
/* To transform SAPX into a terminal Steiner tree */
For each r ∈ R with |N(r)| ≥ 2 in SAPX do
  Remove all the edges in star(r) from SAPX
    star(r): the subtree of SAPX induced by {(r, v) | v ∈ N(r)}
  Find a minimum spanning tree MST(N(r)) of G[N(r)] and add all the edges of MST(N(r)) into SAPX
    G[N(r)]: the subgraph of G induced by N(r)
  Add the edge (r, NGS(r))
End For
[Figure: two terminals r1 and r2 with star(r1), star(r2), NGS(r1) and NGS(r2).]

Approximation Ratio 2len(TAPX2) len(Sopt)+3D(NGS(R))

Approximation Ratio 3len(TAPX2) len(Sopt)+3D(NGS(R))len(TAPX1) 2 len(Sopt)-D(NGS(R))Select a minimum length terminal Steiner tree TAPX between TAPX1 and TAPX2When D(NGS(R))= ()/ (3-2) len(Sopt)len(TAPX) 2 len(Sopt)- ()/ (3-2) len(Sopt)

Full Sibling Reconstruction
Given: children in a wild population without known parents
Task: group them into brothers and sisters (siblings)


Problem: The Full Sibling Reconstruction Problem


Individual \ Locus   locus1    locus2
1                    149/167   243/255
2                    149/155   245/267
3                    149/177   245/253
4                    155/155   253/253
5                    149/155   245/267
6                    149/155   245/277
7                    149/151   251/255
8                    149/173   255/255
Group1 = {1, 7, 8}, Group2 = {2, 3, 4, 5, 6}

Group2 (individuals 2, 3, 4, 5, 6):
Locus 1: 3 alleles = {149, 155, 177}
Locus 2: 4 alleles = {245, 253, 267, 277}
Group1 (individuals 1, 7, 8):
Locus 1: 4 alleles = {149, 151, 167, 173}
Locus 2: 3 alleles = {243, 251, 255}

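Mendelian Constraints
4-allele rule: a group of siblings has at most 4 distinct alleles at each locus (each genotype is written a/b).
Example: 155/155, 149/155, 149/151, 149/173 use the alleles {155, 149, 151, 173}, i.e., 4 alleles.

2-allele rule (half sibling): 155/155, 149/155, 149/151
a = {149, 155}: 2 alleles
b = {155, 151}: 2 alleles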

Full Sibling Reconstruction
Simple Mendelian inheritance rules

father (...,...),(p,q),(...,...),(...,...) (...,...),(r,s),(...,...),(...,...) mother

(...,...),(...,...),(...,...),(...,...) child

Siblings: two children with the same parents
Question: given a set of children, can we find the sibling groups?
[Figure labels: locus, allele, one from father, one from mother]
http://www.cs.uic.edu/~dasgupta/slides.html

weaker enforcement of Mendelian inheritance

4-allele property

father (...,...),(p,q),(...,...),(...,...) (...,...),(r,s),(...,...),(...,...) mother

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)
Siblings: at each locus, one allele comes from the father and one from the mother, so there are at most 4 alleles in this locus.
http://www.cs.uic.edu/~dasgupta/slides.html

stricter enforcement of Mendelian inheritance

2-allele property

father (...,...),(p,q),(...,...),(...,...) (...,...),(r,s),(...,...),(...,...) mother

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)

(...,...), (...,...), (...,...), (...,...)
Siblings: if we reorder each pair so that the left allele is from the father and the right allele is from the mother, then the left column of the locus has at most 2 alleles, and the same holds for the right column.
http://www.cs.uic.edu/~dasgupta/slides.html
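Both properties can be checked mechanically for a candidate sibling group. The sketch below is illustrative only: a genotype is a list of allele pairs, one pair per locus, and the 2-allele check simply tries every father/mother ordering of the pairs, which is exponential in the group size and meant for small examples. The function names are assumptions made here.

```python
from itertools import product


def satisfies_4_allele(group):
    """group: list of individuals, each a list of (x, y) pairs per locus.
    True if at most 4 distinct alleles occur at every locus."""
    loci = len(group[0])
    return all(len({a for ind in group for a in ind[k]}) <= 4
               for k in range(loci))


def satisfies_2_allele(group):
    """True if, at every locus, the pairs can be ordered so that the left
    column and the right column each contain at most 2 alleles."""
    loci = len(group[0])
    for k in range(loci):
        pairs = [ind[k] for ind in group]
        feasible = False
        # flips[i] says whether pair i is read as (right, left).
        for flips in product((False, True), repeat=len(pairs)):
            left = {p[1] if f else p[0] for p, f in zip(pairs, flips)}
            right = {p[0] if f else p[1] for p, f in zip(pairs, flips)}
            if len(left) <= 2 and len(right) <= 2:
                feasible = True
                break
        if not feasible:
            return False
    return True


# Single-locus genotypes from the Mendelian constraints examples.
three = [[(155, 155)], [(149, 155)], [(149, 151)]]
four = three + [[(149, 173)]]
print(satisfies_4_allele(four), satisfies_2_allele(three),
      satisfies_2_allele(four))
```

The four-genotype example shows the difference between the two rules: it uses exactly 4 alleles, so the weaker 4-allele property holds, but no ordering keeps both columns down to 2 alleles.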

Summary of results

Problems 2-ALLELE(n, ℓ) and 4-ALLELE(n, ℓ), with n individuals and ℓ loci; a is the maximum size of any sibling group
a = 3, ℓ = O(n^3) : (1+ε)-inapproximable assuming RP ≠ NP
a = 3, any ℓ : (7/6 + ε)-approximation

a = 4, ℓ = 2 : (1+ε)-inapproximable assuming RP ≠ NP
a = 4, any ℓ : (3/2 + ε)-approximation

a = n^δ, ℓ = O(n^2) : Ω(n^ε)-inapproximable assuming ZPP ≠ NP, 0 < ε < δ < 1

Algorithm (4-allele)
Construct possible sets S1, S2, ..., Sm that satisfy the 4-allele rule (such sets must exist, since each pair of individuals forms a valid set).
        loc1   loc2
ind1    1/1    2/3
ind2    1/4    5/6
set(1,2) = {1,4} {2,3,5,6}

For each individual x, add it to Sj only if its alleles at each locus are in the set of alleles for that locus in Sj.

Find a minimum set cover of all the individuals from the sets S1, S2, ..., Sm. Return the sets in the cover as sibling groups.

Example
     Locus 1   Locus 2
A    1/1       2/3
B    1/4       5/6
C    1/4       2/6
D    7/8       9/6

Pair       Locus 1      Locus 2      Resulting set
Set A,B    {1,4}        {2,3,5,6}    {A,B,C}
Set A,C    {1,4}        {2,3,6}      {A,C}
Set A,D    {1,7,8}      {2,3,6,9}    {A,D}
Set B,C    {1,4}        {2,5,6}      {B,C}
Set B,D    {1,4,7,8}    {5,6,9}      {B,D}
Set C,D    {1,4,7,8}    {2,6,9}      {C,D}
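The construction can be sketched on the example above. One substitution is made and should be kept in mind: minimum set cover is itself hard to solve exactly, so this sketch uses the standard greedy set-cover heuristic instead of an exact minimum cover. Singleton seeds are added so that every individual is coverable. All function names are illustrative.

```python
from itertools import combinations


def allele_sets(members, genotypes):
    """Per-locus union of the alleles carried by the given individuals."""
    loci = len(next(iter(genotypes.values())))
    return [{a for m in members for a in genotypes[m][k]}
            for k in range(loci)]


def candidate_groups(genotypes):
    """Grow each single individual and each pair into the set of all
    individuals whose alleles fit inside the seed's allele sets."""
    names = sorted(genotypes)
    seeds = [(n,) for n in names] + list(combinations(names, 2))
    groups = set()
    for seed in seeds:
        allowed = allele_sets(seed, genotypes)
        groups.add(frozenset(
            x for x in names
            if all(set(genotypes[x][k]) <= allowed[k]
                   for k in range(len(allowed)))))
    return groups


def greedy_sibling_groups(genotypes):
    """Greedy set cover (a heuristic stand-in for minimum set cover)."""
    candidates = sorted(candidate_groups(genotypes),
                        key=lambda g: (-len(g), sorted(g)))
    uncovered = set(genotypes)
    result = []
    while uncovered:
        best = max(candidates, key=lambda g: len(g & uncovered))
        result.append(set(best & uncovered))  # keep groups disjoint
        uncovered -= best
    return result


genotypes = {"A": [(1, 1), (2, 3)], "B": [(1, 4), (5, 6)],
             "C": [(1, 4), (2, 6)], "D": [(7, 8), (9, 6)]}
print(greedy_sibling_groups(genotypes))
```

On this input the pair (A, B) grows into {A, B, C}, exactly as in the table, and D ends up in a group of its own.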

Half Sibling Reconstruction
Individual \ Locus   locus1   locus2
1                    7/8      19/20
2                    7/10     20/46
3                    5/6      19/23
4                    4/5      15/19
5                    2/10     15/19
Group1 = {1, 2, 3, 4}, Group2 = {5}
Half-sibs property: at each locus there is an allele pair A = {a, b} such that every individual i, with alleles {ui, vi} at that locus, satisfies ui ∈ A or vi ∈ A.

Half Sibling Reconstruction
Group1 = {1, 2, 3, 4}: Locus 1: alleles {5, 7}; Locus 2: alleles {19, 20}
Individual 5 (2/10, 15/19) fits the half-sibling condition of Group1 at Locus 2 but not at Locus 1.
The problem is NP-complete and APX-hard.
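The half-sibs property can be tested for a candidate group by trying every allele pair at each locus. The following is a brute-force sketch for small examples, using the table above as data; the function name and the data layout (a list of per-locus allele pairs for each individual) are assumptions made for illustration.

```python
from itertools import combinations


def has_half_sibs_property(group):
    """group: list of individuals, each a list of (u, v) pairs per locus.
    True if at every locus some allele pair A = {a, b} (or a single
    allele) hits at least one allele of every individual."""
    loci = len(group[0])
    for k in range(loci):
        pairs = [set(ind[k]) for ind in group]
        alleles = sorted(set().union(*pairs))
        candidates = ([{a} for a in alleles] +
                      [set(c) for c in combinations(alleles, 2)])
        if not any(all(p & A for p in pairs) for A in candidates):
            return False
    return True


individuals = {
    1: [(7, 8), (19, 20)],
    2: [(7, 10), (20, 46)],
    3: [(5, 6), (19, 23)],
    4: [(4, 5), (15, 19)],
    5: [(2, 10), (15, 19)],
}
group1 = [individuals[i] for i in (1, 2, 3, 4)]
everyone = list(individuals.values())
print(has_half_sibs_property(group1), has_half_sibs_property(everyone))
```

Group1 passes with A = {5, 7} at locus 1 and A = {19, 20} at locus 2, while adding individual 5 breaks the property at locus 1.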

Half-sibs min set cover algorithm
Enumerate all maximal feasible half-sibling sets S in the cohort U that obey the Half-Sibs Property: (C(2n,2))^k candidates.
Find a minimum set cover of all the individuals from the sets S1, S2, ..., Sm. Return the sets in the cover as sibling groups.

k partition-distance problem

Introduction
Clustering (partitioning) is a fundamental and important research topic.
A partition divides the set of elements N into two or more disjoint clusters (non-empty subsets of N) such that each element belongs to exactly one cluster.
Different partitioning algorithms for the same application will produce different partitions from the same set of elements.
Once partitions have been generated, how to assess them and how to find a good partition are important and interesting problems.
The purpose of my paper is to compute the distance between these partitions.
We focus on the partition-distance introduced by Almudevar and Field; Gusfield; and Konovalov, Litow and Bajema.

Example
Element set N = {0,1,2,3,4,5,6,7,8,9}, 2 partitions P1 and P2
P1 consists of clusters C1,1, C1,2 and C1,3
P2 consists of clusters C2,1, C2,2 and C2,3
[Figure: the elements of N placed in the clusters of P1 (cluster sizes 5, 2, 3) and of P2 (cluster sizes 2, 2, 6).]

The k partition-distance (k-PD) problem
Input: a set of elements N and k partitions P = {P1, P2, ..., Pk}, k ≥ 2. A partition divides these elements into two or more disjoint clusters such that each element belongs to exactly one cluster.

Goal: delete the minimum number of elements from each partition in P such that all remaining partitions become identical.
Two partitions Pu and Pv are identical if and only if every cluster in Pu maps to the same cluster in Pv (the converse is then forced).
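For k = 2 the problem reduces to an assignment problem: match clusters of one partition to clusters of the other so that the total overlap is as large as possible; the elements outside the matched overlaps are the ones to delete. The sketch below makes this concrete by brute force over all matchings, which is only practical for a handful of clusters and is not the cubic-time assignment algorithm from the literature. The function name and the small instance are illustrative assumptions.

```python
from itertools import permutations


def partition_distance(p1, p2):
    """Minimum number of elements to delete so that the two partitions
    (lists of clusters over the same element set) become identical."""
    a = [set(c) for c in p1]
    b = [set(c) for c in p2]
    # Pad the shorter partition with empty clusters so both have equal size.
    size = max(len(a), len(b))
    a += [set() for _ in range(size - len(a))]
    b += [set() for _ in range(size - len(b))]
    n = sum(len(c) for c in a)
    # Best one-to-one matching of clusters by total overlap.
    best = max(sum(len(a[i] & b[j]) for i, j in enumerate(perm))
               for perm in permutations(range(size)))
    return n - best


# Hypothetical instance: only element 3 sits in mismatched clusters.
p1 = [{1, 2, 3}, {4, 5}]
p2 = [{1, 2}, {3, 4, 5}]
print(partition_distance(p1, p2))
```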

Example
Element set N = {0,1,2,3,4,5,6,7,8,9}, 2 partitions P1 and P2
P1 consists of clusters C1,1, C1,2 and C1,3
P2 consists of clusters C2,1, C2,2 and C2,3
[Figure: the elements of N placed in the clusters of P1 (cluster sizes 5, 2, 3) and of P2 (cluster sizes 2, 2, 6).]
Output:
Deleted elements: {5, 8, 9, 0}
Distance of the two partitions: 4

Example
Element set N = {0,1,2,3,4,5,6,7,8,9}, 3 partitions P1, P2, P3
[Figure: the elements of N placed in the clusters of P1, P2 and P3.]
Output:
Deleted elements: {4, 5, 8, 9, 0}
Distance of the three partitions: 5

Previous results
Almudevar and Field first mentioned the 2 partition-distance problem, motivated by biology [J. of Agricultural, Biological, and Environmental Statistics '99].
They gave an exponential-time algorithm in order to find a good partition of the individuals (fish data, actually) into sibling groups.
Gusfield [IPL '02] proposed an O(c^3 + |N|)-time algorithm by reducing this problem (2-PD) to the maximum assignment problem, where c is the sum of the numbers of clusters of both partitions.

Previous results
Konovalov, Litow & Bajema [Bioinformatics '05] independently designed an O(c^3 + |N|)-time algorithm by reducing this problem to the minimum assignment problem.
Gusfield [IPL '02] posed a generalization of the partition-distance problem, called the k partition-distance problem (i.e., k > 2).
Gusfield [IPL '02] also showed that the k partition-distance problem is NP-complete for k > 2, by reduction from the 3-dimensional matching problem.

Our result
The first known 2-approximation algorithm for the k-PD problem, which runs in O(kρ|N|) time, where ρ is the maximum number of clusters of these k partitions and |N| is the number of elements.
In the worst case, our algorithm runs in O(|N|^2) time for the 2 partition-distance problem, whereas the algorithms of Gusfield and of Konovalov et al. run in O(|N|^3) time.
We implemented our algorithm in C and PHP.

Conclusion
Would you like to join me in this adventure of Algorithms in Bioinformatics?

Any Comments or Questions?