greedy approximation algorithms for covering problems in computational biology


Post on 30-Dec-2015


Greedy Approximation Algorithms for Covering Problems in Computational Biology

Ion Mandoiu

Computer Science & Engineering Department

University of Connecticut


Why Approximation Algorithms?

• Most practical optimization problems are NP-hard

• Approximation algorithms offer the next best thing to an efficient exact algorithm
– Polynomial time
– Solutions guaranteed to be “close” to optimum
– α-approximation algorithm: solution cost within a multiplicative factor of α of optimum cost

• Practical relevance: insights needed to establish approximation guarantee often lead to fast, highly effective practical implementations


Why Computational Biology?

• Exploding multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, …

• Source of a fast growing number of combinatorial optimization applications:
– TSP and Euler paths in DNA sequencing
– Dynamic Programming in sequence alignment
– Integer Programming in Haplotype inference
– …

• This talk: two “covering” problems in computational biology (primer set selection and string barcoding)


Overview

• Potential function greedy algorithm
- The set cover problem and the greedy algorithm
- Potential function generalization
• Primer Set Selection for Multiplex PCR
• The String Barcoding Problem
• Conclusions


The Set Cover Problem

Given:
- Universal set U with n elements
- Family of sets (Sx, x ∈ X) covering all elements of U

Find:
- Minimum size subset X’ of X s.t. (Sx, x ∈ X’) covers all elements of U

Greedy Algorithm:
- Start with empty X’, and repeatedly add x such that Sx contains the most uncovered elements until U is covered
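The greedy rule stated here can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `greedy_set_cover`, the dictionary representation of the family, and the toy instance are all assumptions made for the example.

```python
def greedy_set_cover(universe, family):
    """Greedy set cover. `family` maps an index x to the set S_x (assumed layout)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the index whose set contains the most still-uncovered elements.
        best = max(family, key=lambda x: len(family[x] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


# Hypothetical toy instance: U = {1,...,6} and four candidate sets.
U = range(1, 7)
S = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
cover = greedy_set_cover(U, S)
```

On this instance the first pick is the 4-element set, after which only the set containing both remaining elements is needed.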


Approximation Guarantee

Classical result (Johnson’74, Lovász’75, Chvátal’79): the greedy setcover algorithm has an approximation factor of H(n) = 1+1/2+1/3+…+1/n ≤ 1+ln(n)

- The approximation factor is tight
- Set cover cannot be approximated within a factor of (1-ε)ln(n) unless NP ⊆ DTIME(n^{log log n})


General setting

“Potential function” Φ(X’) ≥ 0
• Φ({}) = Φmax
• Φ(X’) = 0 for all feasible solutions
• X’’ ⊆ X’ ⇒ Φ(X’’) ≥ Φ(X’)
• If Φ(X’) > 0, then there exists x s.t. Φ(X’+x) < Φ(X’)
• X’’ ⊆ X’ ⇒ ∆(x,X’’) ≥ ∆(x,X’) for every x, where ∆(x,X’) := Φ(X’) - Φ(X’+x)

Problem: find minimum size set X’ with Φ(X’) = 0


Generic Greedy Algorithm

• Theorem (Konwar et al.’05): The generic greedy algorithm has an approximation factor of 1+ln ∆max

X’ ← {}
While Φ(X’) > 0
    Find x with maximum ∆(x,X’)
    X’ ← X’ + x
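As a rough sketch, the generic loop can be written against any potential function supplied as a callable. The names `potential_greedy` and `uncovered_count` and the toy instance are assumptions for illustration; the potential used here (number of uncovered elements) simply recovers the set cover greedy.

```python
def potential_greedy(candidates, phi):
    """Generic greedy: repeatedly add the x with maximum potential decrease."""
    chosen = frozenset()
    order = []
    while phi(chosen) > 0:
        current = phi(chosen)
        remaining = [x for x in candidates if x not in chosen]
        # Delta(x, X') = phi(X') - phi(X' + x)
        gains = {x: current - phi(chosen | {x}) for x in remaining}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            raise ValueError("no candidate decreases the potential")
        chosen = chosen | {best}
        order.append(best)
    return order


# Set cover as a special case (hypothetical toy data).
S = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
U = set().union(*S.values())


def uncovered_count(X):
    covered = set().union(*(S[x] for x in X))
    return len(U - covered)


picked = potential_greedy(S, uncovered_count)
```

The sketch recomputes the potential from scratch for every candidate, which is simple but slow; it is meant only to mirror the pseudocode.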


Proof Idea

Let x1, x2,…,xg be the elements selected by greedy, in the order in which they are chosen, and x*1, x*2,…,x*k be the elements of an optimum solution.

Charging scheme: xi charges to x*j a cost of [formula not preserved in the transcript]

where δij = ∆(xi, {x1,…,xi-1} ∪ {x*1,…,x*j})

Fact 1: Each x*j gets charged a total cost of at most 1+ln ∆max


Proof of claim 2

Fact 2: Each xi charges at least 1 unit of cost


Overview

• Potential function greedy algorithm
• Primer set selection for multiplex PCR
- Motivation and problem formulation
- Greedy applied to primer set selection
- Experimental results
• The String Barcoding Problem
• Conclusions


DNA Structure

• Four nucleotide types: A,C,T,G

• Normally double stranded

• A’s paired with T’s

• C’s paired with G’s


The Polymerase Chain Reaction

[Figure: PCR schematic showing the target sequence, the polymerase, and primers 1 and 2; the cycle is repeated 20-30 times]


Primer Pair Selection Problem

• Given:
– Genomic sequence around amplification locus
– Primer length k
– Amplification upper bound L

• Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)

[Figure: forward and reverse primers flanking the amplification locus within distance L, on strands labeled 5' to 3']


Multiplex PCR

• Multiplex PCR (MP-PCR)
– Multiple DNA fragments amplified simultaneously
– Each amplified fragment still defined by two primers
– A primer may participate in amplification of multiple targets

• Primer set selection
– Typically done by time-consuming trial and error
– An important objective is to minimize number of primers
   • Reduced assay cost
   • Higher effective concentration of primers → higher amplification efficiency
   • Reduced unintended amplification


Primer Set Selection Problem

• Given:

• Genomic sequences around n amplification loci

• Primer length k

• Amplification upper bound L

• Find:

• Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other


Applications

• Single Nucleotide Polymorphism (SNP) genotyping
– Up to thousands of SNPs genotyped simultaneously
– Selective PCR amplification required for improved accuracy

• Spotted microarray synthesis [Fernandes&Skiena’02]
– Primers can be used multiple times
– For each target, need a pair of primers amplifying that target and only that target (amplification uniqueness constraint)
– Can still reduce #primers from 2n to O(n^{1/2})


Previous Work on Primer Selection

• Well-studied problem: [Pearson et al.’96], [Linhart & Shamir’02], [Souvenir et al.’03], etc.

• Almost all problem formulations decouple selection of forward and reverse primers
– To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target
– In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum

• [Pearson et al.’96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation


Previous Work (contd.)

• [Fernandes&Skiena’02] study primer set selection with uniqueness constraints

• Minimum Multi-Colored Subgraph Problem:
– Vertices correspond to candidate primers
– Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus
– Goal is to find minimum size set of vertices inducing edges of all colors

• Can capture amplification length constraints too


Integer Program Formulation

• 0/1 variable xu for every vertex u

• 0/1 variable ye for every edge e
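The program itself did not survive in the transcript. One natural way to write a minimum multi-colored subgraph program with these two families of variables is sketched below; it is an assumption, not necessarily the formulation shown in the talk. Here V is the vertex set and E_i is the set of edges of color i (both names introduced for this sketch).

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{u \in V} x_u \\
\text{subject to}\quad & \sum_{e \in E_i} y_e \ge 1        && \text{for every color } i \\
                       & y_e \le x_u,\quad y_e \le x_v     && \text{for every edge } e = (u,v) \\
                       & x_u,\; y_e \in \{0,1\}
\end{align*}
```

The first constraint asks for at least one selected edge of each color, and the second allows an edge to be selected only when both of its endpoints are.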


LP-Rounding Algorithm

Theorem [Konwar et al.’04]: The LP-rounding algorithm finds a feasible solution at most O(m^{1/2} ln n) times larger than the optimum, where m is the maximum color class size, and n is the number of nodes

For primer selection, m ≤ L^2 ⇒ approximation factor is O(L ln n)

Better approximation?
- Unlikely for minimum multi-colored subgraph problem

(1) Solve linear programming relaxation
(2) Select node u with probability xu
(3) Repeat step 2 O(ln(n)) times and return selected nodes
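Steps (2) and (3) can be sketched as follows, assuming the fractional LP solution from step (1) is already available as a dictionary. The function name `round_lp_solution`, the constant `rounds_factor`, and the toy values are assumptions; solving the LP is deliberately left out.

```python
import math
import random


def round_lp_solution(x_frac, rounds_factor=2.0, seed=0):
    """Randomized rounding of a fractional solution {node: value in [0, 1]}."""
    rng = random.Random(seed)
    n = len(x_frac)
    # O(ln n) independent rounds; the constant is an arbitrary choice here.
    rounds = max(1, math.ceil(rounds_factor * math.log(max(n, 2))))
    selected = set()
    for _ in range(rounds):
        for u, xu in x_frac.items():
            if rng.random() < xu:  # select u with probability x_u
                selected.add(u)
    return selected


# Hypothetical fractional values for three candidate primers.
x_frac = {"p1": 1.0, "p2": 0.0, "p3": 0.5}
nodes = round_lp_solution(x_frac)
```

A complete implementation would also check that the returned nodes induce edges of all colors, since rounding succeeds only with some probability.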


Selection w/o Uniqueness Constraints

• Can be seen as a “simultaneous set covering” problem:
- The ground set is partitioned into n disjoint sets Si (one for each target), each with 2L elements
- The goal is to select a minimum number of sets (i.e., primers) that cover at least half of the elements in each partition

[Figure: L bases on each side of SNPi]


Greedy Algorithm

• Potential function Φ = minimum number of elements that must still be covered = Σi max{0, L - #covered elements in Si}

• Initially, Φ = nL

• For feasible solutions, Φ = 0

• ∆(Φ) ≤ nL (much smaller in practice)

• Theorem [Konwar et al.’05]: The number of primers selected by the greedy algorithm is at most 1+ln(nL) times larger than the optimum
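A small sketch of this potential on the abstract "simultaneous set covering" view, where each target needs L of its elements covered so that the potential starts at nL. The data layout (`covers[p][i]` is the set of elements of Si covered by primer p), the function names, and the toy instance are all assumptions for illustration.

```python
def primer_potential(selected, covers, n_targets, L):
    """Sum over targets of how many of the L required elements are still missing."""
    total = 0
    for i in range(n_targets):
        covered = set()
        for p in selected:
            covered |= covers[p].get(i, set())
        total += max(0, L - len(covered))
    return total


def greedy_primers(covers, n_targets, L):
    selected = []
    phi = primer_potential(selected, covers, n_targets, L)
    while phi > 0:
        best, best_phi = None, phi
        for p in covers:
            if p in selected:
                continue
            new_phi = primer_potential(selected + [p], covers, n_targets, L)
            if new_phi < best_phi:  # largest decrease of the potential so far
                best, best_phi = p, new_phi
        if best is None:
            raise ValueError("no primer decreases the potential")
        selected.append(best)
        phi = best_phi
    return selected


# Hypothetical instance: n = 2 targets, L = 2, four candidate primers.
covers = {
    "p1": {0: {0, 1}, 1: {0}},
    "p2": {1: {1, 2}},
    "p3": {0: {2}},
    "p4": {1: {3}},
}
primers = greedy_primers(covers, n_targets=2, L=2)
```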


Experimental Setting

• Datasets extracted from NCBI databases, L=1000

• Dell PowerEdge 2.8GHz Xeon

• Compared algorithms

– G-FIX: greedy primer cover algorithm [Pearson et al.]

– MIPS-PT: iterative beam-search heuristic [Souvenir et al.]

• Restrict primers to L/2 bases around amplification locus

– G-VAR: naïve modification of G-FIX

• First selected primer can be up to L bases away

• Opposite sequence truncated after selecting first primer

– G-POT: potential function driven greedy algorithm


Experimental Results, NCBI tests

G-FIX: Pearson et al.; G-VAR: G-FIX with dynamic truncation; MIPS-PT: Souvenir et al.; G-POT: potential-function greedy

#Targets   k | G-FIX             | G-VAR             | MIPS-PT           | G-POT
             | #Primers  CPU sec | #Primers  CPU sec | #Primers  CPU sec | #Primers  CPU sec
      20   8 |        7     0.04 |        7     0.08 |        8       10 |        6     0.10
      20  10 |        9     0.03 |       10     0.08 |       13       15 |        9     0.08
      20  12 |       14     0.04 |       13     0.08 |       18       26 |       13     0.11
      50   8 |       13     0.13 |       15     0.30 |       21       48 |       10     0.32
      50  10 |       23     0.22 |       24     0.36 |       30      150 |       18     0.33
      50  12 |       31     0.14 |       32     0.30 |       41      246 |       29     0.28
     100   8 |       17     0.49 |       20     0.89 |       32      226 |       14     0.58
     100  10 |       37     0.37 |       37     0.72 |       50      844 |       31     0.75
     100  12 |       53     0.59 |       48     0.84 |       75     2601 |       42     0.61


#primers, as percentage of 2n (l=8)

[Plot: #primers as a percentage of 2n, versus n]


#primers, as percentage of 2n (l=10)

[Plot: #primers as a percentage of 2n, versus n]


#primers, as percentage of 2n (l=12)

[Plot: #primers as a percentage of 2n, versus n]


CPU Seconds (l=10)

[Plot: CPU seconds versus n]


Overview

• Potential function greedy algorithm
• Primer Set Selection for Multiplex PCR
• The String Barcoding Problem
- Problem Formulation
- Integer programming and greedy algorithms
- Experimental results
• Conclusions


Motivation

• Rapid pathogen detection
– Given
   • Pathogen with unknown identity
   • Database of known pathogens
– Problem
   • Identify unknown pathogen quickly
   • Ideal solution: determine DNA sequence of unknown pathogen


Real World

• Not possible to quickly sequence an unknown pathogen
– Only have sequence for pathogens in database

• Can quickly test for presence of short substrings in unknown virus (substring tests) using hybridization

• String barcoding [Borneman et al.’01, Rash&Gusfield’02]
– Use substring tests that uniquely identify each pathogen in the database


String Barcoding Problem

Given:
Genomic sequences g1,…, gn

Find:
Minimum number of distinguisher strings t1,…,tk

Such that:
For every gi ≠ gj, there exists a string tl which is a substring of gi or gj, but not of both

- At least log2 n distinguishers needed
- Fingerprints ⇒ n distinguishers


Example

• Given sequences:
1. cagtgc
2. cagttc
3. catgga

• Feasible set of distinguishers: {tg, atgga}

         tg   atgga
cagtgc   1    0
cagttc   0    0
catgga   1    1

Row vectors: unique barcodes for each pathogen


Computational Complexity

• [Berman et al.’04] String barcoding cannot be approximated within a factor of (1-ε)ln(n) unless NP ⊆ DTIME(n^{log log n})


Setcover Greedy Algorithm

• Distinguisher selection as setcover problem
– Elements to be covered are the pairs of sequences
– Each candidate distinguisher defines a set of pairs that it separates

• Another view: covering all edges of a complete graph with n vertices by the minimum number of given cuts

• For n sequences, largest set can have O(n^2) elements ⇒ the setcover greedy guarantees ln(n^2) = 2 ln n approximation
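The reduction can be sketched on the three example strings used in this talk. The candidate generation here (every substring of every sequence) is a naive assumption suitable only for toy inputs, and the function names are hypothetical.

```python
from itertools import combinations


def substrings(s):
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def greedy_distinguishers(seqs):
    candidates = sorted(set().union(*(substrings(s) for s in seqs)))
    pairs = list(combinations(range(len(seqs)), 2))
    # A candidate t separates pair (i, j) if it occurs in exactly one of the two.
    separates = {
        t: {(i, j) for i, j in pairs if (t in seqs[i]) != (t in seqs[j])}
        for t in candidates
    }
    uncovered = set(pairs)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda t: len(separates[t] & uncovered))
        if not separates[best] & uncovered:
            raise ValueError("two sequences cannot be separated")
        chosen.append(best)
        uncovered -= separates[best]
    return chosen


seqs = ["cagtgc", "cagttc", "catgga"]
picked = greedy_distinguishers(seqs)
barcodes = [tuple(int(t in s) for t in picked) for s in seqs]
```

Which particular substrings are picked depends on tie-breaking, so the sketch only guarantees that the resulting barcodes are pairwise distinct.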


Integer Program Formulation

• 0/1 variable for each candidate distinguisher
– 1: candidate is selected
– 0: candidate is not selected

• For each pair of sequences, at least one candidate separating them is selected

• Objective Function
– Minimize #selected candidates


Practical Issues

• Quadratic # of constraints, huge # of variables
– Genome sizes range from thousands of bases for phage and viruses to millions for bacteria to billions for higher organisms

• Many variables can be removed:
– Candidates that appear in all sequences
– Sufficient to keep a single candidate among those that appear in the same set of sequences

• How to efficiently remove useless variables?
– Rash&Gusfield use suffix trees


Suffix Tree Example

• Strings:

1. cagtgc

2. cagttc

3. catgga

v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3}

v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3}

v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3}

v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3}

v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1}


Integer Program

Minimize
  V18 + V22 + V11 + V17 + V8      #objective function
Such that
  V18 + V17 + V8 >= 1             #constraint to cover pair 1,2
  V22 + V11 + V8 >= 1             #constraint to cover pair 1,3
  V18 + V22 + V11 + V17 >= 1      #constraint to cover pair 2,3
Binaries                          #all variables are 0/1
  V18 V22 V11 V17 V8
End

         tg (V18)   atgga (V22)
cagtgc   1          0
cagttc   0          0
catgga   1          1


Limitations of Integer Program Method

• Works only for small instances
– 50-150 sequences
– Average length ~1000 characters
– Over 4 hours needed to come within 20% of optimum!

• Scalable Heuristics?


Distinguisher Induced Partition

• Key idea [Berman et al. 04]: Keep track of the partition defined by distinguishers selected so far

[Figure: sequences 1, 2, 3, …, n-1, n split into classes by Distinguisher 1 and Distinguisher 2]


Information Content Heuristic

• Φ = partition entropy = log2(#permutations compatible with current partition)
– Initial partition entropy = log2(n!) ≈ n log2 n
– For feasible distinguisher sets, partition entropy = 0
– ∆(Φ) ≤ n:
   • log2(n!) - log2(k!(n-k)!) < log2(2^n) = n

• Information content heuristic (ICH) = greedy driven by partition entropy

• Theorem [Berman et al.’04] ICH has an approximation factor of 1+ln(n)
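Partition entropy is easy to compute once the partition is stored as a list of classes: the number of compatible permutations is the product of the factorials of the class sizes. The sketch below tracks it on the three example strings from this talk; the function names and the list-of-sets representation are assumptions.

```python
import math


def partition_entropy(classes):
    """log2 of the number of permutations compatible with the partition."""
    return sum(math.log2(math.factorial(len(c))) for c in classes)


def refine(classes, hits):
    """Split each class by membership in `hits`, the sequences containing a distinguisher."""
    refined = []
    for c in classes:
        inside, outside = c & hits, c - hits
        refined.extend(part for part in (inside, outside) if part)
    return refined


seqs = ["cagtgc", "cagttc", "catgga"]
classes = [set(range(len(seqs)))]
start = partition_entropy(classes)  # log2(3!)
for t in ["tg", "atgga"]:
    hits = {i for i, s in enumerate(seqs) if t in s}
    classes = refine(classes, hits)
final = partition_entropy(classes)
```

A greedy driven by this quantity would, at each step, pick the candidate whose refinement lowers the entropy the most.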


ICH Limitations

• Real genomic data has degenerate nucleotides
– Ambiguous sequencing
– Single nucleotide polymorphisms

• For sequences with degenerate nucleotides there are three possibilities for distinguisher hybridization
– Sure hybridization
– Sure mismatch
– Uncertain hybridization

No partition to work with!


Practical Implementation

• ICH and setcover greedy give nearly identical results on data w/o degenerate bases

• Setcover greedy can also be extended to handle
– degenerate bases in the sequences
– redundancy requirements (each pair of sequences must be separated r times)

• Two main steps for both algorithms:
– Candidate generation
– Greedy selection


Candidate Generation

• Can be done using suffix trees

• We use a simpler yet efficient incremental approach

• Candidates that match all or only one sequence are removed from consideration

• Solution quality is similar even when candidates are generated from a single sequence
– Equivalent to considering only distinguisher sets that assign a barcode of (1,1,…,1) to the source sequence


Candidate Selection

• Evaluate ∆(Φ) for all candidates and choose best

• Speed-up techniques
– Efficient gain computation using partition data structure
– Lazy gain update: if old ∆(Φ) is lower than best so far, do not recompute
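The lazy update works because gains never increase as more candidates are selected, so a stale gain is an upper bound on the current one. Below is a priority-queue variant of the same idea, shown on plain set cover rather than on the partition data structure; the function name and toy instance are assumptions.

```python
import heapq


def lazy_greedy_cover(universe, family):
    """Greedy set cover that re-evaluates a gain only when its stale value is on top."""
    uncovered = set(universe)
    # Max-heap via negated gains; stored gains may be stale (upper bounds).
    heap = [(-len(s), x) for x, s in family.items()]
    heapq.heapify(heap)
    chosen, evaluations = [], 0
    while uncovered and heap:
        _, x = heapq.heappop(heap)
        gain = len(family[x] & uncovered)  # fresh gain for the top entry only
        evaluations += 1
        if gain == 0:
            continue
        if not heap or gain >= -heap[0][0]:
            # Fresh gain beats every stale upper bound: x is the true best.
            chosen.append(x)
            uncovered -= family[x]
        else:
            heapq.heappush(heap, (-gain, x))
    if uncovered:
        raise ValueError("the family does not cover the universe")
    return chosen, evaluations


S = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
chosen, evaluations = lazy_greedy_cover(range(1, 7), S)
```

The `evaluations` counter is there only to make the saving visible when compared with re-evaluating every candidate in every round.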


Experimental Results

   n     mat  mat lazy    part  part lazy  # dist
 100    35.4      22.1     2.2        1.4     8.0
 200   221.6     125.2     8.8        4.6    10.0
 500  2168.8    1144.4    53.0       18.7    12.3
1000  5600.4    2756.4   113.6       31.7    14.1

• Averages over 10 testcases, sequence length = 10,000

• Barcodes for 100 sequences of length 1,000,000 computed in less than 10 minutes


Overview

• Potential function greedy algorithm
• Primer Set Selection for Multiplex PCR
• The String Barcoding Problem
• Conclusions


Conclusions

• General potential function framework for designing and analyzing greedy covering algorithms

• Improved approximation guarantees and practical performance for two important optimization problems in computational biology: primer set selection for multiplex PCR, and distinguisher selection for string barcoding


Ongoing Work

• Primer Set Selection
– Improved hybridization models
– Degenerate primers
– Partitioning into multiple multiplexed PCR reactions
– Close approximation gap for minimum multicolored sub-graph

• String Barcoding
– Probe mixtures as distinguishers
– Beyond redundancy: error correcting
– Simultaneous detection of multiple pathogens


Acknowledgments

• B. DasGupta, K. Konwar, A. Russell, A. Shvartsman

• UCONN Research Foundation