Dynamic Programming (cont’d) CS 466 Saurabh Sinha


Page 1: Dynamic Programming (cont’d)

Dynamic Programming (cont’d)

CS 466 Saurabh Sinha

Page 2: Dynamic Programming (cont’d)

Affine Gap Penalties

• In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:

This is more likely:

ATA__GGC
ATGATCGC

This is less likely:

ATA_G_GC
ATGATCGC

• Normal scoring (a fixed penalty per space) would give the same score to both alignments

Page 3: Dynamic Programming (cont’d)

Accounting for Gaps

• Gap: a contiguous sequence of spaces in one of the rows

• Score for a gap of length x is -(ρ + σx), where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty

• ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap
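As a quick check of the formula, the sketch below scores the two example alignments from the previous slide under the affine model. The values of `rho` and `sigma` are hypothetical, chosen only so that ρ is large relative to σ.

```python
def affine_gap_score(x, rho, sigma):
    """Score of one gap of length x under the affine model: -(rho + sigma * x)."""
    if x <= 0:
        return 0
    return -(rho + sigma * x)

rho, sigma = 3, 1  # hypothetical penalties, not taken from the lecture

one_gap = affine_gap_score(2, rho, sigma)       # ATA__GGC: one gap of length 2
two_gaps = 2 * affine_gap_score(1, rho, sigma)  # ATA_G_GC: two gaps of length 1

print(one_gap, two_gaps)  # -5 -8
```

The single long gap is penalized less than two separate gaps, which matches the biological intuition that one indel event is more likely than two.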

Page 4: Dynamic Programming (cont’d)

Affine gap penalty in DP

• When computing $s_{i,j}$, we need to look at $s_{i,j-1}, s_{i,j-2}, s_{i,j-3}, \ldots$ and $s_{i-1,j}, s_{i-2,j}, \ldots$

• Each cell needs $O(n)$ time for update
• $O(n^2)$ cells
• Therefore, an $O(n^3)$ algorithm
• We can still do this in $O(n^2)$ time

Page 5: Dynamic Programming (cont’d)

Affine Gap Penalty Recurrences

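$$
s^{\downarrow}_{i,j} = \max \begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle}
\end{cases}
$$

$$
s^{\rightarrow}_{i,j} = \max \begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle}
\end{cases}
$$

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion: from top} \\
s^{\rightarrow}_{i,j} & \text{end insertion: from bottom}
\end{cases}
$$

Here $s^{\downarrow}_{i,j}$ is the best score of an alignment of the prefixes that ends in a gap in $w$, $s^{\rightarrow}_{i,j}$ is the best score of one that ends in a gap in $v$, and $s_{i,j}$ (the middle level) is the best score overall.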
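The three recurrences can be sketched with three score tables, giving O(nm) time overall. The function name, the match/mismatch scores standing in for δ, and the default penalties are illustrative assumptions, not values from the lecture.

```python
NEG_INF = float("-inf")

def affine_align_score(v, w, match=1, mismatch=-1, rho=3, sigma=1):
    """Global alignment score of v and w with affine gap penalty -(rho + sigma*x)."""
    n, m = len(v), len(w)
    # lower: alignment ends with a gap in w (deletion)
    # upper: alignment ends with a gap in v (insertion)
    # middle: best score overall
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                # continue a gap in w, or start one from the middle level
                lower[i][j] = max(lower[i - 1][j] - sigma,
                                  middle[i - 1][j] - (rho + sigma))
            if j > 0:
                # continue a gap in v, or start one from the middle level
                upper[i][j] = max(upper[i][j - 1] - sigma,
                                  middle[i][j - 1] - (rho + sigma))
            diag = NEG_INF
            if i > 0 and j > 0:
                delta = match if v[i - 1] == w[j - 1] else mismatch
                diag = middle[i - 1][j - 1] + delta
            # match/mismatch, end deletion, or end insertion
            middle[i][j] = max(diag, lower[i][j], upper[i][j])
    return middle[n][m]
```

Each cell is filled in constant time because the two extra tables remember whether a gap is already open, so there is no need to scan back over all possible gap lengths.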

Page 6: Dynamic Programming (cont’d)

Optional Reading: Section 6.10 (J & P), Multiple Alignment

Page 7: Dynamic Programming (cont’d)

Gene Prediction

Page 8: Dynamic Programming (cont’d)

Gene Prediction: Computational Challenge

• Gene: A sequence of nucleotides coding for a protein

• Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

Page 9: Dynamic Programming (cont’d)
Page 10: Dynamic Programming (cont’d)

The Genetic Code (figure: codon table)

SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm

Page 11: Dynamic Programming (cont’d)

Codons

• In 1961, Sydney Brenner and Francis Crick discovered frameshift mutations

• Systematically deleted nucleotides from DNA
– Single and double deletions dramatically altered the protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein

Page 12: Dynamic Programming (cont’d)

Great Discovery Provoking Wrong Assumption

• In 1964, Charles Yanofsky and Sydney Brenner proved colinearity in the order of codons with respect to amino acids in proteins

• As a result, it was incorrectly assumed that the triplets encoding amino acid sequences form contiguous strips of information.

Page 13: Dynamic Programming (cont’d)

Exons and Introns

• In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

• This makes computational gene prediction in eukaryotes even more difficult

• Prokaryotes don’t have introns: genes in prokaryotes are continuous

Page 14: Dynamic Programming (cont’d)

Splicing

(Figure: a gene drawn as exon1, intron1, exon2, intron2, exon3, with the steps labelled transcription, splicing and translation. exon = coding, intron = non-coding.)

Credit: Batzoglou

Page 15: Dynamic Programming (cont’d)

Gene prediction

• More difficult in eukaryotes than in prokaryotes (due to introns).

• In human genome, ~3% of DNA sequence is genes

• Lot of “junk” DNA between genes, and even inside genes (between exons).

• Gene prediction must deal with this.

Page 16: Dynamic Programming (cont’d)

Gene prediction: broadly speaking

• Statistical approaches: look for features that appear frequently in genes and infrequently elsewhere

• Similarity based approaches: a newly sequenced gene may be similar to a known gene.
– Even this is not so simple. The exon structures may be different between otherwise similar genes

Page 17: Dynamic Programming (cont’d)

Statistical approaches

Page 18: Dynamic Programming (cont’d)

Open Reading Frames (ORFs)

• Let us consider gene prediction in prokaryotes (no introns)

• Detect potential coding regions by looking at ORFs
– A region of length n comprises n/3 codons
– Stop codons break the genome into segments between consecutive Stop codons
– The subsegments of these that start from the Start codon (ATG) are ORFs

(Figure: a genomic sequence with an open reading frame running from ATG to TGA)
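The ORF definition above can be sketched as a scan over the three forward reading frames of one strand. The function name `find_orfs`, the half-open `(start, end)` return convention, and the choice to take the first ATG after a stop codon are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(seq, min_codons=0):
    """Return (start, end) index pairs of ORFs in the three forward frames.

    Each ORF runs from the first ATG after a stop codon up to and
    including the next in-frame stop codon; end is exclusive.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == START_CODON and start is None:
                start = i
            elif codon in STOP_CODONS:
                # codons from the start codon up to, but not counting, the stop
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

print(find_orfs("CATGGGTTGACC"))  # [(1, 10)]
```

In the example the only ORF is `ATG GGT TGA`, found in the second forward frame.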

Page 19: Dynamic Programming (cont’d)

ORFs

• 6 reading frames in any given sequence
– 6 ways to map the DNA sequence to a codon sequence (+1, +2, +3, -1, -2, -3)
– 3 on either strand

• Look at all 6 reading frames for ORFs
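A minimal sketch of enumerating the six reading frames: three offsets on the given strand and three on its reverse complement. The frame labels follow the slide; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Map each frame label (+1..+3, -1..-3) to its list of complete codons."""
    frames = {}
    rc = reverse_complement(seq)
    for k in range(3):
        frames[f"+{k + 1}"] = [seq[i:i + 3] for i in range(k, len(seq) - 2, 3)]
        frames[f"-{k + 1}"] = [rc[i:i + 3] for i in range(k, len(rc) - 2, 3)]
    return frames

for label, codons in six_frames("ATGCGT").items():
    print(label, codons)
```

An ORF scan would then be run on each of the six codon lists in turn.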

Page 20: Dynamic Programming (cont’d)

Long vs. Short ORFs

• A long open reading frame may be a gene
– At random, we should expect one stop codon every 64/3 ≈ 21 codons (3 of the 64 codons are stop codons)
– However, genes are usually much longer than this

• A basic approach is to scan for ORFs whose length exceeds a certain threshold
– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
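The arithmetic behind the length threshold can be sketched as follows, assuming (as the slide does implicitly) that every codon is equally likely at random. The helper name is hypothetical.

```python
P_STOP = 3 / 64  # 3 of the 64 codons are stop codons

# Expected spacing between stop codons in random sequence
expected_spacing = 1 / P_STOP  # 64/3, about 21 codons

def prob_no_stop(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - P_STOP) ** k

print(round(expected_spacing, 1))
print(prob_no_stop(21), prob_no_stop(100))
```

Under this model a stop-free run of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as candidate genes.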

Page 21: Dynamic Programming (cont’d)

Codon usage

• In a given sequence (e.g., an ORF), compute the frequency distribution of codons (64 element array): the codon usage array

• The codon usage array for coding sequences is different from that for non-coding sequences

• If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene
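One way to build the 64 element codon usage array is sketched below. The ordering of `CODONS` (alphabetical over A, C, G, T) and the function name are arbitrary choices made for illustration.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so that array positions are comparable
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_usage(seq):
    """Relative frequency of each codon in seq, read in frame from position 0."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return [counts[c] / total if total else 0.0 for c in CODONS]

usage = codon_usage("ATGATGCGC")
print(usage[CODONS.index("ATG")], usage[CODONS.index("CGC")])
```

Two such arrays, one from known coding sequences and one from non-coding sequences, can then be compared against the array of a candidate ORF.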

Page 22: Dynamic Programming (cont’d)

Codon usage

• Codons coding for “Arg” in human:
– CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3%
– In a coding sequence, codon CGC is more than 12 times as likely as codon AGG (38% vs. 3%)
– An ORF preferring CGC over AGG is likely to be a gene

Page 23: Dynamic Programming (cont’d)

Codon Usage in Human Genome

Page 24: Dynamic Programming (cont’d)

Codon usage

• One way to test if an ORF is a gene is to compute
– Pr(ORF sequence under a coding sequence model)
– Pr(ORF sequence under a non-coding model)
– Ratio of the two

• These methods work best in prokaryotes
• The exon-intron trouble is not handled yet
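The ratio test can be sketched as a log-likelihood ratio, assuming each codon is drawn independently under either model. The coding frequencies for CGC and AGG below reuse the Arg percentages quoted earlier; the uniform non-coding frequencies are a made-up placeholder, not data.

```python
import math

def log_likelihood_ratio(orf, coding_freq, noncoding_freq):
    """Sum over codons of log(Pr(codon | coding) / Pr(codon | non-coding)).

    Assumes independent codons and nonzero frequencies for every codon seen.
    """
    llr = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        llr += math.log(coding_freq[codon] / noncoding_freq[codon])
    return llr

# Toy model restricted to two Arg codons
coding = {"CGC": 0.38, "AGG": 0.03}         # from the Arg codon usage slide
noncoding = {"CGC": 1 / 6, "AGG": 1 / 6}    # hypothetical uniform background

print(log_likelihood_ratio("CGCCGC", coding, noncoding))  # positive: looks coding
print(log_likelihood_ratio("AGGAGG", coding, noncoding))  # negative: looks non-coding
```

Working in log space avoids numerical underflow when the ORF is long, and a positive total favors the coding model.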

Page 25: Dynamic Programming (cont’d)

Promoter Structure in Prokaryotes (E. coli)

Transcription starts at offset 0.

• Pribnow Box (-10)

• Gilbert Box (-30)

• Ribosomal Binding Site (+10)

Page 26: Dynamic Programming (cont’d)

Ribosomal Binding Site

Page 27: Dynamic Programming (cont’d)

Splicing Signals: an additional statistical clue, for eukaryotes

Exons are interspersed with introns and typically flanked by GT and AG

Page 28: Dynamic Programming (cont’d)

Splice site detection

Donor site (5’ to 3’): nucleotide frequencies (%) by position

Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    1   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78   99    0   41   …   27
T          23   …   13    8    1   98    3   …   25

From lectures by Serafim Batzoglou (Stanford)
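A position-specific frequency table like this can be used to score a candidate donor site. The sketch below uses only the five columns given in full (positions -2 to 2) and multiplies the per-position frequencies; the function name and the five-base window are illustrative assumptions.

```python
# Frequencies for positions -2..2 of the donor site table, as fractions
DONOR_FREQ = {
    -2: {"A": 0.60, "C": 0.15, "G": 0.12, "T": 0.13},
    -1: {"A": 0.09, "C": 0.05, "G": 0.78, "T": 0.08},
     0: {"A": 0.00, "C": 0.00, "G": 0.99, "T": 0.01},
     1: {"A": 0.01, "C": 0.01, "G": 0.00, "T": 0.98},
     2: {"A": 0.54, "C": 0.02, "G": 0.41, "T": 0.03},
}

def donor_site_score(window):
    """Probability of a 5-base window (positions -2..2) under the donor model."""
    if len(window) != 5:
        raise ValueError("window must cover positions -2..2 (5 bases)")
    score = 1.0
    for pos, base in zip(range(-2, 3), window):
        score *= DONOR_FREQ[pos][base]
    return score

print(donor_site_score("AGGTA"), donor_site_score("CCCCC"))
```

A window with GT at positions 0 and 1 scores far higher than one without, reflecting the near-invariant GT at the start of introns. In practice the score would be compared against a background model rather than used raw.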

Page 29: Dynamic Programming (cont’d)

Consensus splice sites

Page 30: Dynamic Programming (cont’d)

Statistical approaches: summary

• Codon usage

• Promoter motifs

• Ribosome binding site

• Splicing sites

Page 31: Dynamic Programming (cont’d)

Similarity based approaches

Page 32: Dynamic Programming (cont’d)

Similarity based approaches

• Some genomes may be very well-studied, with many genes having been experimentally verified.

• Closely-related organisms may have similar genes

• Unknown genes in one species may be compared to genes in some closely-related species

Page 33: Dynamic Programming (cont’d)

The basic approach

• Given a protein sequence and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence

• Deals with the exon-intron problem

• First cut: Find fragments in the genomic sequence that match portions of the protein sequence (local alignment)

• Then find the “optimal” subset of non-overlapping fragments

Page 34: Dynamic Programming (cont’d)

Exon chaining

• Each of the fragments of the genomic sequence that somewhat match the protein (locally) is a putative exon

• The “goodness” of the match is the “weight” assigned to this putative exon

• Thus, we have a set of weighted intervals (l,r,w): for a fragment from l to r, with weight w representing how well it matches (a portion of) the protein

Page 35: Dynamic Programming (cont’d)

Exon Chaining Problem

• Input: A set of weighted intervals (l,r,w)
• Output: A maximum weight chain of non-overlapping intervals from this set

Page 36: Dynamic Programming (cont’d)

Exon Chaining Problem: Graph Representation

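• This problem can be solved with dynamic programming in O(n) time.

• Graph: one vertex per interval endpoint (2n vertices, in left-to-right order)
– an edge from every l_i to r_i, with weight w_i
– an edge of weight 0 between every two successive vertices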

Page 37: Dynamic Programming (cont’d)

Assumptions

• No two intervals have a common boundary point. So the (li,ri) define 2n distinct points, if there are n intervals

Page 38: Dynamic Programming (cont’d)

Exon Chaining Algorithm

ExonChaining(G, n)    // graph, number of intervals
  for i ← 1 to 2n
    s_i ← 0
  for i ← 1 to 2n    // s_0 is taken to be 0
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
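A runnable sketch of the pseudocode above. It takes the intervals directly instead of a prebuilt graph and sorts the 2n endpoints itself, so this version runs in O(n log n) rather than O(n); the input format is an assumption.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    intervals: list of (l, r, w) triples whose 2n endpoints are all distinct.
    """
    # One vertex per endpoint, ordered left to right
    points = []
    for idx, (l, r, w) in enumerate(intervals):
        points.append((l, 0, idx))  # 0 marks a left end
        points.append((r, 1, idx))  # 1 marks a right end
    points.sort()

    left_vertex = {}              # interval index -> vertex index of its left end
    s = [0] * (len(points) + 1)   # s[0] = 0 is the value before any vertex
    for i, (_, is_right, idx) in enumerate(points, start=1):
        if is_right:
            j = left_vertex[idx]
            w = intervals[idx][2]
            # either take this interval (on top of the best chain ending
            # before its left end) or skip it
            s[i] = max(s[j] + w, s[i - 1])
        else:
            left_vertex[idx] = i
            s[i] = s[i - 1]
    return s[-1]

print(exon_chaining([(1, 4, 3), (2, 6, 5), (5, 8, 4), (7, 10, 2)]))  # 7
```

In the example the best chains are (1,4)+(5,8) and (2,6)+(7,10), both with total weight 7.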

Page 39: Dynamic Programming (cont’d)

Not very helpful

• A chain is a set of non-overlapping exons in order (left to right)

• But the matching protein portions may not be in the same order !

Page 40: Dynamic Programming (cont’d)

Spliced Alignment

• Begins by selecting either all putative exons between potential acceptor and donor sites, or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).

• This set is further filtered in such a way as to retain all true exons, along with some false ones.

• Then find the chain of exons such that the sequence similarity to the target protein sequence is maximized

Page 41: Dynamic Programming (cont’d)

Spliced Alignment Problem: Formulation

• Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B.

• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized

Γ* - concatenation of all exons from chain Γ

Page 42: Dynamic Programming (cont’d)

Dynamic programming

• Genomic sequence G = g_1 g_2 … g_n

• Target sequence T = t_1 t_2 … t_m

• As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T

• Problem is, there are many i-prefixes possible (since multiple blocks may include position i)

Page 43: Dynamic Programming (cont’d)

Idea

• Find the optimal alignment score of the i-prefix of G and the j-prefix of T assuming that this alignment uses a particular block B at position i

• Denote this score by S(i, j, B)
• Compute it for every block B that includes i

Page 44: Dynamic Programming (cont’d)

Recurrence

If i is not the starting vertex of block B:

$$
S(i, j, B) = \max \begin{cases}
S(i-1, j, B) - \text{indel penalty} \\
S(i, j-1, B) - \text{indel penalty} \\
S(i-1, j-1, B) + \delta(g_i, t_j)
\end{cases}
$$

If i is the starting vertex of block B:

$$
S(i, j, B) = \max \begin{cases}
S(i, j-1, B) - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j)
\end{cases}
$$
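The two cases of the recurrence can be sketched as below. Several details are assumptions that the slides do not spell out: blocks are given as inclusive 0-based `(start, end)` index pairs into G, a block B' precedes B when `end(B') < start(B)`, a block with no predecessor starts from the score of aligning an empty genomic prefix (`-indel * j`), and the empty chain is allowed as a fallback. The scoring values are placeholders.

```python
def spliced_alignment_score(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and a chain of blocks from G."""
    m = len(T)

    def delta(a, b):
        return match if a == b else mismatch

    # Row for "no genomic character used yet": j target characters, all indels
    empty = [-indel * j for j in range(m + 1)]
    end_rows = []     # (end index, row S(end(B), ., B)) for processed blocks
    best = empty[m]   # score of the empty chain

    # Sorting by end guarantees every preceding block is processed first
    for start, end in sorted(blocks, key=lambda b: b[1]):
        # Case "i is the starting vertex": take the best row among preceding blocks
        prev = empty[:]
        for e, row in end_rows:
            if e < start:
                prev = [max(a, b) for a, b in zip(prev, row)]
        # Ordinary alignment recurrence inside the block
        for i in range(start, end + 1):
            cur = [prev[0] - indel]
            for j in range(1, m + 1):
                cur.append(max(prev[j] - indel,
                               cur[j - 1] - indel,
                               prev[j - 1] + delta(G[i], T[j - 1])))
            prev = cur
        end_rows.append((end, prev))
        best = max(best, prev[m])
    return best

G = "TTACGTTTGGATT"
blocks = [(2, 4), (5, 7), (8, 10)]   # "ACG", "TTT", "GGA"
print(spliced_alignment_score(G, "ACGGGA", blocks))  # 6
```

Here the chain ACG + GGA reproduces the target exactly, and the decoy block TTT is skipped. Keeping only the last row per block is enough because later blocks only ever read `S(end(B'), j, B')`.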