dynamic programming (cont’d)veda.cs.uiuc.edu/courses/fa08/cs466/lectures/lecture7.pdf ·...

Dynamic Programming (cont’d) CS 466 Saurabh Sinha


Page 1: Dynamic Programming (cont’d)veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture7.pdf · 2008-09-16 · •In 1964, Charles Yanofsky and Sydney Brenner proved collinearity in the

Dynamic Programming (cont’d)

CS 466
Saurabh Sinha


Affine Gap Penalties

• In nature, a series of k indels often comes as a single event rather than a series of k single-nucleotide events:

ATA__GGCATGATCGC    (one contiguous gap: this is more likely)
ATA_G_GCATGATCGC    (two separate gaps: this is less likely)

• Normal scoring would give the same score for both alignments


Accounting for Gaps

• Gap: a contiguous sequence of spaces in one of the rows
• Score for a gap of length x is -(ρ + σx), where
  – ρ > 0 is the gap opening penalty (the penalty for introducing a gap)
  – σ is the gap extension penalty
• ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.
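The effect of the gap score -(ρ + σx) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the values ρ = 4, σ = 1 are assumptions chosen for the example, not values from the lecture.

```python
def affine_gap_score(x, rho, sigma):
    """Score of a single gap of length x under the affine model -(rho + sigma * x)."""
    return -(rho + sigma * x)


def linear_gap_score(x, sigma):
    """Score of a gap of length x when every space costs sigma (no opening penalty)."""
    return -sigma * x


# Illustrative (assumed) penalties.
rho, sigma = 4, 1

# One gap of length 2 versus two separate gaps of length 1.
one_long_gap = affine_gap_score(2, rho, sigma)
two_short_gaps = 2 * affine_gap_score(1, rho, sigma)

# Under linear scoring both layouts cost the same.
linear_one = linear_gap_score(2, sigma)
linear_two = 2 * linear_gap_score(1, sigma)
```

With these values the single long gap scores higher than the two short gaps under the affine model, while linear scoring cannot tell them apart.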


Affine gap penalty in DP

• When computing s_{i,j}, we need to look at s_{i,j-1}, s_{i,j-2}, s_{i,j-3}, … and s_{i-1,j}, s_{i-2,j}, …
• Each cell needs O(n) time for its update
• There are O(n^2) cells
• Therefore, an O(n^3) algorithm
• We can still do this in O(n^2) time


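Affine Gap Penalty Recurrences

Three scores are kept per cell: s^{\downarrow}_{i,j} for alignments ending in a gap in w, s^{\rightarrow}_{i,j} for alignments ending in a gap in v, and s_{i,j} (the middle level) for the best alignment overall.

$$
s^{\downarrow}_{i,j} = \max
\begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle}
\end{cases}
$$

$$
s^{\rightarrow}_{i,j} = \max
\begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle}
\end{cases}
$$

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion: from top} \\
s^{\rightarrow}_{i,j} & \text{end insertion: from bottom}
\end{cases}
$$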
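The three-level recurrences can be sketched in Python as follows. This is a minimal illustration under assumed parameters (match = 1, mismatch = -1, ρ = 4, σ = 1); the function name and the names `lower`, `upper`, and `middle` for the three levels are choices made for this example. It returns only the optimal score, not the alignment.

```python
NEG_INF = float("-inf")


def affine_alignment_score(v, w, match=1, mismatch=-1, rho=4, sigma=1):
    """Global alignment score of v and w with gap score -(rho + sigma * x)."""
    n, m = len(v), len(w)
    # lower: alignment ends with a gap in w (deletion)
    # upper: alignment ends with a gap in v (insertion)
    # middle: best score overall for the two prefixes
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                lower[i][j] = max(lower[i - 1][j] - sigma,            # continue gap in w
                                  middle[i - 1][j] - (rho + sigma))   # start gap in w
            if j > 0:
                upper[i][j] = max(upper[i][j - 1] - sigma,            # continue gap in v
                                  middle[i][j - 1] - (rho + sigma))   # start gap in v
            diagonal = NEG_INF
            if i > 0 and j > 0:
                delta = match if v[i - 1] == w[j - 1] else mismatch
                diagonal = middle[i - 1][j - 1] + delta               # match or mismatch
            middle[i][j] = max(diagonal, lower[i][j], upper[i][j])

    return middle[n][m]
```

Each cell is filled in constant time from a fixed number of neighbours, which is why the running time stays at O(n^2) instead of the O(n^3) of the naive approach.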


Reading assignment

• Section 6.10 (J & P): Multiple Alignment


Gene Prediction


Gene Prediction: Computational Challenge

• Gene: A sequence of nucleotides coding for a protein
• Gene Prediction Problem: Determine the beginning and end positions of genes in a genome


The Genetic Code

SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm


Codons

• In 1961, Sydney Brenner and Francis Crick discovered frameshift mutations
• They systematically deleted nucleotides from DNA
  – Single and double deletions dramatically altered the protein product
  – Effects of triple deletions were minor
  – Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein
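The frameshift effect can be illustrated by splitting a sequence into triplets before and after a deletion. The sequence and the helper names below are made up for this sketch; no real gene is implied.

```python
def codons(seq):
    """Split seq into consecutive triplets, dropping any incomplete trailing codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def delete(seq, pos, k):
    """Return seq with k nucleotides removed starting at index pos."""
    return seq[:pos] + seq[pos + k:]


# Hypothetical sequence: ATG GCA CTG TTT GAA
seq = "ATGGCACTGTTTGAA"

original = codons(seq)
single_deletion = codons(delete(seq, 3, 1))   # every downstream codon changes
triple_deletion = codons(delete(seq, 3, 3))   # one codon lost, the rest intact
```

After the single deletion every codon downstream of the deletion differs from the original, whereas after the triple deletion the reading frame is preserved and only one codon is missing.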


Great Discovery Provoking Wrong Assumption

• In 1964, Charles Yanofsky and Sydney Brenner proved collinearity in the order of codons with respect to amino acids in proteins
• As a result, it was incorrectly assumed that the triplets encoding amino acid sequences form contiguous strips of information.


Exons and Introns

• In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)
• This makes computational gene prediction in eukaryotes even more difficult
• Prokaryotes don’t have introns: genes in prokaryotes are continuous


Splicing

[Figure: a gene drawn as exon1, intron1, exon2, intron2, exon3, with steps labeled transcription, splicing, and translation. exon = coding, intron = non-coding. Credit: Batzoglou]


Gene prediction

• More difficult in eukaryotes than in prokaryotes (due to introns).
• In the human genome, ~3% of the DNA sequence is genes
• A lot of “junk” DNA lies between genes, and even inside genes (between exons).
• Gene prediction must deal with this.


Gene prediction: broadly speaking

• Statistical approaches: look for features that appear frequently in genes and infrequently elsewhere
• Similarity-based approaches: a newly sequenced gene may be similar to a known gene.
  – Even this is not so simple: the exon structures may differ between otherwise similar genes


Statistical approaches


Splicing Signals

Exons are interspersed with introns. An intron typically begins with GT (the donor site) and ends with AG (the acceptor site), so these dinucleotides flank the exons.


Splice site detection

Donor site (5’ to 3’): nucleotide frequencies (%) by position

Position    -8    …    -2    -1     0     1     2    …    17
A           26    …    60     9     0     1    54    …    21
C           26    …    15     5     0     1     2    …    27
G           25    …    12    78    99     0    41    …    27
T           23    …    13     8     1    98     3    …    25

From lectures by Serafim Batzoglou (Stanford)
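A profile like the one in the table can be turned into a scoring function for candidate donor sites. The sketch below uses only the five columns the table shows explicitly (positions -2 to 2). The log-odds form, the uniform background of 0.25, and the add-one pseudocount are assumptions made for this example, not part of the lecture.

```python
import math

# Percent frequencies copied from the table, positions -2, -1, 0, 1, 2.
DONOR_PROFILE = [
    {"A": 60, "C": 15, "G": 12, "T": 13},  # position -2
    {"A": 9,  "C": 5,  "G": 78, "T": 8},   # position -1
    {"A": 0,  "C": 0,  "G": 99, "T": 1},   # position 0
    {"A": 1,  "C": 1,  "G": 0,  "T": 98},  # position 1
    {"A": 54, "C": 2,  "G": 41, "T": 3},   # position 2
]


def donor_score(window, background=0.25):
    """Log-odds score (in bits) of a 5-nucleotide window against the profile."""
    if len(window) != len(DONOR_PROFILE):
        raise ValueError("window must cover positions -2..2")
    score = 0.0
    for base, column in zip(window, DONOR_PROFILE):
        # Add-one pseudocount (assumed) so that 0% entries do not give log(0).
        p = (column[base] + 1) / (sum(column.values()) + 4)
        score += math.log2(p / background)
    return score
```

A window such as "AGGTA" matches the most frequent base in every column and scores well above zero, while "CCCCC" scores well below zero. Sliding the window along a sequence and thresholding the score gives a simple, and as the later slides note, error-prone splice site detector.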


Consensus splice sites

Donor: 7.9 bits
Acceptor: 9.4 bits


Splicing and gene prediction

• Using splice sites (profiles) to predict genes?
• Limited scope, too many false predictions


Open Reading Frames (ORFs)

• Let us consider gene prediction in prokaryotes (no introns)
• Detect potential coding regions by looking at ORFs
  – A region of length n comprises n/3 codons
  – Stop codons break the genome into segments between consecutive stop codons
  – The subsegments of these that start from the start codon (ATG) are ORFs

[Figure: a genomic sequence with an open reading frame running from ATG to TGA]


ORFs

• 6 reading frames in any given sequence
  – 6 ways to map the DNA sequence to a codon sequence (+1, +2, +3, -1, -2, -3)
  – 3 on either strand
• Look at all 6 reading frames for ORFs
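A six-frame ORF scan can be sketched as below. The function names are invented for this example, the stop codons are the three of the standard genetic code, and an ORF is reported from the first in-frame ATG after a stop up to and including the next in-frame stop, which is one reasonable reading of the definition on the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """List ORFs (ATG ... stop, inclusive) read in the frame starting at offset."""
    found = []
    start = None
    for pos in range(offset, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos                          # first start codon since the last stop
        elif start is not None and codon in STOP_CODONS:
            found.append(seq[start:pos + 3])     # close the ORF at the stop codon
            start = None
    return found


def six_frame_orfs(seq):
    """Map frame labels +1, +2, +3, -1, -2, -3 to the ORFs found in each frame."""
    result = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            result[f"{strand}{offset + 1}"] = orfs_in_frame(strand_seq, offset)
    return result
```

For the toy input "CCATGAAATGACC" the only complete ORF is "ATGAAATGA" in frame +3; the other five frames contain no ATG followed by an in-frame stop.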


Long vs. Short ORFs

• Long open reading frames may be genes
  – At random, we should expect one stop codon every 64/3 ≈ 21 codons
  – However, genes are usually much longer than this
• A basic approach is to scan for ORFs whose length exceeds a certain threshold
  – This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
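The 64/3 figure follows from 3 of the 64 codons being stop codons. Assuming, as the slide implicitly does, that codons in random sequence are independent and uniformly distributed, the chance that a stretch of k codons contains no stop codon can be computed directly. The function name and the value k = 100 are only illustrative.

```python
STOP_FRACTION = 3 / 64          # 3 stop codons out of 64 possible codons


def prob_no_stop(k):
    """Probability that k random codons contain no stop codon (uniform, independent codons assumed)."""
    return (1 - STOP_FRACTION) ** k


expected_spacing = 1 / STOP_FRACTION     # mean number of codons per stop codon
p_100 = prob_no_stop(100)                # chance of a stop-free run of 100 codons
```

Under this model a stop-free run of 100 codons arises by chance less than 1% of the time, which is the rationale for treating long ORFs as gene candidates.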


Codon usage

• In a given sequence (e.g., an ORF), compute the frequency distribution of codons (a 64-element array): the codon usage array
• The codon usage array for coding sequences is different from that for non-coding sequences
• If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene
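Computing a codon usage array is a short counting exercise. In this sketch the array is represented as a dictionary with one entry per codon; the function name and the choice to skip triplets containing characters other than A, C, G, T are assumptions made for the example.

```python
from itertools import product


def codon_usage(seq):
    """Return the relative frequency of each of the 64 codons in seq, read in frame 0."""
    counts = {"".join(c): 0 for c in product("ACGT", repeat=3)}
    total = 0
    for pos in range(0, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if codon in counts:          # skip triplets with ambiguous characters
            counts[codon] += 1
            total += 1
    return {codon: (n / total if total else 0.0) for codon, n in counts.items()}


usage = codon_usage("ATGATGCGC")     # toy input: ATG ATG CGC
```

Two such arrays, one estimated from known coding sequences and one from non-coding sequences, are what an ORF's own array would be compared against.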


Codon usage

• Codons coding for “Arg” in human:
  – CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3%
  – In a coding sequence, codon CGC is about 12 times more likely than codon AGG (38% vs. 3%)
  – An ORF preferring CGC over AGG is likely to be a gene


Codon Usage in Human Genome


Codon usage

• One way to test if an ORF is a gene is to compute
  – Pr(ORF sequence under a coding sequence model)
  – Pr(ORF sequence under a non-coding model)
  – the ratio of the two
• These methods work best in prokaryotes
• The exon-intron trouble is not handled yet
• Hidden Markov models use codon usage ideas and splice site ideas, all in one
  – We’ll see more of this in the second half of the course
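The ratio test can be sketched as a log-likelihood ratio summed over codons. Everything about the two models here is a toy assumption: the coding model reuses the Arg percentages quoted earlier (written with T instead of U), the non-coding model treats the six Arg codons as equally likely, and both models cover only those six codons. A real model would cover all 64 codons with frequencies estimated from data.

```python
import math

# Toy coding model: Arg codon frequencies from the slides, in DNA alphabet.
CODING = {"CGT": 0.37, "CGC": 0.38, "CGA": 0.07, "CGG": 0.10, "AGA": 0.05, "AGG": 0.03}
# Toy non-coding model (assumed): the same six codons, equally likely.
NONCODING = {codon: 1 / 6 for codon in CODING}


def log_likelihood_ratio(orf, coding=CODING, noncoding=NONCODING):
    """Sum over codons of log2( Pr(codon | coding) / Pr(codon | non-coding) )."""
    score = 0.0
    for pos in range(0, len(orf) - 2, 3):
        codon = orf[pos:pos + 3]
        score += math.log2(coding[codon] / noncoding[codon])
    return score
```

A positive score means the sequence is more probable under the coding model; for example "CGCCGCCGT" scores above zero and "AGGAGAAGG" below zero under these toy models.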


Promoter Structure in Prokaryotes (E. coli)

Transcription starts at offset 0.

• Pribnow Box (-10)
• Gilbert Box (-30)
• Ribosomal Binding Site (+10)


Ribosomal Binding Site


Statistical approaches: summary

• Splicing sites
• Codon usage
• Promoter motifs, such as the -10 element and the -30 element
• Ribosome binding site


Similarity based approaches

• Some genomes may be very well studied, with many genes having been experimentally verified.
• Closely related organisms may have similar genes
• Unknown genes in one species may be compared to genes in some closely related species


The basic approach

• Given a protein sequence and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence
• First cut: Find fragments in the genomic sequence that match portions of the protein sequence (local alignment)
• Then find the “optimal” subset of non-overlapping fragments


Exon chaining

• Each of the fragments of the genomic sequence that somewhat match the protein (locally) is a putative exon
• The “goodness” of the match is the “weight” assigned to this putative exon
• Thus, we have a set of weighted intervals (l, r, w): a fragment from l to r, with weight w representing how well it matches (a portion of) the protein


Exon Chaining Problem

• Input: A set of weighted intervals (l, r, w)
• Output: A maximum-weight chain of non-overlapping intervals from this set


Exon Chaining Problem: Graph Representation

• This problem can be solved with dynamic programming in O(n) time.

[Figure: graph over the interval endpoints, with an edge from every l_i to r_i and an edge between every two successive vertices]


Assumptions

• No two intervals have a common boundary point, so the (l_i, r_i) define 2n distinct points if there are n intervals


Exon Chaining Algorithm

ExonChaining (G, n)   // Graph, number of intervals
  for i ← 0 to 2n     // s_0 = 0 is the base case used when i = 1
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
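The pseudocode translates almost line for line into Python. In this sketch the graph is not passed in explicitly; instead the function takes the weighted intervals and builds the left-to-right vertex order itself by sorting the endpoints, which is an assumption about how the input is supplied. The sort costs O(n log n), and the scan that follows is the linear-time dynamic program. The function name and the sample intervals are made up for the example.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    intervals: list of (l, r, w) tuples whose 2n endpoints are all distinct.
    """
    # One vertex per endpoint, in left-to-right order.
    endpoints = sorted(
        [(l, idx, False) for idx, (l, r, w) in enumerate(intervals)]
        + [(r, idx, True) for idx, (l, r, w) in enumerate(intervals)]
    )
    left_vertex = {}                       # interval index -> vertex number of its left end
    s = [0] * (len(endpoints) + 1)         # s[0] = 0 is the base case

    for i, (_, idx, is_right_end) in enumerate(endpoints, start=1):
        if is_right_end:
            j = left_vertex[idx]           # vertex of the matching left end
            w = intervals[idx][2]
            s[i] = max(s[j] + w, s[i - 1])
        else:
            left_vertex[idx] = i
            s[i] = s[i - 1]
    return s[-1]


# Hypothetical putative exons (l, r, w).
best = exon_chaining([(1, 4, 3), (2, 6, 5), (5, 8, 4), (7, 9, 2)])
```

For these four intervals the best chain has weight 7, reached for example by (1, 4, 3) followed by (5, 8, 4).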


Not very helpful

• A chain is a set of non-overlapping exons in order (left to right)
• But the matching protein portions may not be in the same order!


Spliced Alignment

• Begins by selecting either all putative exons between potential acceptor and donor sites or all substrings similar to the target protein (as in the Exon Chaining Problem).
• This set is further filtered in such a way that it attempts to retain all true exons, along with some false ones.
• Then find the chain of exons such that the sequence similarity to the target protein sequence is maximized


Spliced Alignment Problem: Formulation

• Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B.
• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized

Γ*: concatenation of all exons from chain Γ


The DAG

• Vertices: One vertex for each block in B
• Directed edge connecting non-overlapping blocks
• Label of vertex = string of the block it represents
• A path through the DAG spells out the string obtained by concatenating that particular chain of blocks
• Weight of a path is the score of the optimal alignment between the string it spells out and the target sequence


Dynamic programming

• Genomic sequence G = g_1 g_2 … g_n
• Target sequence T = t_1 t_2 … t_m
• As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T
• Problem is, there are many i-prefixes possible (since multiple blocks may include position i)


Idea

• Find the optimal alignment score of the i-prefix of G and the j-prefix of T, assuming that this alignment uses a particular block B at position i
• Denote this score S(i, j, B)
• Compute it for every block B that includes i


Recurrence

If i is not the starting vertex of block B:

$$
S(i, j, B) = \max
\begin{cases}
S(i-1, j, B) - \text{indel penalty} \\
S(i, j-1, B) - \text{indel penalty} \\
S(i-1, j-1, B) + \delta(g_i, t_j)
\end{cases}
$$

If i is the starting vertex of block B:

$$
S(i, j, B) = \max
\begin{cases}
S(i, j-1, B) - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j)
\end{cases}
$$
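The two cases of the recurrence can be combined in a compact Python sketch. Several details are assumptions made for the example and not stated on the slides: blocks are given as half-open index pairs (start, end) into G and are non-empty, a block precedes another when it ends at or before the other's start, any block may be the first in the chain (so an empty chain scoring -j times the indel penalty is always available as a predecessor), and the scoring values are match = 1, mismatch = -1, indel penalty = 1. Only the optimal score is returned.

```python
def spliced_alignment_score(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and the concatenation of a chain of blocks of G."""
    m = len(T)
    # Process blocks by end position so every preceding block is finished first.
    order = sorted(range(len(blocks)), key=lambda b: blocks[b][1])
    end_row = {}                 # block index -> [S(end(B), j, B) for j in 0..m]
    best = float("-inf")

    for b in order:
        start, end = blocks[b]
        # Row just before the block: the empty chain, or the best preceding block.
        prev = [-indel * j for j in range(m + 1)]
        for b2, row in end_row.items():
            if blocks[b2][1] <= start:                    # B' precedes B
                prev = [max(p, r) for p, r in zip(prev, row)]
        # Ordinary global alignment rows for the positions inside the block.
        for i in range(start, end):
            cur = [prev[0] - indel]
            for j in range(1, m + 1):
                delta = match if G[i] == T[j - 1] else mismatch
                cur.append(max(prev[j] - indel,           # g_i aligned to a space
                               cur[j - 1] - indel,        # t_j aligned to a space
                               prev[j - 1] + delta))      # g_i aligned to t_j
            prev = cur
        end_row[b] = prev
        best = max(best, prev[m])
    return best
```

Taking the element-wise maximum over the end rows of all preceding blocks before entering a block is equivalent to the inner maxima in the second case of the recurrence. As a usage example, with G = "AAACCCGGGTTT", T = "AAAGGG", and blocks [(0, 3), (3, 6), (6, 9)], the best chain skips the middle block and spells "AAAGGG", giving a score of 6.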