dynamic programming (cont’d)veda.cs.uiuc.edu/courses/fa08/cs466/lectures/lecture7.pdf ·...

Dynamic Programming (cont’d)

CS 466
Saurabh Sinha

Affine Gap Penalties

• In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:

Normal scoring would give the same score for both alignments:

ATA__GGCATGATCGC   (one gap of length 2: this is more likely)

ATA_G_GCATGATCGC   (two gaps of length 1: this is less likely)

Accounting for Gaps

• Gaps: contiguous sequences of spaces in one of the rows

• Score for a gap of length x is -(ρ + σx), where
– ρ > 0 is the gap opening penalty, charged once for introducing a gap
– σ is the gap extension penalty, charged for each space in the gap
– ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap
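A minimal sketch of the gap score above. The function name and the values ρ = 10, σ = 1 are assumptions chosen only for illustration; they show why one gap of length 2 scores better than two gaps of length 1.

```python
def affine_gap_score(x, rho, sigma):
    """Score of a single gap of length x under the affine model: -(rho + sigma * x)."""
    if x <= 0:
        return 0  # no gap, no penalty
    return -(rho + sigma * x)


# Illustrative (assumed) penalties: rho = 10 to open, sigma = 1 to extend.
one_gap_of_two = affine_gap_score(2, rho=10, sigma=1)        # a single 2-space gap
two_gaps_of_one = 2 * affine_gap_score(1, rho=10, sigma=1)   # two separate 1-space gaps
```

With these values the single gap costs 12 and the two separate gaps cost 22, so the affine model prefers the first alignment.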

Affine gap penalty in DP

• When computing $s_{i,j}$, we need to look at $s_{i,j-1}, s_{i,j-2}, s_{i,j-3}, \ldots$ and $s_{i-1,j}, s_{i-2,j}, \ldots$

• Each cell needs $O(n)$ time for update
• $O(n^2)$ cells
• Therefore, an $O(n^3)$ algorithm
• We can still do this in $O(n^2)$ time

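Affine Gap Penalty Recurrences

Here $s^{\downarrow}_{i,j}$ is the best score of an alignment ending in a gap in w, $s^{\rightarrow}_{i,j}$ is the best score of an alignment ending in a gap in v, and $s_{i,j}$ is the overall (middle) score.

$$s^{\downarrow}_{i,j} = \max \begin{cases} s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\ s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle} \end{cases}$$

$$s^{\rightarrow}_{i,j} = \max \begin{cases} s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\ s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle} \end{cases}$$

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\ s^{\downarrow}_{i,j} & \text{end deletion: from top} \\ s^{\rightarrow}_{i,j} & \text{end insertion: from bottom} \end{cases}$$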
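The three-level recurrences can be sketched as follows. This is an illustrative implementation, not the course's reference code: the function name, the match/mismatch scores standing in for δ, and the default ρ and σ are all assumptions. It fills three tables in time proportional to the number of cells.

```python
NEG_INF = float("-inf")


def affine_alignment_score(v, w, match=1, mismatch=-1, rho=10, sigma=1):
    """Global alignment score of v and w with gap score -(rho + sigma * x)."""
    n, m = len(v), len(w)
    # middle: overall best score; lower: ends in a gap in w (deletion);
    # upper: ends in a gap in v (insertion).
    middle = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:  # continue a deletion, or start one from the middle level
                lower[i][j] = max(lower[i - 1][j] - sigma,
                                  middle[i - 1][j] - (rho + sigma))
            if j > 0:  # continue an insertion, or start one from the middle level
                upper[i][j] = max(upper[i][j - 1] - sigma,
                                  middle[i][j - 1] - (rho + sigma))
            diagonal = NEG_INF
            if i > 0 and j > 0:  # match or mismatch
                delta = match if v[i - 1] == w[j - 1] else mismatch
                diagonal = middle[i - 1][j - 1] + delta
            middle[i][j] = max(diagonal, lower[i][j], upper[i][j])
    return middle[n][m]
```

For example, aligning "ACGGT" with "ACT" under these assumed scores deletes "GG" as one gap: three matches minus a gap penalty of 12.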

Reading assignment
Section 6.10 (J & P): Multiple Alignment

Gene Prediction

• Gene: A sequence of nucleotides coding for protein

• Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

Gene Prediction: Computational Challenge

[Figure: The Genetic Code]
SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm

Codons

• In 1961, Sydney Brenner and Francis Crick discovered frameshift mutations

• Systematically deleted nucleotides from DNA
– Single and double deletions dramatically altered the protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein

Great Discovery Provoking Wrong Assumption

• In 1964, Charles Yanofsky and Sydney Brenner proved collinearity in the order of codons with respect to amino acids in proteins

• As a result, it was incorrectly assumed that the triplets encoding amino acid sequences form contiguous strips of information.

Exons and Introns

• In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

• This makes computational gene prediction in eukaryotes even more difficult

• Prokaryotes don’t have introns: genes in prokaryotes are continuous

Splicing

[Figure: a gene drawn as exon1, intron1, exon2, intron2, exon3, with the steps transcription, splicing, and translation. exon = coding, intron = non-coding. Credit: Batzoglou]

Gene prediction

• More difficult in eukaryotes than in prokaryotes (due to introns).

• In the human genome, ~3% of the DNA sequence is genes

• Lots of “junk” DNA between genes, and even inside genes (between exons).

• Gene prediction must deal with this.

Gene prediction: broadly speaking

• Statistical approaches: look for features that appear frequently in genes and infrequently elsewhere

• Similarity-based approaches: a newly sequenced gene may be similar to a known gene.
– Even this is not so simple. The exon structures may be different between otherwise similar genes

Statistical approaches

Splicing Signals

Exons are interspersed with introns, which typically begin with GT and end with AG

Splice site detection

Donor site (5’ → 3’): nucleotide frequency (%) by position

%      -8    …    -2    -1     0     1     2    …    17
A      26    …    60     9     0     1    54    …    21
C      26    …    15     5     0     1     2    …    27
G      25    …    12    78    99     0    41    …    27
T      23    …    13     8     1    98     3    …    25

From lectures by Serafim Batzoglou (Stanford)
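A profile like the one above can be used to score candidate donor sites. The sketch below is illustrative only: it uses just the five fully specified columns (positions -2 to 2), and the uniform background, the pseudocount, and all names are assumptions rather than part of the lecture.

```python
import math

# Percentages from the table for positions -2, -1, 0, 1, 2.
# The elided columns ("…") are deliberately left out.
DONOR_PROFILE = {
    "A": [60, 9, 0, 1, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 99, 0, 41],
    "T": [13, 8, 1, 98, 3],
}
WINDOW = 5


def donor_log_odds(window, background=0.25, pseudo=0.5):
    """Log2-odds score of a 5-nt window against a uniform background.

    The pseudocount (an assumption) keeps the 0% entries from producing log(0).
    """
    if len(window) != WINDOW:
        raise ValueError("window must cover positions -2..2 (5 nucleotides)")
    score = 0.0
    for pos, base in enumerate(window.upper()):
        p = (DONOR_PROFILE[base][pos] + pseudo) / (100 + 4 * pseudo)
        score += math.log2(p / background)
    return score
```

A window such as "AGGTA", which carries GT at positions 0 and 1, scores well above zero, while "CCCCC" scores well below it.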

Consensus splice sites

Donor: 7.9 bits
Acceptor: 9.4 bits

Splicing and gene prediction

• Using splice sites (profiles) to predict genes?

• Limited scope, too many false predictions

Open Reading Frames (ORFs)

• Let us consider gene prediction in prokaryotes (no introns)

• Detect potential coding regions by looking at ORFs
– A region of length n comprises n/3 codons
– Stop codons break the genome into segments between consecutive stop codons
– The subsegments of these that start from the start codon (ATG) are ORFs

[Figure: genomic sequence with an open reading frame running from ATG to TGA]

ORFs

• 6 reading frames in any given sequence
– 6 ways to map the DNA sequence to a codon sequence (+1, +2, +3, -1, -2, -3)
– 3 on either strand

• Look at all 6 reading frames for ORFs
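Scanning all six reading frames can be sketched as below. The function names and the coordinate convention are assumptions: each ORF runs from the first ATG after a stop codon through the next in-frame stop codon, and coordinates for the negative frames refer to the reverse-complemented strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(strand, offset):
    """Yield (start, end) for each ORF in one frame; end is just past the stop codon."""
    start = None
    for i in range(offset, len(strand) - 2, 3):
        codon = strand[i:i + 3]
        if start is None and codon == START_CODON:
            start = i                      # first ATG since the last stop
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(dna, min_codons=0):
    """Return (frame, start, end, sequence) for ORFs in frames +1..+3 and -1..-3."""
    dna = dna.upper()
    results = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand, offset):
                if (end - start) // 3 >= min_codons:
                    results.append((sign * (offset + 1), start, end, strand[start:end]))
    return results
```

The `min_codons` parameter is a hypothetical hook for the length threshold; it counts the stop codon as part of the ORF.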

Long vs. Short ORFs

• Long open reading frames may be a gene
– At random, we should expect one stop codon every 64/3 ≈ 21 codons
– However, genes are usually much longer than this

• A basic approach is to scan for ORFs whose length exceeds a certain threshold
– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
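The 64/3 estimate, and the reason long ORFs are suspicious, can be checked with a few lines. This sketch assumes that all 64 codons are equally likely and independent, which real genomes only approximate; the names are illustrative.

```python
STOP_FRACTION = 3 / 64  # 3 of the 64 codons are stop codons


def expected_codons_between_stops():
    """Mean spacing between stop codons in a random sequence: 64/3."""
    return 1 / STOP_FRACTION


def prob_open_by_chance(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - STOP_FRACTION) ** k
```

Under this model a stretch of 100 codons with no stop arises by chance less than 1% of the time, which is why a length threshold filters out most random ORFs.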

Codon usage

• In a given sequence (e.g., an ORF), compute the frequency distribution of codons (64-element array): the codon usage array

• The codon usage array for coding sequences is different from that for non-coding sequences

• If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene
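A codon usage array can be computed as in the sketch below. The function name and the choice to return a dictionary keyed by codon (rather than a positional 64-element array) are assumptions made for readability.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order.
ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_usage(seq):
    """Relative frequency of each of the 64 codons in seq, read from position 0."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS)  # ignores codons with non-ACGT letters
    return {c: (counts[c] / total if total else 0.0) for c in ALL_CODONS}
```

For the sequence "ATGCGCCGCTGA" the codons are ATG, CGC, CGC, TGA, so CGC has frequency 0.5.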

Codon usage

• Codons coding for “Arg” in human:
– CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3%
– In a coding sequence, codon CGC is about 12 times more likely than codon AGG
– An ORF preferring CGC over AGG is likely to be a gene

Codon Usage in Human Genome

Codon usage

• One way to test if an ORF is a gene is to compute
– Pr(ORF sequence under a coding sequence model)
– Pr(ORF sequence under a non-coding model)
– Ratio of the two.

• These methods work best in prokaryotes
• The exon-intron trouble is not handled yet
• Hidden Markov models that use codon usage ideas and splice site ideas, all in one
– We’ll see more of this in second half of course
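The ratio test can be sketched as a log-likelihood ratio, assuming codons are drawn independently under each model. The toy models below cover only the six Arg codons (written with T for U); the coding frequencies are the percentages quoted on the earlier slide, while the uniform non-coding model and all names are assumptions for illustration.

```python
import math


def log_likelihood_ratio(orf, coding_freq, noncoding_freq):
    """log2 of Pr(orf | coding model) / Pr(orf | non-coding model)."""
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        score += math.log2(coding_freq[codon] / noncoding_freq[codon])
    return score


# Toy models restricted to Arg codons; a real model would cover all 64 codons.
ARG_CODING = {"CGT": 0.37, "CGC": 0.38, "CGA": 0.07,
              "CGG": 0.10, "AGA": 0.05, "AGG": 0.03}
ARG_NONCODING = {codon: 1 / 6 for codon in ARG_CODING}  # assumed uniform
```

A positive score favors the coding model: a run of CGC codons scores above zero, a run of AGG codons below zero.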

Promoter Structure in Prokaryotes (E. coli)

Transcription starts at offset 0.

• Pribnow Box (-10)

• Gilbert Box (-30)

• Ribosomal Binding Site (+10)

Ribosomal Binding Site

Statistical approaches: summary

• Splicing sites
• Codon usage
• Promoter motifs, such as the -10 element and the -30 element
• Ribosome binding site

Similarity based approaches

• Some genomes may be very well-studied, with many genes having been experimentally verified.

• Closely-related organisms may have similar genes

• Unknown genes in one species may be compared to genes in some closely-related species

The basic approach

• Given a protein sequence and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence

• First cut: Find fragments in the genomic sequence that match portions of the protein sequence (local alignment)

• Then find the “optimal” subset of non-overlapping fragments

Exon chaining

• Each of the fragments of the genomic sequence that somewhat match the protein (locally) is a putative exon

• The “goodness” of the match is the “weight” assigned to this putative exon

• Thus, we have a set of weighted intervals (l, r, w): a fragment from l to r, with weight w representing how well it matches (a portion of) the protein

Exon Chaining Problem

• Input: A set of weighted intervals (l, r, w)
• Output: A maximum-weight chain of non-overlapping intervals from this set

Exon Chaining Problem: Graph Representation

• This problem can be solved with dynamic programming in O(n) time.

[Figure: graph on the interval endpoints, with an edge from every l_i to r_i and an edge between every two successive vertices]

Assumptions

• No two intervals have a common boundary point, so the (l_i, r_i) define 2n distinct points if there are n intervals

Exon Chaining Algorithm

ExonChaining(G, n)   // graph, number of intervals
  for i ← 0 to 2n   // s_0 = 0 is the base case
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
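The pseudocode translates fairly directly into Python. In this sketch the function takes the weighted intervals themselves instead of a prebuilt graph and sorts the 2n endpoints first; that input format, the function name, and the example intervals are assumptions for illustration. The sort costs O(n log n); the scan that follows is linear.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    intervals: list of (l, r, w) tuples whose 2n endpoints are all distinct.
    """
    # Build the 2n vertices: (coordinate, is_right_end, interval index).
    endpoints = []
    for k, (left, right, _) in enumerate(intervals):
        endpoints.append((left, False, k))
        endpoints.append((right, True, k))
    endpoints.sort()

    s = [0] * (len(endpoints) + 1)   # s[0] = 0 is the base case
    left_vertex = {}                 # interval index -> vertex index of its left end
    for i, (_, is_right, k) in enumerate(endpoints, start=1):
        if is_right:
            j = left_vertex[k]
            s[i] = max(s[j] + intervals[k][2], s[i - 1])
        else:
            left_vertex[k] = i
            s[i] = s[i - 1]
    return s[-1]


example = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10), (7, 17, 12),
           (9, 10, 1), (11, 15, 7), (13, 14, 0), (16, 18, 4)]
best = exon_chaining(example)  # best chain weight for the made-up intervals above
```

On these made-up intervals the best chain is (2,3), (4,8), (9,10), (11,15), (16,18), with total weight 3 + 6 + 1 + 7 + 4 = 21.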

Not very helpful

• A chain is a set of non-overlapping exons in order (left to right)

• But the matching protein portions may not be in the same order!

Spliced Alignment

• Begins by selecting either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem).

• This set is further filtered in such a way as to retain all true exons, even if some false ones remain.

• Then find the chain of exons such that the sequence similarity to the target protein sequence is maximized

Spliced Alignment Problem: Formulation

• Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B.

• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized

Γ*: the concatenation of all exons from chain Γ

The DAG

• Vertices: One vertex for each block in B
• Directed edge connecting non-overlapping blocks
• Label of vertex = string of the block it represents
• A path through the DAG spells out the string obtained by concatenating that particular chain of blocks
• Weight of a path is the score of the optimal alignment between the string it spells out and the target sequence

Dynamic programming

• Genomic sequence $G = g_1 g_2 \ldots g_n$
• Target sequence $T = t_1 t_2 \ldots t_m$
• As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T
• Problem is, there are many i-prefixes possible (since multiple blocks may include position i)

Idea

• Find the optimal alignment score of the i-prefix of G and the j-prefix of T, assuming that this alignment uses a particular block B at position i
• Call this score S(i, j, B)
• Compute it for every block B that includes i

Recurrence

If i is not the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i-1, j, B) - \text{indel penalty} \\ S(i, j-1, B) - \text{indel penalty} \\ S(i-1, j-1, B) + \delta(g_i, t_j) \end{cases}$$

If i is the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i, j-1, B) - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\text{end}(B'), j, B') - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\text{end}(B'), j-1, B') + \delta(g_i, t_j) \end{cases}$$
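The recurrence can be sketched in Python as below. Several details are assumptions not stated on the slides: blocks are given as (first, last) pairs of 1-based inclusive positions in G, δ is a simple match/mismatch score, a block may also start a chain (so the "previous row" includes the score of aligning an empty chain with the j-prefix of T), and the function name is hypothetical. Only the last row of each block is stored, since later blocks read only S(end(B'), j, B').

```python
def spliced_alignment(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and a chain of blocks of G."""
    m = len(T)
    if not blocks:
        return -m * indel                      # empty chain: T aligned to nothing

    def delta(a, b):
        return match if a == b else mismatch

    last_row = {}                              # block -> [S(end(B), j, B) for j = 0..m]
    # A preceding block ends before this one starts, so sorting by end is enough.
    for first, last in sorted(blocks, key=lambda b: b[1]):
        # Row "before" the block: empty chain (assumed base case) or any preceding block.
        prev = [-j * indel for j in range(m + 1)]
        for (_, other_last), row in last_row.items():
            if other_last < first:
                prev = [max(a, b) for a, b in zip(prev, row)]
        for i in range(first, last + 1):       # rows of S(i, ., B)
            g = G[i - 1]
            cur = [prev[0] - indel]
            for j in range(1, m + 1):
                cur.append(max(cur[j - 1] - indel,            # S(i, j-1, B) - indel
                               prev[j] - indel,               # row above - indel
                               prev[j - 1] + delta(g, T[j - 1])))
            prev = cur
        last_row[(first, last)] = prev
    return max(row[m] for row in last_row.values())
```

For G = "ACGTTTTGCA" with blocks (1,3), (4,6), (8,10) and target "ACGGCA", the best chain skips the middle block and spells "ACGGCA" exactly.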
