class 3: sequence similarity

Motivation

• Same gene, or similar gene

• Suffix of A similar to prefix of B?

• Suffix of A similar to prefix of B..Z?

• Longest similar substring of A, B

• Longest similar substring of A, B..Z

• For each: how big? How similar?

Define alignment

• Align these two sequences optimally:

GACGGATT

GATCGGTT

• Define precisely what an alignment is

Definition of alignment

• Insert spaces so that the letters line up, or letters align with spaces

GA-CGGATT

GATCGG-TT

• Don’t allow a space to line up with a space

• Allow spaces even at beginning and end

GCAT-

-CATG
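The definition can be checked mechanically. A minimal sketch, assuming "-" stands for a space and using a hypothetical helper name `is_alignment` (not from the slides):

```python
def is_alignment(top, bottom, s, t, gap="-"):
    """Return True if (top, bottom) is a valid alignment of s and t."""
    # Both rows must have the same number of columns.
    if len(top) != len(bottom):
        return False
    # Removing the spaces must give back the original sequences.
    if top.replace(gap, "") != s or bottom.replace(gap, "") != t:
        return False
    # No column may hold a space in both rows.
    return all(not (a == gap and b == gap) for a, b in zip(top, bottom))


print(is_alignment("GA-CGGATT", "GATCGG-TT", "GACGGATT", "GATCGGTT"))
print(is_alignment("GCAT-", "-CATG", "GCAT", "CATG"))
```

Both examples from the slides pass; a pair of rows with a space over a space would be rejected.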

Define similarity

• Given an alignment, compute a similarity score

• Three possibilities for each column

letter-letter match

letter-letter mismatch

letter-space mismatch

Optimal alignment

• Create score function

• Conventionally:

+1 bonus for match

-1 penalty for letter-letter mismatch

-2 penalty for letter-space mismatch
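Scoring a given alignment is a column-by-column sum. A minimal sketch using the conventional +1 / -1 / -2 values; the function name `score_alignment` and the use of "-" for a space are assumptions made for illustration:

```python
def score_alignment(top, bottom, match=1, mismatch=-1, space=-2, gap="-"):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("rows of an alignment must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            total += space      # letter-space column
        elif a == b:
            total += match      # letter-letter match
        else:
            total += mismatch   # letter-letter mismatch
    return total


print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 7 matches, 2 spaces -> 3
```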

Dynamic programming solution

• Given sequences s,t of length m,n

• Strategy: build up optimal alignment of prefixes

• Base case?

• Recurrence relation?

Recurrence

• Given optimal alignments of the prefixes of s, t shorter than i, j, find the optimal alignment of s[1..i], t[1..j]

• Three possibilities for the last column:

– extend s by a letter, t by a space

– extend s by a letter, t by a letter

– extend s by a space, t by a letter
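One way to write the base case and the recurrence, assuming the conventional +1 / -1 / -2 scores. The names a[i,j] (best score of s[1..i] against t[1..j]) and p(i,j) (+1 if s[i] = t[j], otherwise -1) are notation chosen here, not taken from the slides:

```latex
\begin{aligned}
a[i,0] &= -2i, \qquad a[0,j] = -2j,\\
a[i,j] &= \max
\begin{cases}
a[i-1,j] - 2 & \text{(letter of } s \text{ over a space)}\\
a[i-1,j-1] + p(i,j) & \text{(letter of } s \text{ over a letter of } t\text{)}\\
a[i,j-1] - 2 & \text{(space over a letter of } t\text{)}
\end{cases}
\end{aligned}
```

The similarity of s and t is then a[m,n].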

Tiny instance -- AGC, AAAC

          A    A    A    C
     0   -2   -4   -6   -8
A   -2
G   -4
C   -6
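The rest of the table can be filled in mechanically. A minimal sketch, assuming rows index the prefixes of AGC, columns index the prefixes of AAAC, and the +1 / -1 / -2 scores; `similarity_table` is a name chosen here:

```python
def similarity_table(s, t, match=1, mismatch=-1, space=-2):
    """Fill the (len(s)+1) x (len(t)+1) table of best prefix-alignment scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned against nothing costs one space per letter.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    # Row by row, each entry depends on its upper, left and upper-left neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)
    return a


table = similarity_table("AGC", "AAAC")
for row in table:
    print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right entry, -1, is the similarity of the two sequences.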

Some dp details

• What is a good order to fill the array?

• How do you recover the opt alignment?

• What do you do about ties?

• What is the space complexity of this algorithm?

• What is the time complexity of this algorithm?
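One possible set of answers, as a sketch: fill row by row, then walk back from the bottom-right corner, at each step taking any move that explains the current entry. The tie-breaking order here (diagonal, then up, then left) is an arbitrary choice, and `global_align` is a name chosen for illustration:

```python
def global_align(s, t, match=1, mismatch=-1, space=-2):
    """Return (score, top_row, bottom_row) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + space,
                          a[i][j - 1] + space)

    # Traceback: rebuild the columns from last to first.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("AGC", "AAAC")
print(score)
print(top)
print(bottom)
```

Filling the table takes O(mn) time and O(mn) space; the traceback adds O(m + n) steps. If only the score is needed, two rows of the table suffice.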

The gap penalty

• Model above assumes two gaps of size 1 are equivalent to one gap of size 2

• Is this realistic? Why or why not?

General gap penalties

• Alignments can no longer be scored as the sum of their parts

• They still are the sum of blocks with one matched letter or one gap each

• Blocks are: matched letters, s-gap, t-gap

A|A|C|---|A|GAT|A|A|C

A|C|T|CGG|T|---|A|A|T
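Block-wise scoring can be sketched as follows. The assumptions here are that letter pairs score +1 / -1, that each maximal gap of length k costs w(k) for a caller-supplied function w, and that "-" stands for a space; the function name is hypothetical:

```python
from itertools import groupby


def score_with_gap_function(top, bottom, w, match=1, mismatch=-1, gap="-"):
    """Score an alignment block by block; w(k) is the penalty for a gap of length k."""
    def kind(col):
        a, b = col
        if a == gap:
            return "s-gap"   # spaces in the top row
        if b == gap:
            return "t-gap"   # spaces in the bottom row
        return "letters"

    total = 0
    # groupby collects maximal runs of consecutive columns of the same kind.
    for k, cols in groupby(zip(top, bottom), key=kind):
        cols = list(cols)
        if k == "letters":
            total += sum(match if a == b else mismatch for a, b in cols)
        else:
            total -= w(len(cols))
    return total


top, bottom = "AAC---AGATAAC", "ACTCGGT---AAT"
print(score_with_gap_function(top, bottom, lambda k: 2 * k))   # linear gaps
print(score_with_gap_function(top, bottom, lambda k: 2 + k))   # affine gaps
```

With w(k) = 2k this reproduces the column-by-column score; any other w makes the score depend on how the spaces are grouped.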

DP for general gaps

• Requires three arrays, one for each block type

• Time complexity is cubic: O(mn(m+n)), since each entry must consider every possible length of the final gap

• This is expensive at best, prohibitive for large problems

• See Setubal/Meidanis 3.3.2 for details

Affine gap penalty

• Charge h for opening each gap, plus g per space: w(k) = h + g * k for a gap of length k

• This still has quadratic complexity!

• See Setubal/Meidanis
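A sketch of the quadratic-time idea, assuming the usual three-array formulation: a for alignments ending in a letter pair, b for those ending in a space in s, c for those ending in a space in t. The array names and the function name are choices made here, and h and g are given as positive penalties:

```python
NEG_INF = float("-inf")


def affine_similarity(s, t, h, g, match=1, mismatch=-1):
    """Best global alignment score when a gap of length k costs h + g * k."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with letter over letter
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with space over t[j]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with s[i] over space
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Extending a gap of the same type costs g; opening a new one costs h + g.
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])


print(affine_similarity("AGC", "AAAC", h=0, g=2))   # h = 0 reduces to the linear model
print(affine_similarity("AAAA", "AA", h=2, g=1))    # one gap of length 2 beats two of length 1
```

Each entry looks at a constant number of neighbours, so the running time stays O(mn).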

Point accepted mutations

• Some mutations are more likely than others

• In proteins, some amino acids are more similar than others (size, charge, hydrophobicity)

• A point accepted mutation matrix is a table with the probability of each transition in a fixed time

PAM matrices

• The entire matrix sums to 1

• A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change

Scoring matrix

• Consider aligned letters a,b

• Pr(b is a mutation of a) = Mab

• Pr(b is a random occurrence) = pb

• Score(a,b) = 10 log10(Mab / pb)
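A small numeric sketch of the score formula, assuming the logarithm is base 10 (the usual PAM convention). The probabilities below are made-up illustrative values, not entries of a real PAM matrix, and `log_odds_score` is a name chosen here:

```python
import math


def log_odds_score(m_ab, p_b):
    """10 * log10 of (probability b arose from a) / (probability b occurs by chance)."""
    return 10 * math.log10(m_ab / p_b)


# Hypothetical values: b is twice as likely next to a as it is by chance.
print(round(log_odds_score(0.10, 0.05)))
# Hypothetical values: b is ten times less likely than chance.
print(round(log_odds_score(0.005, 0.05)))
```

A positive score means the pair is seen more often than chance would predict, a negative score less often, and a score of 0 means no evidence either way. Taking logarithms lets the scores of independent columns be added.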

Blast

• Basic Local Alignment Search Tool

• Def: a ‘segment’ is a substring (contiguous, without gaps)

• Def: ‘segment pair’ is two segments of equal length

• Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs

What Blast does

• Input:

– a PAM matrix

– a database of sequences B

– a query sequence A

– a threshold S

• Output:

– all segment pairs (A,B) with score > S

How Blast works

• Compile short, high-scoring strings (words)

• Search for hits -- each hit gives a seed

• Extend seeds
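A toy sketch of the seed-and-extend idea, not the actual Blast implementation. It assumes exact w-mer matches as words, +1 / -1 letter scores instead of a PAM matrix, and a simple drop-off rule (stop once the running score falls `xdrop` below the best seen); all names and parameters are hypothetical:

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Yield (i, j) such that query[i:i+w] == subject[j:j+w]."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)       # hash every word of the query
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            yield i, j                        # each hit gives a seed


def extend(query, subject, i, j, w, match=1, mismatch=-1, xdrop=3):
    """Extend a seed without gaps; return (score, query_start, subject_start, length)."""
    def pair(a, b):
        return match if a == b else mismatch

    best = score = w * match
    left = right = 0
    k = 0
    while i + w + k < len(query) and j + w + k < len(subject):   # to the right
        score += pair(query[i + w + k], subject[j + w + k])
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= xdrop:
            break
    score = best
    k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:                     # to the left
        score += pair(query[i - 1 - k], subject[j - 1 - k])
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= xdrop:
            break
    return best, i - left, j - left, w + left + right


query, subject = "GGACGTACC", "TTACGTAGG"
seeds = sorted(find_seeds(query, subject, 3))
print(seeds)
print(extend(query, subject, 2, 2, 3))
```

Here the seed at (2, 2) grows into the segment pair ACGTA / ACGTA with score 5; reporting only pairs with score above S gives the output described above.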

Blast on proteins

• Words are w-mers which score at least T against some w-mer of A

• Use hashing or a DFA (deterministic finite automaton) to search for hits

• Extend seed until heuristically determined limit is reached

Blast on nucleic acids

• Words are w-mers in query A

• Letters are compressed, four to a byte

• Filter database B for very common words to avoid false positives

• Extend seeds as in proteins
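Four letters fit in a byte because each nucleotide needs only 2 bits. A minimal sketch, assuming the encoding A=00, C=01, G=10, T=11 and zero padding of the last byte; the actual bit layout used by Blast is not given in these notes:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTER = "ACGT"


def pack(seq):
    """Pack a string over ACGT into bytes, four letters per byte."""
    out = bytearray()
    for k in range(0, len(seq), 4):
        chunk = seq[k:k + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))   # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` letters from packed bytes."""
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            letters.append(LETTER[(byte >> shift) & 3])
    return "".join(letters[:length])


packed = pack("GATTACA")
print(len(packed))            # 7 letters fit in 2 bytes
print(unpack(packed, 7))
```

The original length has to be stored separately, since the padding bits are indistinguishable from the letter A.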

What does Blast give you?

• Efficiency

• A rigorous statistical theory which gives the probability of a segment pair occurring by chance

Homework

• Given sequences s,t of length m,n, how many alignments do they have?

• Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.
