Class 3: Sequence similarity

Upload: arissa

Post on 18-Jan-2016

Page 1: Class 3: Sequence similarity

Class 3: Sequence similarity

Page 2: Class 3: Sequence similarity

Motivation

• Same gene, or similar gene

• Suffix of A similar to prefix of B?

• Suffix of A similar to prefix of B..Z?

• Longest similar substring of A, B

• Longest similar substring of A, B..Z

• For each, How big? How similar?

Page 3: Class 3: Sequence similarity

Define alignment

• Align these two sequences optimally

GACGGATT

GATCGGTT

• Define precisely what an alignment is

Page 4: Class 3: Sequence similarity

Definition of alignment

• Insert spaces so that the letters line up, or letters align with spaces

GA-CGGATT

GATCGG-TT

• Don’t allow a space to line up with a space

• Allow spaces even at beginning and end

GCAT-

-CATG

Page 5: Class 3: Sequence similarity

Define similarity

• Given an alignment, compute a similarity score

• Three possibilities for each column

letter-letter match

letter-letter mismatch

letter-space mismatch

Page 6: Class 3: Sequence similarity

Optimal alignment

• Create score function

• Conventionally:

+1 bonus for match

-1 penalty for letter-letter mismatch

-2 penalty for letter-space mismatch
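The scoring rule can be sketched in a few lines of Python. This is only an illustration under the conventional values listed above; the function name `score_alignment` and its parameter names are assumptions made here, not part of the slides.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned strings column by column ('-' marks a space)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a space may not be aligned with a space")
        if x == "-" or y == "-":
            total += gap        # letter-space column
        elif x == y:
            total += match      # letter-letter match
        else:
            total += mismatch   # letter-letter mismatch
    return total


# The alignment from the "Definition of alignment" slide:
# 7 matches and 2 letter-space columns.
print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 3
```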

Page 7: Class 3: Sequence similarity

Dynamic programming solution

• Given sequences s,t of length m,n

• Strategy: build up optimal alignment of prefixes

• Base case?

• Recurrence relation?

Page 8: Class 3: Sequence similarity

Recurrence

• Given optimal alignments of all shorter prefixes of s and t, find the optimal alignment of s[1..i] and t[1..j]

• Three possibilities:
– extend s by a letter, t by a space
– extend s by a letter, t by a letter
– extend s by a space, t by a letter
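Written out, the three cases give the recurrence below. It is a sketch under the conventional scoring from the "Optimal alignment" slide. The symbols a, p and g are introduced here for illustration: a[i,j] is the best score for s[1..i] against t[1..j], p(x,y) is +1 for a match and -1 for a mismatch, and g = -2 is the score of a letter-space column.

```latex
\[
a[i,j] = \max
\begin{cases}
a[i-1,j] + g & \text{(extend } s \text{ by a letter, } t \text{ by a space)} \\
a[i-1,j-1] + p(s_i, t_j) & \text{(extend } s \text{ by a letter, } t \text{ by a letter)} \\
a[i,j-1] + g & \text{(extend } s \text{ by a space, } t \text{ by a letter)}
\end{cases}
\qquad a[i,0] = i\,g, \quad a[0,j] = j\,g .
\]
```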

Page 9: Class 3: Sequence similarity

Tiny instance -- AGC, AAAC

         A    A    A    C
    0   -2   -4   -6   -8
A  -2
G  -4
C  -6
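The remaining cells can be filled in mechanically. The sketch below assumes the conventional scores (+1, -1, -2) and puts AGC on the rows and AAAC on the columns; `nw_matrix` is a name chosen here for illustration.

```python
def nw_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment table row by row and return it."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # base case: prefix of s against nothing
        a[i][0] = i * gap
    for j in range(1, n + 1):      # base case: prefix of t against nothing
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + gap,       # letter of s over a space
                          a[i - 1][j - 1] + pair,  # letter over letter
                          a[i][j - 1] + gap)       # space over a letter of t
    return a


for row in nw_matrix("AGC", "AAAC"):
    print(row)
```

The bottom-right cell holds the score of an optimal alignment, which is -1 for this instance.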

Page 10: Class 3: Sequence similarity

Some dp details

• What is a good order to fill the array?

• How do you recover the optimal alignment?

• What do you do about ties?

• What is the space complexity of this algorithm?

• What is the time complexity of this algorithm?
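One possible set of answers, sketched in Python: fill the table row by row, then walk back from the bottom-right cell, at each step taking any move that explains the cell's value. Ties are broken here by preferring the diagonal, which is an arbitrary choice made for this sketch. Keeping the whole table costs O(mn) space, and filling it costs O(mn) time. The name `align` and the default scores are assumptions.

```python
def align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + gap,
                          a[i - 1][j - 1] + pair,
                          a[i][j - 1] + gap)

    # Traceback: rebuild the alignment from the last column to the first.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + pair:   # tie-break: diagonal first
                top.append(s[i - 1])
                bottom.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + gap:  # letter of s over a space
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:                                       # space over a letter of t
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = align("AGC", "AAAC")
print(score)
print(top)
print(bottom)
```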

Page 11: Class 3: Sequence similarity

The gap penalty

• Model above assumes two gaps of size 1 are equivalent to one gap of size 2

• Is this realistic? Why or why not?

Page 12: Class 3: Sequence similarity

General gap penalties

• Alignments can no longer be scored as the sum of their parts

• They still are the sum of blocks with one matched letter or one gap each

• Blocks are: matched letters, s-gap, t-gap

A|A|C|---|A|GAT|A|A|C

A|C|T|CGG|T|---|A|A|T
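The block decomposition can be sketched as follows. It assumes the top row is s, so a run of spaces in the top row is treated as an s-gap. The helper names (`kind`, `split_blocks`, `score_blocks`) and the gap function `w` are hypothetical; `w(k)` is the penalty for a gap of length k and can be any function.

```python
def kind(x, y):
    """Classify one alignment column."""
    if x == "-":
        return "s-gap"
    if y == "-":
        return "t-gap"
    return "pair"


def split_blocks(a, b):
    """Split an alignment into blocks: single letter pairs and maximal gaps."""
    blocks, k = [], 0
    while k < len(a):
        kd = kind(a[k], b[k])
        end = k + 1
        if kd != "pair":                 # a gap block extends while the kind repeats
            while end < len(a) and kind(a[end], b[end]) == kd:
                end += 1
        blocks.append((a[k:end], b[k:end]))
        k = end
    return blocks


def score_blocks(blocks, w, match=1, mismatch=-1):
    """Sum block scores; w(k) is the penalty charged for a gap of length k."""
    total = 0
    for x, y in blocks:
        if "-" in x or "-" in y:
            total -= w(len(x))
        else:
            total += match if x == y else mismatch
    return total


blocks = split_blocks("AAC---AGATAAC", "ACTCGGT---AAT")
print("|".join(x for x, _ in blocks))
print("|".join(y for _, y in blocks))
```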

Page 13: Class 3: Sequence similarity

DP for general gaps

• Requires three arrays, one for each block type

• Time complexity is cubic

• This is expensive at best, prohibitive for large problems

• See Setubal/Meidanis 3.3.2 for details

Page 14: Class 3: Sequence similarity

Affine gap penalty

• Charge h for opening each gap, plus g per space: a gap of length k costs h + g * k

• This still has quadratic complexity!

• See Setubal/Meidanis
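A minimal sketch of a quadratic-time DP for affine gaps, using one table per way an alignment can end. This is a common textbook formulation and is not necessarily the exact presentation in Setubal/Meidanis. The table names M, X, Y, the function name `affine_score` and the default values of h and g are assumptions; h and g are given as positive penalties.

```python
NEG_INF = float("-inf")


def affine_score(s, t, match=1, mismatch=-1, h=3, g=1):
    """Best global score when a gap of length k costs h + g * k."""
    m, n = len(s), len(t)
    # M: alignment ends with a letter over a letter
    # X: ends with a letter of s over a space
    # Y: ends with a space over a letter of t
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(h + g * i)       # one gap of length i
    for j in range(1, n + 1):
        Y[0][j] = -(h + g * j)       # one gap of length j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - (h + g),   # open a new gap
                          X[i - 1][j] - g,         # extend the current gap
                          Y[i - 1][j] - (h + g))   # open a gap of the other kind
            Y[i][j] = max(M[i][j - 1] - (h + g),
                          Y[i][j - 1] - g,
                          X[i][j - 1] - (h + g))
    return max(M[m][n], X[m][n], Y[m][n])


print(affine_score("ACGT", "AT"))             # one gap of length 2 beats two gaps of length 1
print(affine_score("AGC", "AAAC", h=0, g=2))  # h = 0 reduces to the linear model
```

Each cell does constant work across the three tables, which is why the running time stays O(mn).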

Page 15: Class 3: Sequence similarity

Point accepted mutations

• Some mutations are more likely than others

• In proteins, some amino acids are more similar than others (size, charge, hydrophobicity)

• A point accepted mutation matrix is a table with the probability of each transition in a fixed time

Page 16: Class 3: Sequence similarity

PAM matrices

• Each row of the matrix sums to 1 (every amino acid either stays the same or changes into some other one)

• A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change

Page 17: Class 3: Sequence similarity

Scoring matrix

• Consider aligned letters a,b

• Pr(b is a mutation of a) = Mab

• Pr(b is a random occurrence) = pb

• Score(a,b) = 10 log10(Mab / pb)
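As a small worked example of this log-odds score, assuming the logarithm is base 10: the probabilities used below are made-up illustrative numbers, not values from a real PAM matrix, and `log_odds_score` is a name chosen here.

```python
import math


def log_odds_score(m_ab, p_b):
    """10 * log10 of (probability b arose from a) / (probability b occurs by chance)."""
    return 10 * math.log10(m_ab / p_b)


# Hypothetical numbers: a -> b is ten times likelier than chance.
print(log_odds_score(0.1, 0.01))
# Hypothetical numbers: a -> b is a hundred times less likely than chance.
print(log_odds_score(0.001, 0.1))
```

A positive score means the pair is seen more often than chance would predict, a negative score less often, and a score of zero means no evidence either way.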

Page 18: Class 3: Sequence similarity

Blast

• Basic Local Alignment Search Tool

• Def: ‘segment’ is a contiguous subsequence (a substring, without gaps)

• Def: ‘segment pair’ is two segments of equal length

• Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs

Page 19: Class 3: Sequence similarity

What Blast does

• Input:
– a PAM matrix
– a database of sequences B
– a query sequence A
– a threshold S

• Output:
– all segment pairs (A,B) with score > S

Page 20: Class 3: Sequence similarity

How Blast works

• Compile short, high-scoring strings (words)

• Search for hits -- each hit gives a seed

• Extend seeds

Page 21: Class 3: Sequence similarity

Blast on proteins

• Words are w-mers that score at least T against some w-mer of the query A

• Use hashing or a DFA (deterministic finite automaton) to search for hits

• Extend seed until heuristically determined limit is reached
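Compiling the word list can be sketched by brute force: for each w-mer of the query, try every w-mer over the alphabet and keep those scoring at least T. This enumeration is for illustration only and is not how BLAST itself builds the list. The toy alphabet, the `toy_score` function (standing in for a PAM matrix) and all names are assumptions.

```python
from itertools import product


def toy_score(x, y):
    """Hypothetical stand-in for a PAM entry: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def neighborhood_words(query, w, T, alphabet, score=toy_score):
    """All w-mers scoring at least T against some w-mer of the query."""
    words = set()
    for i in range(len(query) - w + 1):
        q = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(score(x, y) for x, y in zip(cand, q)) >= T:
                words.add("".join(cand))
    return words


print(sorted(neighborhood_words("ACD", w=2, T=4, alphabet="ACDE")))
print(len(neighborhood_words("ACD", w=2, T=1, alphabet="ACDE")))
```

Lowering T admits more words, which raises sensitivity at the cost of more hits to examine.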

Page 22: Class 3: Sequence similarity

Blast on nucleic acids

• Words are w-mers in query A

• Letters are compressed, four to a byte (2 bits per nucleotide)

• Filter database B for very common words to avoid false positives

• Extend seeds as in proteins
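A much simplified seed-and-extend sketch for the nucleic-acid case: it hashes the query's w-mers, scans one database sequence for exact hits, and extends a seed to the right without gaps until the score falls a fixed amount below the best seen. Real BLAST also extends to the left, packs letters into bytes and filters common words; none of that is modelled here. The names `find_seeds` and `extend_right`, the `drop` parameter and the +1/-1 scores are assumptions.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) for every exact w-mer hit."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds


def extend_right(query, subject, i, j, w, match=1, mismatch=-1, drop=2):
    """Extend a seed rightwards; return (best_score, best_length)."""
    score = best = w * match
    length = best_len = w
    while i + length < len(query) and j + length < len(subject):
        score += match if query[i + length] == subject[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:   # heuristic stopping rule
            break
    return best, best_len


query, subject = "ACGTAC", "TTACGTTT"
seeds = find_seeds(query, subject, w=3)
print(seeds)
print(extend_right(query, subject, 0, 2, w=3))
```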

Page 23: Class 3: Sequence similarity

What does Blast give you?

• Efficiency

• A rigorous statistical theory which gives the probability of a segment pair occurring by chance

Page 24: Class 3: Sequence similarity

Homework

• Given sequences s,t of length m,n, how many alignments do they have?

• Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.