8/31/07 BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats



BCB 444/544

Lecture 6

Finish Dynamic Programming

Scoring Matrices

Alignment Statistics

#6_Aug31


Required Reading (before lecture)

Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41

Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
  http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html

Thurs Aug 30 - Lab #2: Databases, ISU Resources & Pairwise Sequence Alignment

Fri Aug 31 - for Lecture #6: Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49


Announcements

Fri Aug 31 - Revised notes for Lecture 5 posted online
Changes? Mainly re-ordering, symbols, color "coding"

Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!

Tues Sept 4 - Lab #2 Exercise Writeup Due by 5 PM (or sooner!)

Send via email to Pete Zaback [email protected]

(HW#2 assignment will be posted online)

Fri Sept 14 - HW#2 Due by 5 PM (or sooner!)

Fri Sept 21 - Exam #1


Chp 3- Sequence Alignment

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 3

Pairwise Sequence Alignment

• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.


Methods

• √ Global and Local Alignment
• √ Alignment Algorithms
• √ Dot Matrix Method
• Dynamic Programming Method - cont
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment


Sequence Homology vs Similarity

• Homologous sequences - sequences that share a common evolutionary ancestry

• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)

IMPORTANT:

• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative

• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages


Goal of Sequence Alignment

Find the best pairing of 2 sequences, such that there is maximum correspondence between residues

• DNA: 4-letter alphabet (+ gap)

TTGACAC
TTTACAC

• Proteins: 20-letter alphabet (+ gap)

RKVA-GMA
RKIAVAMA


Statement of Problem

Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences

Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score


Avoiding Random Alignments with a Scoring Function

• Introducing too many gaps generates nonsense alignments:

s--e-----qu---en--ce
sometimesquipsentice

• Need to distinguish between alignments that occur due to homology and those that occur by chance

• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)

Scoring Function (S): e.g. Match: 1 Mismatch: 1 Gap: 0

S = (#matches) - (#mismatches) - (#gaps)

Note: I changed symbols & colors on this slide!


Not All Mismatches are the Same

• Some amino acids are more "exchangeable" than others (physicochemical properties are similar)

e.g., Ser & Thr are more similar than Trp & Ala

• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions

• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)


Substitution Matrix

s(a,b) corresponds to score of aligning character a with character b

Match scores are often calculated based on frequency of mutations in very similar sequences

(more details later)
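As a rough illustration of s(a,b), a substitution matrix can be sketched as a symmetric lookup table. The scores below are illustrative assumptions and are not taken from the lecture; the names `TOY_SCORES` and `s` are hypothetical.

```python
# Illustrative scores only (an assumption, not values from the lecture).
# Each unordered residue pair is stored once; the lookup is symmetric.
TOY_SCORES = {
    ("S", "S"): 4, ("T", "T"): 5, ("W", "W"): 11, ("A", "A"): 4,
    ("S", "T"): 1, ("A", "S"): 1, ("A", "T"): 0,
    ("A", "W"): -3, ("S", "W"): -3, ("T", "W"): -2,
}


def s(a, b, scores=TOY_SCORES):
    """Return the score for aligning residue a with residue b."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]  # s(a, b) == s(b, a)


print(s("S", "T"))  # Ser/Thr: similar residues, higher score
print(s("W", "A"))  # Trp/Ala: dissimilar residues, lower score
```

With a table like this, the more "exchangeable" pair (Ser, Thr) scores higher than the less exchangeable pair (Trp, Ala).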


Global vs Local Alignment

Local alignment

• Finds local regions with highest similarity between 2 sequences

• Aligns these without regard for rest of sequence

• Sequences are not assumed to be similar over entire length

Global alignment

• Finds best possible alignment across entire length of 2 sequences

• Aligned sequences assumed to be generally similar over entire length


Global vs Local Alignment - example

1 = CTGTCGCTGCACG
2 = TGCCGTG

CTGTCGCTGCACG
-TGCCG-T----G

Global alignment

CTGTCGCTGCACG
-TG-C-C-G--TG

CTGTCGCTGCACG
-TGCCG-TG----

Local alignment

Which is better?


Global vs Local Alignment: Which should be used when?

It is critical to choose the correct method!

Global Alignment vs Local Alignment?

Shout out the answers!! Which should we use for:

1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths?

Hmmm - we'll work on that

Excellent!


Alignment Algorithms

3 major methods for pairwise sequence alignment:

1. Dot matrix analysis

2. Dynamic programming

3. Word or k-tuple methods (later, in Chp 4)


Dot Matrix Method (Dot Plots)

• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)

[Figure: dot plot with ACACG along the top row and ACCGG down the left column]
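The dot matrix method described above can be sketched in a few lines. This is a minimal sketch assuming simple identity matching (no window or threshold); the function names `dot_plot` and `render` are hypothetical, and the two sequences are read off the slide's figure.

```python
def dot_plot(top, left):
    """Return a matrix with 1 wherever a character of `left` (rows)
    matches a character of `top` (columns), else 0."""
    return [[1 if row_char == col_char else 0 for col_char in top]
            for row_char in left]


def render(top, left):
    """Format the dot plot as text: '*' marks a match, '.' a non-match."""
    lines = ["  " + " ".join(top)]
    for row_char, row in zip(left, dot_plot(top, left)):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(row_char + " " + cells)
    return "\n".join(lines)


print(render("ACACG", "ACCGG"))
```

Runs of `*` along a diagonal correspond to the matching regions the slide describes; for proteins, the identity test would be replaced by a substitution-matrix score and a threshold.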


Interpretation of Dot Plots

When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions

• What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
  • e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes

Exploring Dot Plots


Dynamic Programming

For Pairwise sequence alignment

Idea: Display one sequence above another with spaces inserted in both to reveal similarity

C A T - T C A - C
|   |     | |   |
C - T C G C A G C


Global Alignment: Scoring

CTGTCG-CTGCACG
-TGC-CG-TG----

Reward for matches:
Mismatch penalty:
Space/gap penalty:

Score = w - x - y

w = #matches
x = #mismatches
y = #spaces

Note: I changed symbols & colors on this slide!


Global Alignment: Scoring

 C   T   G   T   C   G   -   C   T   G   C
 -   T   G   C   -   C   G   -   T   G   -
-5  10  10  -2  -5  -2  -5  -5  10  10  -5

Total = 11

Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5

We could have done better!!

Note: I changed symbols & colors on this slide!
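The column-by-column scoring on this slide can be checked with a short script. This is a minimal sketch; the function name `score_alignment` and its parameter names are assumptions, while the default values (+10, -2, -5) are the ones given on the slide.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, gap=-5):
    """Sum the per-column scores of a gapped pairwise alignment.

    `top` and `bottom` are the two aligned rows, with '-' marking a space.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a space in either row
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 11
```

The alignment has 4 matches, 2 mismatches and 5 spaces, so the total is 4(10) + 2(-2) + 5(-5) = 11, as on the slide.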


Alignment Algorithms

• Global: Needleman-Wunsch
• Local: Smith-Waterman

• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices


Dynamic Programming - Key Idea:

The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:

the score of best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)

PLUS

the score for aligning x_i and y_j


Global Alignment: DP Problem Formulation & Notations

Given two sequences (strings):
• X = x1x2…xN of length N (e.g., x = AGC, N = 3)
• Y = y1y2…yM of length M (e.g., y = AAAC, M = 4)

Construct a matrix with (N+1) x (M+1) elements, where

S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj

Which means: Score of best alignment of a prefix of X and a prefix of Y

Example: S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)

[Figure: matrix with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]


Dynamic Programming - 4 Steps:

1. Define score of optimum alignment, using recursion

2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)

3. Calculate score of optimum alignment(s)

4. Trace back through matrix to recover optimum alignment(s) that generated optimal score


1- Define Score of Optimum Alignment using Recursion

Define:

x_{1..i} = Prefix of length i of x
y_{1..j} = Prefix of length j of y
S(i, j) = Score of optimum alignment of x_{1..i} and y_{1..j}

Initial conditions:

\[ S(i,0) = -i \cdot \gamma, \qquad S(0,j) = -j \cdot \gamma \]

Recursive definition: for 1 ≤ i ≤ N, 1 ≤ j ≤ M:

\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma
\end{cases}
\]


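2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems

• Construct sequence vs sequence matrix:

Initialization:

\[ S(0,0) = 0, \qquad S(i,0) = -i \cdot \gamma, \qquad S(0,j) = -j \cdot \gamma \]

Recursion:

\[
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + \sigma(x_i, y_j) \\
S(i-1,\,j) - \gamma \\
S(i,\,j-1) - \gamma
\end{cases}
\]

[Figure: DP matrix with axes running 0, 1, ..., N and 0, 1, ..., M, from S(0,0) = 0 to S(N,M); cell S(i,j) is computed from its neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1)]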


2- cont Fill in DP Matrix

[Figure: DP matrix from S(0,0) = 0 to S(N,M); cell S(i,j) depends on S(i-1,j-1), S(i-1,j) and S(i,j-1)]

• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).
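The initialization and row-by-row fill can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `nw_fill` is hypothetical, `S[i][j]` stores the score for the first i characters of `x` against the first j characters of `y` (so `x` runs down the rows), σ is a simple match/mismatch score, and γ is a positive linear gap penalty. The default values are the +10 / -2 / -5 scheme used in the lecture's worked example; the pointer matrix is omitted.

```python
def nw_fill(x, y, match=10, mismatch=-2, gap=5):
    """Fill the global-alignment DP matrix S.

    S[i][j] is the best score for aligning x[:i] with y[:j];
    `gap` is the (positive) penalty gamma subtracted per space.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]

    # Initial conditions: S(i,0) = -i*gamma and S(0,j) = -j*gamma.
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap

    # Recursion, filled row by row so each cell's three neighbors exist.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                          S[i - 1][j] - gap,        # x_i aligned to a space
                          S[i][j - 1] - gap)        # y_j aligned to a space
    return S


S = nw_fill("CATTCAC", "CTCGCAGC")
print(S[-1][-1])  # 33, the score of the optimum global alignment
```

The two nested loops visit each of the (N+1) x (M+1) cells once, which is where the O(mn) running time comes from.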


3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment

What happens in last step in alignment of x[1..i] to y[1..j]?

1 of 3 cases applies:

Case 1: xi aligns to yj
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj-1 yj
  Score: S(i-1,j-1) + σ(xi,yj)

Case 2: xi aligns to a gap
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj   -
  Score: S(i-1,j) - γ

Case 3: yj aligns to a gap
  x1 x2 . . . xi   -
  y1 y2 . . . yj-1 yj
  Score: S(i,j-1) - γ


Example

Case 1: Line up xi with yj
  x: C A T T C A C
  y: C - T T C A G

Case 2: Line up xi with space
  x: C A T T C A - C
  y: C - T T C A G -

Case 3: Line up yj with space
  x: C A T T C A C -
  y: C - T T C A - G


Fill in the matrix

+10 for match, -2 for mismatch, -5 for space

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5
   A  -10
   T  -15
   T  -20
   C  -25
   A  -30
   C  -35
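As a worked check of the first two filled cells (10 and 5), the recursion can be applied by hand. The indexing here is an assumption: i counts rows (sequence CATTCAC) and j counts columns (sequence CTCGCAGC), with γ = 5 and σ = +10 for a match, -2 for a mismatch.

\[
S(1,1) = \max
\begin{cases}
S(0,0) + \sigma(\mathrm{C},\mathrm{C}) = 0 + 10 = 10 \\
S(0,1) - \gamma = -5 - 5 = -10 \\
S(1,0) - \gamma = -5 - 5 = -10
\end{cases}
= 10
\]

\[
S(1,2) = \max
\begin{cases}
S(0,1) + \sigma(\mathrm{C},\mathrm{T}) = -5 - 2 = -7 \\
S(0,2) - \gamma = -10 - 5 = -15 \\
S(1,1) - \gamma = 10 - 5 = 5
\end{cases}
= 5
\]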


Calculate score of optimum alignment

+10 for match, -2 for mismatch, -5 for space

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33


4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score

How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix

Result? Optimal alignment(s) of sequences


Traceback - for Global Alignment

Start in lower right corner & trace back to upper left

Each arrow introduces one character at end of sequence alignment:

• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence


Traceback to Recover Alignment

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33

Can have >1 optimum alignment; this example has 2
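The traceback step can be sketched by re-deriving, at each cell, which of the three cases produced its score, instead of storing a pointer matrix. This is a sketch under assumptions: the function name `needleman_wunsch` is hypothetical, `x` is the left (row) sequence and `y` the top (column) sequence, scoring is the +10 / -2 / -5 scheme of the example, and ties are broken in the order diagonal, vertical, horizontal, so only one of the optimum alignments is returned.

```python
def needleman_wunsch(x, y, match=10, mismatch=-2, gap=5):
    """Global alignment: return (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    # Fill the DP matrix (same recursion as in the lecture).
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)

    # Trace back from the lower right corner to the upper left.
    x_aln, y_aln = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:
                # Diagonal move: one character from each sequence.
                x_aln.append(x[i - 1])
                y_aln.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - gap:
            # Vertical move: gap in the top sequence (y).
            x_aln.append(x[i - 1])
            y_aln.append("-")
            i -= 1
        else:
            # Horizontal move: gap in the left sequence (x).
            x_aln.append("-")
            y_aln.append(y[j - 1])
            j -= 1

    return S[n][m], "".join(reversed(x_aln)), "".join(reversed(y_aln))


score, top, bottom = needleman_wunsch("CATTCAC", "CTCGCAGC")
print(score)   # 33
print(top)     # CAT-TCA-C
print(bottom)  # C-TCGCAGC
```

The second optimum alignment of this example (CATT-CA-C over C-TCGCAGC) branches off at the cell holding 8 in the fourth T row, where the diagonal and horizontal moves tie; a version that returns every optimum alignment would have to follow both branches.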


Local Alignment: Motivation

• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons

• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"

Non-coding - "not encoding protein"

Exons - "protein-encoding" parts of genes, vs
Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein


Local Alignment: Example

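Match: +2   Mismatch or space: -1

Sequences:

g g t c t g a g
a a a c g a

Best local alignment:

g g t c t g a g
a a a c - g a -

Score = 5

The best local alignment is the region c t g a over c - g a: three matches and one space give 2 - 1 + 2 + 2 = 5.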


Local Alignment: Algorithm

• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"

Recall: for Global Alignment,

• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalties


Traceback - for Local Alignment

+1 for a match, -1 for a mismatch, -5 for a space

       λ   C   T   C   G   C   A   G   C
   λ   0   0   0   0   0   0   0   0   0
   C   0   1   0   1   0   1   0   0   1
   A   0   0   0   0   0   0   2   0   0
   T   0   0   1   0   0   0   0   1   0
   T   0   0   1   0   0   0   0   0   0
   C   0   1   0   2   0   1   0   0   1
   A   0   0   0   0   1   0   2   0   0
   C   0   1   0   1   0   2   0   1   1
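A local alignment along these lines can be sketched as a small variation on the global algorithm. The sketch assumes the standard Smith-Waterman formulation: the top row and leftmost column are 0, every cell is the maximum of 0 and the three usual cases, and traceback starts at the highest-scoring cell and stops at the first 0. The function name `smith_waterman` is hypothetical; the default scores (+2 match, -1 mismatch or space) are those of the "Local Alignment: Example" slide.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Local alignment: return (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # top row, left column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                        # start a new local alignment
                          H[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                          H[i - 1][j] - gap,        # x_i aligned to a space
                          H[i][j - 1] - gap)        # y_j aligned to a space
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    x_aln, y_aln = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sigma = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sigma:
            x_aln.append(x[i - 1])
            y_aln.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            x_aln.append(x[i - 1])
            y_aln.append("-")
            i -= 1
        else:
            x_aln.append("-")
            y_aln.append(y[j - 1])
            j -= 1

    return best, "".join(reversed(x_aln)), "".join(reversed(y_aln))


print(smith_waterman("ggtctgag", "aaacga"))  # (5, 'ctga', 'c-ga')
```

Calling `smith_waterman("CATTCAC", "CTCGCAGC", match=1, mismatch=-1, gap=5)` uses the scoring of the matrix above and returns a best score of 2, the largest entry in that matrix.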


Some Results re: Alignment Algorithms

(for ComS, CprE & Math types!)

• Most pairwise sequence alignment problems can be solved in O(mn) time

• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]

• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]
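The space saving is easiest to see for the score alone: each row of the DP matrix depends only on the row before it, so two rows suffice. The sketch below computes only the optimum global score in linear space; recovering the alignment itself in linear space needs the divide-and-conquer approach cited as [Myers88], which is not shown here. The function name `nw_score_linear_space` is hypothetical and the default scores are those of the lecture's global example.

```python
def nw_score_linear_space(x, y, match=10, mismatch=-2, gap=5):
    """Return the optimum global alignment score while keeping only two rows."""
    if len(y) > len(x):
        x, y = y, x  # scoring is symmetric; keep the shorter sequence per row

    prev = [-j * gap for j in range(len(y) + 1)]  # row 0: S(0,j) = -j*gamma
    for i in range(1, len(x) + 1):
        curr = [-i * gap] + [0] * len(y)          # S(i,0) = -i*gamma
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sigma,    # diagonal
                          prev[j] - gap,          # from the row above
                          curr[j - 1] - gap)      # from the left
        prev = curr                               # discard the older row
    return prev[-1]


print(nw_score_linear_space("CATTCAC", "CTCGCAGC"))  # 33
```

The running time is still proportional to the product of the two lengths; only the memory drops from a full matrix to two rows.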


"Scoring" or "Substitution" Matrices

2 Major types for Amino Acids: PAM & BLOSUM

PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in alignments of closely related proteins

BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins


PAM Matrix

PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in closely related proteins

• Model includes defined rate for each type of sequence change

• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed

• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)


BLOSUM Matrix

BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated

• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences


Statistical Significance of Sequence Alignment


Affine Gap Penalty Functions

Gap penalty = h + gk

where

k = length of gap
h = gap opening penalty
g = gap extension penalty

Can also be solved in O(nm) time using dynamic programming
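The affine penalty itself is simple to compute for an alignment that is already gapped. The sketch below applies h + gk to every run of spaces in one aligned row; the function names are hypothetical, and the values h = 10 and g = 1 in the usage line are illustrative assumptions, not values from the lecture. Finding the optimum alignment under affine penalties needs an extended DP recursion, which is not shown here.

```python
from itertools import groupby


def affine_gap_penalty(k, h, g):
    """Penalty for a single gap of length k: h + g*k."""
    return h + g * k


def total_gap_penalty(aligned_row, h, g):
    """Sum the affine penalties over every run of '-' in one aligned row."""
    return sum(affine_gap_penalty(len(list(run)), h, g)
               for is_gap, run in groupby(aligned_row, key=lambda c: c == "-")
               if is_gap)


# Gaps of lengths 1, 1, 1 and 4, with assumed h = 10 and g = 1:
print(total_gap_penalty("-TGC-CG-TG----", h=10, g=1))  # 11 + 11 + 11 + 14 = 47
```

Because the opening penalty h is paid once per gap, one long gap costs less than the same number of spaces scattered as separate gaps.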