Page 1: BCB 444/544

9/7/07 BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST

BCB 444/544

Lecture 8

Finish: Dynamic Programming Global vs Local Alignment

Scoring Matrices & Alignment Statistics

BLAST

#8_Sept7

Page 2: BCB 444/544

Required Reading (before lecture)

√ Last week - for Lectures 4-7: Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
• Xiong: Chp 3
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909 http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html

√ Wed Sept 5 - for Lecture 7 & Lab 3: Database Similarity Searching: BLAST (nope, more DP)
• Chp 4 - pp 51-62

Fri Sept 7 - for Lecture 8 (will finish on Monday): BLAST variations; BLAST vs FASTA
• Chp 4 - pp 51-62

Page 3: BCB 444/544

Assignments & Announcements

√ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM. Send via email to Pete Zaback [email protected] (For now, no late penalty - just send ASAP)

√ Wed Sept 5 - Notes for Lecture 5 posted online - HW#2 posted online & sent via email & handed out in class

Fri Sept 14 - HW#2 Due by 5 PM

Fri Sept 21 - Exam #1

Page 4: BCB 444/544

Chp 3 - Sequence Alignment

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 3 Pairwise Sequence Alignment

• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Page 5: BCB 444/544

Methods

• √ Global and Local Alignment
• √ Alignment Algorithms
  • √ Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Page 6: BCB 444/544

Dynamic Programming - 4 Steps:

1. Define score of optimal alignment, using recursion

2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)

3. Calculate score of optimal alignment(s)

4. Trace back through matrix to recover optimal alignment(s) that generated optimal score

Page 7: BCB 444/544

1- Define Score of Optimal Alignment using Recursion

Define:

x_{1..i} = Prefix of length i of x
y_{1..j} = Prefix of length j of y
S(i, j) = Score of optimal alignment of x_{1..i} and y_{1..j}
σ(x_i, y_j) = Match Reward (if x_i = y_j) or Mismatch Penalty (if x_i ≠ y_j)
γ = Gap penalty

Initial conditions:

$$S(i,0) = -i \cdot \gamma \qquad S(0,j) = -j \cdot \gamma$$

Recursive definition: For 1 ≤ i ≤ N, 1 ≤ j ≤ M:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

Page 8: BCB 444/544

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems

• Construct sequence vs sequence matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment ending at residues at [i,j]

[Figure: DP matrix with axes 0..N and 0..M, showing S(0,0)=0 in the upper left corner, a generic cell S(i,j), and S(N,M) in the lower right corner]

Page 9: BCB 444/544

How do we calculate S(i,j)? i.e., Score for alignment of x[1..i] to y[1..j]?

1 of 3 cases gives the optimal score for this subproblem:

Case 1: xi aligns to yj
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj-1 yj
  Score: S(i-1,j-1) + σ(xi,yj)

Case 2: xi aligns to a gap
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj   -
  Score: S(i-1,j) - γ

Case 3: yj aligns to a gap
  x1 x2 . . . xi   -
  y1 y2 . . . yj-1 yj
  Score: S(i,j-1) - γ

Page 10: BCB 444/544

Specific Example:

Case 1: Line up xi with yj
  x: C - T C G C A
  y: C A T - T C A
  Scoring Consequence? Match Bonus

Case 2: Line up xi with space
  x: C - T C G C - A
  y: C A T - T C A -
  Scoring Consequence? Space Penalty

Case 3: Line up yj with space
  x: C - T C G C A -
  y: C A T - T C - A
  Scoring Consequence? Space Penalty

Note: I changed sequences on this slide (to match the rest of DP example)

Page 11: BCB 444/544

Ready? Fill in DP Matrix

Initialization:

$$S(0,0) = 0 \qquad S(i,0) = -i \cdot \gamma \qquad S(0,j) = -j \cdot \gamma$$

Recursion:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

σ(xi,yj) = Match Reward or Mismatch Penalty
γ = Gap penalty

[Figure: DP matrix from S(0,0)=0 to S(N,M); cell S(i,j) is reached from S(i-1,j-1) by adding σ(xi,yj), or from S(i-1,j) or S(i,j-1) by subtracting γ]

Keep track of dependencies of scores (in a pointer matrix)

Page 12: BCB 444/544

Fill in the DP matrix !!

+10 for match, -2 for mismatch, -5 for space

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5
   A  -10
   T  -15
   T  -20
   C  -25
   A  -30
   C  -35

Page 13: BCB 444/544

3- Calculate Score S(N,M) of Optimal Alignment - for Global Alignment

+10 for match, -2 for mismatch, -5 for space

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
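The matrix on this slide can be reproduced with a short script. This is a minimal sketch, assuming the top sequence (CTCGCAGC) indexes the columns and the left sequence (CATTCAC) indexes the rows; the function name `global_dp_matrix` and its parameter names are illustrative and not from the lecture. The gap penalty is passed as a negative number and added, which is equivalent to subtracting γ in the recursion.

```python
def global_dp_matrix(top, left, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment DP matrix (rows = left sequence, columns = top sequence)."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    # Initialization: first row and first column hold accumulated gap penalties.
    for j in range(1, cols):
        S[0][j] = j * gap
    for i in range(1, rows):
        S[i][0] = i * gap
    # Recursion: each cell is the best of the diagonal, vertical and horizontal moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    return S


S = global_dp_matrix("CTCGCAGC", "CATTCAC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
print("Optimal global score:", S[-1][-1])
```

The value in the lower right cell is the optimal global alignment score for the two full sequences.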

Page 14: BCB 444/544

4- Trace back through matrix to recover optimal alignment(s) that generated the optimal score

How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix

Result? Optimal alignment(s) of sequences

Page 15: BCB 444/544

Traceback - for Global Alignment

Start in lower right corner & trace back to upper left

Each arrow introduces one character at end of alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
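The traceback rules above can be sketched in code. This is an illustrative sketch, not the lecture's implementation: it refills the matrix with the example scoring (+10 match, -2 mismatch, -5 space), then walks back from the lower right corner and follows every move that could have produced each score, so it returns all optimal alignments. The name `global_alignments` is hypothetical.

```python
def global_alignments(top, left, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of all optimal alignments as (top_row, left_row))."""
    n, m = len(top), len(left)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i, j, top_row, left_row):
        # Reached the upper left corner: one complete alignment recovered.
        if i == 0 and j == 0:
            results.append((top_row, left_row))
            return
        if i > 0 and j > 0:
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:   # diagonal: one character from each
                walk(i - 1, j - 1, top[j - 1] + top_row, left[i - 1] + left_row)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:   # horizontal: gap in left sequence
            walk(i, j - 1, top[j - 1] + top_row, "-" + left_row)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:   # vertical: gap in top sequence
            walk(i - 1, j, "-" + top_row, left[i - 1] + left_row)

    walk(m, n, "", "")
    return S[m][n], results


score, alignments = global_alignments("CTCGCAGC", "CATTCAC")
print("Score:", score)
for top_row, left_row in alignments:
    print(top_row)
    print(left_row)
    print()
```

Following every tied move is what makes it possible to recover more than one optimal alignment; stopping at the first valid move would return only one of them.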

Page 16: BCB 444/544

Traceback to Recover Alignment

[Figure: the filled DP matrix from Page 13 (top: C T C G C A G C; left: C A T T C A C), with traceback arrows from the lower right corner to the upper left corner]

Can have >1 optimal alignment; this example has 2

Page 17: BCB 444/544

Traceback to Recover Alignment

[Figure: the filled DP matrix from Page 13 (top: C T C G C A G C; left: C A T T C A C), with the traceback arrows drawn in red]

Where did red arrows come from?

Page 18: BCB 444/544

Traceback to Recover Alignment

+10 for match, -2 for mismatch, -5 for space

[Figure: the filled DP matrix from Page 13 (top: C T C G C A G C; left: C A T T C A C), with traceback arrows]

• Where did 33 come from? Match = 10, so 33-10= 23. Must have come from diagonal
• Where did 23 come from? (Not a match)
  Left? 28-5= 23; Diag? 13-2= 11; Top? 8-5= 3
  Only the left cell gives 23, so it came from the left

Page 19: BCB 444/544

Traceback to Recover Alignment

+10 for match, -2 for mismatch, -5 for space

[Figure: the filled DP matrix from Page 13 (top: C T C G C A G C; left: C A T T C A C), with traceback arrows]

• Where did 8 come from? Two possibilities: 13-5= 8 or 10-2= 8
• Then, follow both paths

Page 20: BCB 444/544

Traceback to Recover Alignment

Great - but what are the alignments? #1

[Figure: the filled DP matrix from Page 13 with the first traceback path; each move is labeled "top character with left character"]

Reading the path from upper left to lower right:
• C with C
• - with A
• T with T
• C with -
• G with T
• C with C
• A with A
• G with -
• C with C

Page 21: BCB 444/544

Traceback to Recover Alignment

Great - but what are the alignments? #2

[Figure: the filled DP matrix from Page 13 with the second traceback path; each move is labeled "top character with left character"]

Reading the path from upper left to lower right:
• C with C
• - with A
• T with T
• C with T
• G with -
• C with C
• A with A
• G with -
• C with C

Page 22: BCB 444/544

What are the 2 Global Alignments with Optimal Score = 33?

Top:  C T C G C A G C
Left: C A T T C A C

1: C - T C G C A G C
   (fill in the left sequence)

2: C - T C G C A G C
   (fill in the left sequence)

• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Page 23: BCB 444/544

What are the 2 Global Alignments with Optimal Score = 33?

Top:  C T C G C A G C
Left: C A T T C A C

1: C - T C G C A G C
   C A T T - C A - C

2: C - T C G C A G C
   C A T - T C A - C

Check the scores: +10 for match, -2 for mismatch, -5 for space
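The score check can be done by summing the column scores of each alignment. A minimal sketch, assuming the alignments are written as two equal-length strings with "-" for a space; the helper name `alignment_score` is illustrative.

```python
def alignment_score(row_a, row_b, match=10, mismatch=-2, space=-5):
    """Sum the column scores of a gapped alignment given as two equal-length strings."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += space       # one character lined up with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


score_1 = alignment_score("C-TCGCAGC", "CATT-CA-C")
score_2 = alignment_score("C-TCGCAGC", "CAT-TCA-C")
print(score_1, score_2)
```

Each alignment has 6 matches, 1 mismatch and 3 spaces: 6(10) + 1(-2) + 3(-5) = 33.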

Page 24: BCB 444/544

or, Check Traceback?

[Figure: the filled DP matrix from Page 13 with the two traceback paths (1 and 2), each move labeled d, h or v]

• h = horizontal move puts a gap in left sequence
• v = vertical move puts a gap in top sequence
• d = diagonal move uses one character from each sequence

Page 25: BCB 444/544

Local Alignment: Motivation

• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"

Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
  vs
Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Page 26: BCB 444/544

Local Alignment: Example

Match: +2   Mismatch or space: -1

Sequences:
  G G T C T G A G
  A A A C G A

Best local alignment:
  G G T C T G A G
  A A A C - G A -

Score = 5

Page 27: BCB 444/544

Local Alignment: Algorithm

1) Initialize top row & leftmost column of matrix with "0"

2) Fill in DP matrix: in local alignment, no negative scores. Assign "0" to cells with negative scores

3) Optimal score? In highest scoring cell(s)

4) Optimal alignment(s)? Traceback from each cell containing the optimal score, until a cell with "0" is reached (not just from lower right corner)

This slide has been changed!

Page 28: BCB 444/544

Local Alignment DP: Initialization & Recursion

Initialization:

$$S(0,0) = 0 \qquad S(i,0) = 0 \qquad S(0,j) = 0$$

Recursion:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \\ 0 \end{cases}$$

New Slide
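The local alignment recursion can be sketched as follows. This is a hedged illustration rather than the lecture's own code: the function name `local_alignment` is hypothetical, the default scoring (+1 match, -1 mismatch, -5 space) is the one used in the lecture's local alignment example, and the traceback reports only one alignment per highest-scoring cell (it does not branch on ties).

```python
def local_alignment(top, left, match=1, mismatch=-1, gap=-5):
    """Return (best score, cells holding it, one local alignment per such cell)."""
    n, m = len(top), len(left)
    # Top row and leftmost column stay at 0.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            # The extra 0 replaces any negative score.
            S[i][j] = max(0, S[i - 1][j - 1] + sigma, S[i - 1][j] + gap, S[i][j - 1] + gap)
    best = max(max(row) for row in S)
    cells = [(i, j) for i in range(m + 1) for j in range(n + 1)
             if best > 0 and S[i][j] == best]

    def traceback(i, j):
        top_row, left_row = "", ""
        while S[i][j] > 0:                       # stop when a cell with 0 is reached
            sigma = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:
                top_row, left_row = top[j - 1] + top_row, left[i - 1] + left_row
                i, j = i - 1, j - 1
            elif S[i][j] == S[i][j - 1] + gap:
                top_row, left_row = top[j - 1] + top_row, "-" + left_row
                j -= 1
            else:
                top_row, left_row = "-" + top_row, left[i - 1] + left_row
                i -= 1
        return top_row, left_row

    return best, cells, [traceback(i, j) for i, j in cells]


best, cells, found = local_alignment("CTCGCAGC", "CATTCAC")
print("Best local score:", best)
for (i, j), (top_row, left_row) in zip(cells, found):
    print(f"ends at row {i}, column {j}: {top_row} / {left_row}")
```

Unlike the global case, the traceback starts at every cell that holds the best score, not at the lower right corner.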

Page 29: BCB 444/544

Filling in DP Matrix for Local Alignment
No negative scores - fill in "0"

+1 for match, -1 for mismatch, -5 for space

       λ   C   T   C   G   C   A   G   C
   λ   0   0   0   0   0   0   0   0   0
   C   0   1   0   1   0   1   0   0   1
   A   0   0   0   0   0   0   2   0   0
   T   0   0   1   0   0   0   0   1   0
   T   0   0   1   0   0   0   0   0   0
   C   0   1   0   2   0   1   0   0   1
   A   0   0   0   0   1   0   2   0   0
   C   0   1   0   1   0   2   0   1   1

Page 30: BCB 444/544

Traceback - for Local Alignment

+1 for match, -1 for mismatch, -5 for space

[Figure: the local alignment DP matrix from Page 29 (top: C T C G C A G C; left: C A T T C A C), with the four traceback paths that start at cells scoring 2, numbered 1 to 4]

Page 31: BCB 444/544

What are the 4 Local Alignments with Optimal Score = 2?

Top:  C T C G C A G C
Left: C A T T C A C

1: C T C G C A G C
   (fill in the left sequence)

2: C T C G C A G C
   (fill in the left sequence)

3: C T C G C A G C
   (fill in the left sequence)

4: C T C G C A G C
   (fill in the left sequence)

Page 32: BCB 444/544

What are the 4 Local Alignments with Optimal Score = 2?

Top:  C T C G C A G C
Left: C A T T C A C

1:     C T C G C A G C
       - - - - C A T T

2:     C T C G C A G C
   C A T T C A C

3:     C T C G C A G C
           T T C A C

4:     C T C G C A G C
       T T C A C

Check the scores: +1 for match, -1 for mismatch, -5 for space

Page 33: BCB 444/544

Some Results re: Alignment Algorithms

(for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to solve a problem
• expressed as a function of size of the problem

Page 34: BCB 444/544

Affine Gap Penalty Functions

Affine Gap Penalties = Differential Gap Penalties
used to reflect cost differences between opening a gap and extending an existing gap

Total Gap Penalty is linear function of gap length:

W = (gap opening penalty) + (gap extension penalty) x (k - 1)
where k = length of gap

Sometimes, a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty

Can also be solved in O(nm) time using DP
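A small worked example of the total gap penalty formula. The opening and extension values used here (10 and 1) are arbitrary illustrations, not values from the lecture, and the function name is hypothetical.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Total penalty W for a gap of length k: one opening plus (k - 1) extensions."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (k - 1)


# Illustrative values only: opening = 10, extension = 1.
for k in (1, 2, 5):
    print(k, affine_gap_penalty(k, gap_open=10, gap_extend=1))
```

With these values a single gap of length 5 costs 14, while five separate gaps of length 1 would cost 50, which is the cost difference the affine scheme is meant to capture.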

Page 35: BCB 444/544

Methods

• √ Global and Local Alignment
• √ Alignment Algorithms
  • √ Dot Matrix Method
  • √ Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Page 37: BCB 444/544

PAM Matrix

PAM = Point Accepted Mutation

relies on "evolutionary model" based on observed differences in closely related proteins

• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  • PAM1 - for less divergent sequences (shorter time)
  • PAM250 - for more divergent sequences (longer time)

Page 38: BCB 444/544

BLOSUM Matrix

BLOSUM = BLOck SUbstitution Matrix

based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

Page 39: BCB 444/544

PAM250 vs BLOSUM62

See Text: Fig 3.5 = PAM250, Fig 3.6 = BLOSUM62

Usually only 1/2 of matrix is displayed (it is symmetric)

Here: s(a,b) corresponds to score of aligning character a with character b

Page 40: BCB 444/544

Which is Better? PAM or BLOSUM

• PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees - but, not very good for highly divergent sequences

• BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment

Page 41: BCB 444/544

Which Type of Matrix Should You Use?

Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
  • very robust in tree construction
• "Best" matrix depends on task:
  • different matrices for different applications

ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result

Page 42: BCB 444/544

Sequence Alignment Statistics

• Distribution of similarity scores in sequence alignment is not a simple "normal" distribution

• "Gumbel extreme value distribution" - a highly skewed distribution with a long tail

Page 43: BCB 444/544

How to Assess Statistical Significance of an Alignment?

• Compare score of an alignment with distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
• If score is in extreme margin, then unlikely due to random chance
• P-value = probability of obtaining a score at least as high as the original alignment's by random chance (lower P is better)

P = 10^-5 - 10^-50: sequences have clear homology
P > 10^-1: no better than random

Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
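The shuffling idea can be illustrated with a small empirical estimate. This is a sketch under stated assumptions, not a description of how PRSS itself is implemented: it uses a simple local alignment score (+1 match, -1 mismatch, -5 space), shuffles the second sequence a fixed number of times, and counts how often a shuffled score reaches the observed one. The function names, the example sequence and the number of shuffles are all illustrative.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-5):
    """Best local alignment score, keeping only the previous DP row in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            sigma = match if ca == cb else mismatch
            cur.append(max(0, prev[j - 1] + sigma, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0):
    """Return (observed score, empirical P) from scores against shuffled copies of b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_shuffles + 1)


seq = "ACDEFGHIKLMNPQRSTVWY"  # illustrative sequence, not from the lecture
observed, p = shuffle_p_value(seq, seq)
print("observed score:", observed, "empirical P:", p)
```

An empirical estimate like this cannot go below 1 / (n_shuffles + 1), so very small P-values such as those quoted above need far more shuffles or a fitted score distribution.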

Page 44: BCB 444/544

Chp 4 - Database Similarity Searching

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 4 Database Similarity Searching

• Unique Requirements of Database Searching
• Heuristic Database Searching
• Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Page 45: BCB 444/544

Exhaustive vs Heuristic Methods

Exhaustive - tests every possible solution
• guaranteed to give best answer (identifies optimal solution)
• can be very time/space intensive!
• e.g., Dynamic Programming as in Smith-Waterman algorithm

Heuristic - does NOT test every possibility
• no guarantee that answer is best (but, often can identify optimal solution)
• sacrifices accuracy (potentially) for speed
• uses "rules of thumb" or "shortcuts"
• e.g., BLAST & FASTA

Page 46: BCB 444/544

Today's Lab: focus on BLAST
Basic Local Alignment Search Tool

STEPS:
• Create list of every possible "word" (e.g., 3-11 letters) from query sequence
• Search database to identify sequences that contain matching words
• Score match of word with sequence, using a substitution matrix
• Extend match (seed) in both directions, while calculating alignment score at each step
• Continue extension until score drops below a threshold (due to mismatches)

High Scoring Segment Pair (HSP) - contiguous aligned segment pair (no gaps)
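The seed-and-extend steps can be sketched in a few lines. This is a deliberately simplified illustration and not the BLAST implementation: it uses exact word matches only, scores with +1 match and -1 mismatch instead of a substitution matrix, searches a single subject sequence instead of a database, and stops extending once the running score falls more than `max_drop` below the best score seen. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def extend(query, subject, qi, si, step, match, mismatch, max_drop):
    """Ungapped extension from (qi, si) in direction `step`; return (best gain, length kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > max_drop:
            break                      # score dropped too far below the best: stop
        qi += step
        si += step
    return best, best_len


def find_hsps(query, subject, w=3, match=1, mismatch=-1, max_drop=2):
    """Return a set of (score, q_start, q_end, s_start, s_end) ungapped segment pairs."""
    # Index every word of length w in the subject by its start position.
    index = defaultdict(list)
    for s in range(len(subject) - w + 1):
        index[subject[s:s + w]].append(s)
    hsps = set()
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            right, r_len = extend(query, subject, q + w, s + w, 1, match, mismatch, max_drop)
            left, l_len = extend(query, subject, q - 1, s - 1, -1, match, mismatch, max_drop)
            hsps.add((w * match + left + right,
                      q - l_len, q + w + r_len,
                      s - l_len, s + w + r_len))
    return hsps


hsps = find_hsps("GATTACA", "CCGATTACAGG")
for score, q0, q1, s0, s1 in sorted(hsps, reverse=True):
    print(score, "query", (q0, q1), "subject", (s0, s1))
```

End positions are exclusive, so `query[q0:q1]` and `subject[s0:s1]` give the two aligned segments. Different seeds that extend to the same segment pair collapse into one entry because the results are kept in a set.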

Page 47: BCB 444/544

Lab3: focus on BLAST
Basic Local Alignment Search Tool

BLAST Results?

• Original version of BLAST? List of HSPs = Maximum Scoring Pairs

• More recent, improved version of BLAST? Allows gaps: Gapped Alignment

How? Allows score to drop below threshold (but only temporarily)

Page 48: BCB 444/544

BLAST - a few details

Developed by Stephen Altschul at NCBI in 1990

• Word length?
  • Typically: 3 aa for protein sequence, 11 nt for DNA sequence
• Substitution matrix?
  • Default is BLOSUM62
  • Can change under Algorithm Parameters
  • Choose other BLOSUM or PAM matrices
• Stop-Extension Threshold?
  • Typically: 22 for proteins, 20 for DNA

Page 49: BCB 444/544

BLAST - Statistical Significance?

1. E-value: E = m x n x P
   m = total number of residues in database
   n = number of residues in query sequence
   P = probability that an HSP is result of random chance
   lower E-value, less likely to result from random chance, thus higher significance

2. Bit Score: S'
   normalized score, to account for differences in sequence length & size of database

3. Low Complexity Masking
   remove repeats that confound scoring
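A small arithmetic example of the E-value formula as written on this slide. The database size, query length and probability below are made-up numbers chosen only to show the calculation.

```python
def e_value(m, n, p):
    """E = m x n x P, with m = residues in database, n = residues in query."""
    return m * n * p


# Hypothetical inputs: 1,000,000 database residues, 250 query residues, P = 1e-12.
e = e_value(m=1_000_000, n=250, p=1e-12)
print(f"E = {e:.2e}")
```

Because E scales with m, the same HSP gets a larger (less significant) E-value when it is found in a larger database.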

