Pair-wise Sequence Alignment

Upload: elijah-webster

Post on 25-Dec-2015

Pair-wise Sequence Alignment

•What happened to the sequences of similar genes?
  - random mutation
  - deletion, insertion

Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG
            EVI   ++ P++ ++DV+SY
Seq. 2: 451 EVI---EHKPYNHKADVFSYA

•Homology vs. similarity

•What is pair-wise sequence alignment?

•Why pair-wise alignment?

Some concepts

•Optimal alignment

•Global alignment

•Gaps

•Local alignment

•Gap penalty

•Substitution matrix

Dotplot

•What dotplot shows

•What dotplot does not show

•A simplified representation
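
A dotplot can be sketched in a few lines. The version below is a minimal illustration only: it marks every pair of identical symbols with `*`, with no windowing or filtering, and the function name `dotplot` is a hypothetical choice. Diagonal runs of `*` correspond to shared segments; the plot itself does not give a score or an explicit alignment.

```python
def dotplot(seq_x, seq_y):
    """Return one text row per symbol of seq_y; '*' marks identical symbols."""
    return ["".join("*" if a == b else "." for a in seq_x) for b in seq_y]


if __name__ == "__main__":
    x, y = "ACAGTAG", "ACTCG"
    print("  " + x)
    for symbol, row in zip(y, dotplot(x, y)):
        print(symbol, row)
```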

Sequence Alignment

•Dynamic programming
  - a method for some optimization problems
  - determine a scoring scheme
  - best solution based on a scoring scheme

•Total number of possible alignments of two sequences of length $n$: $\binom{2n}{n} \sim 2^{2n} / \sqrt{\pi n}$

•Needleman-Wunsch - global
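
To get a feel for how fast the number of alignments grows, the sketch below compares an exact count with its Stirling estimate. It assumes the count for two sequences of length n is taken to be the central binomial coefficient C(2n, n), which is approximately 2^(2n) / sqrt(πn); the function names are illustrative.

```python
import math


def exact_count(n):
    """Central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Stirling approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, exact_count(n), f"{stirling_estimate(n):.3e}")
```

Even for n = 50 the count is far beyond what exhaustive enumeration can handle, which is the motivation for dynamic programming.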

•Questions
  - How does it work?
  - How to come up with a DP approach to an exponential problem?
  - How to implement a DP approach?

Dynamic Programming Algorithm

•Break a problem into subproblems
•Solve each subproblem separately

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$

$s(x_i, y_j)$: substitution score for aligning $x_i$ with $y_j$

$g$: gap penalty

$F(i,j)$: the maximum score for aligning the first $i$ symbols of sequence 1 with the first $j$ symbols of sequence 2

Example

•Initialization
•Matrix filling (scoring)
•Trace back

Sequences: ACTCG and ACAGTAG

Match: 1, Mismatch: 0, Gap: -1

             A   C   A   G   T   A   G
j =      0   1   2   3   4   5   6   7
i=0      0  -1  -2  -3  -4  -5  -6  -7
i=1  A  -1   1   0  -1  -2  -3  -4  -5
i=2  C  -2   0   2   1   0  -1  -2  -3
i=3  T  -3  -1   1   2   1   1   0  -1
i=4  C  -4  -2   0   1   2   1   1   0
i=5  G  -5  -3  -1   0   2   2   1   2
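
The three steps (initialization, matrix filling, trace back) can be sketched as follows. This is a minimal illustration of the Needleman-Wunsch recurrence with the example's scores (match 1, mismatch 0, gap -1); the function name and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions, and other tie-breaking choices give other equally optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Return (F, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Matrix filling with the three-way max.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
    # Trace back from (m, n) to (0, 0).
    aligned_x, aligned_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        diagonal = (i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1]
                    + (match if x[i - 1] == y[j - 1] else mismatch))
        if diagonal:
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return F, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


if __name__ == "__main__":
    F, ax, ay = needleman_wunsch("ACTCG", "ACAGTAG")
    for row in F:
        print(" ".join(f"{v:3d}" for v in row))
    print(ax)
    print(ay)
```

Running it on ACTCG and ACAGTAG prints the same matrix as the table, with the optimal score in the bottom-right cell.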

Local Alignment: Smith-Waterman

•Biological significance

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \\ 0 \end{cases}$$

•$O(n^2)$ time

Local Alignment

        A  A  C  C  T  A  T  A  G  C  T
     0  0  0  0  0  0  0  0  0  0  0  0
G    0  0  0  0  0  0  0  0  0  1  0  0
C    0  0  0  1  1  0  0  0  0  0  2  1
G    0  0  0  0  0  0  0  0  0  1  0  1
A    0  1  1  0  0  0  1  0  1  0  0  0
T    0  0  0  0  0  1  0  2  1  0  0  1
A    0  1  1  0  0  0  2  0  3  2  1  0
T    0  0  0  0  0  1  1  3  2  2  1  2
A    0  1  1  0  0  0  2  2  4  3  2  1

AACCTATAGCT
    ||||
GCGATATA
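
A minimal Smith-Waterman sketch follows. The slides do not state the scoring scheme for this example, so match 1, mismatch -1, gap -1 are assumptions here, as is the function name. The only changes from the global version are the extra 0 in the max, starting the traceback at the largest cell, and stopping when a 0 is reached.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the largest cell until a cell with score 0 is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


if __name__ == "__main__":
    print(smith_waterman("GCGATATA", "AACCTATAGCT"))
```

Under these assumed scores the best local alignment of GCGATATA and AACCTATAGCT is the shared segment TATA with score 4, matching the four bars in the alignment shown on the slide.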

Issues in alignment

•Different ways to fill the table

•Multiple optimal alignments

•$s(x_i, y_j)$: from the substitution matrix

•Gap penalty:
  - linear: $w(k) = gk$
  - affine: $w(k) = h + gk$ for $k \ge 1$, and $w(k) = 0$ for $k = 0$

Gap models

•New gap vs. gap extension

•A gap of length k vs. k gaps of length 1

•1 insertion / deletion event vs. k events

•Gap penalty:
  - linear: $w(k) = gk$
  - affine: $w(k) = h + gk$ for $k \ge 1$, and $w(k) = 0$ for $k = 0$
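
The difference between the two gap models shows up when one gap of length k is compared with k gaps of length 1. The sketch below uses hypothetical values h = -3 (opening) and g = -1 (extension), written as negative scores to match the sign convention of the example where Gap = -1.

```python
def linear_gap(k, g):
    """Linear model: w(k) = g * k."""
    return g * k


def affine_gap(k, h, g):
    """Affine model: w(k) = h + g * k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k


if __name__ == "__main__":
    h, g = -3, -1  # hypothetical opening and extension scores
    print("linear, one gap of length 4: ", linear_gap(4, g))
    print("linear, four gaps of length 1:", 4 * linear_gap(1, g))
    print("affine, one gap of length 4: ", affine_gap(4, h, g))
    print("affine, four gaps of length 1:", 4 * affine_gap(1, h, g))
```

Under the linear model both cases score the same, so the model cannot tell one insertion/deletion event from k events. Under the affine model the single long gap is penalized less than the four separate gaps.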

Affine Gap Penalty

•What is wrong with the $F(i,j)$ formula if an affine gap penalty (AGP) is used?

•Three matrices

•Aligning the first $i$ symbols of $x$ with the first $j$ symbols of $y$:
  - $M(i,j)$: best score when $x_i$ is aligned with $y_j$
  - $I_x(i,j)$: best score when $x_i$ is aligned with a gap
  - $I_y(i,j)$: best score when $y_j$ is aligned with a gap

DP for global alignment for AGP

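$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \\ I_x(i-1,j) + g \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \\ I_y(i,j-1) + g \end{cases}$$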

DP for global alignment using AGP

•Initialization
  - $M(0,0) = 0$
  - $I_x(i,0) = h + gi$
  - $I_y(0,j) = h + gj$
  - all other cases: $-\infty$

•Start at the largest of the three elements $M(m,n)$, $I_x(m,n)$, $I_y(m,n)$

•Traceback to $(0,0)$
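
The three-matrix scheme can be sketched as below. This is a score-only illustration (no traceback) that follows the M, Ix, Iy recurrences and the initialization listed above; the function name, the default scores h = -3 and g = -1, and the match/mismatch values are assumptions for the example.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, h=-3, g=-1, match=1, mismatch=0):
    """Best global alignment score under the affine gap penalty w(k) = h + g*k."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    # Initialization: only leading gaps are possible on the borders.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # x_i aligned with y_j
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # x_i aligned with a gap: open a new gap (h + g) or extend one (g)
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            # y_j aligned with a gap
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)
    return max(M[m][n], Ix[m][n], Iy[m][n])


if __name__ == "__main__":
    print(affine_global_score("AAAA", "AA"))
    print(affine_global_score("ACTCG", "ACAGTAG", h=0))
```

With h = 0 the affine penalty reduces to the linear one, so the second call returns the same score as the single-matrix recurrence on the earlier example.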

DP for local alignment for AGP

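$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \\ 0 \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \quad \text{(ignored)} \\ I_x(i-1,j) + g \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \quad \text{(ignored)} \\ I_y(i,j-1) + g \end{cases}$$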

DP for Local Alignment for AGP

•Initialization
  - $M(0,0) = 0$
  - $I_x(i,0) = 0$
  - $I_y(0,j) = 0$
  - all other cases: $-\infty$

•Start at the largest of $M(i,j)$, $I_x(i,j)$, $I_y(i,j)$

•Traceback until $M(i,j) = 0$

Database searching methods

•Need more efficient methods

•Dynamic programming: $O(n^2 L)$, where $L$ is the size of the database

•Why is DP slow?

•Idea: regions that are similar are likely to share short identical subsequences

•Quick search for the regions, then check carefully locally

FASTA related methods

•Word, word size (2,6), sensitivity vs. speed

•Which words in the query also occur in the target?

•Pre-computed table that stores locations of words – “hashing”

•Heuristic approximation

1. Quick initial “guess” – common subsequences

•An example
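
As a small illustration of the word-lookup idea, the sketch below builds a table of word positions for the target, looks up every query word in it, and counts hits per diagonal (query position minus target position). It is a simplified stand-in rather than FASTA's actual implementation; the function names are hypothetical, and a Python dictionary plays the role of the hash table.

```python
from collections import Counter, defaultdict


def build_word_table(target, k):
    """Map each length-k word of the target to the list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(target) - k + 1):
        table[target[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, target, k=2):
    """Count shared words per diagonal, where diagonal = query pos - target pos."""
    table = build_word_table(target, k)
    hits = Counter()
    for q_pos in range(len(query) - k + 1):
        for t_pos in table.get(query[q_pos:q_pos + k], []):
            hits[q_pos - t_pos] += 1
    return hits


if __name__ == "__main__":
    query = "EVIRMQDNNPFSFQSDVYSYG"
    target = "EVIEHKPYNHKADVFSYA"
    for diagonal, count in sorted(diagonal_hits(query, target).items()):
        print(f"diagonal {diagonal:+d}: {count} shared word(s)")
```

Diagonals with many hits are the candidate regions that the later steps rescore and join. A larger word size makes the lookup faster but misses more weak similarities, which is the sensitivity vs. speed trade-off.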

FASTA related methods

2. Find the region with a high density of common words
•Process diagonals, rescore, join regions, using gaps

3. Local alignment (DP) in the region identified
•Use the Smith-Waterman method in a band, 32 aa wide, around the best score

Limitation of FASTA

•Speed vs. sensitivity

•Identical words are required in the initial step

•Can miss biologically significant similarity
  - some proteins do not share identical a.a.
  - different codons encode the same protein

BLAST

•Previous 2 kinds of approaches
  - theoretically sound
  - search for common subsequences

1. Word list
•Incorporate similarity measurement for words
  - PAM120
  - e.g. ACDE

•Scan for word occurrences
  - hash table
  - finite state machine

(Stephen F. Altschul et al., Nucleic Acids Research 1997, 25(17) 3389-3402)

BLAST

2. Extend words to HSPs (high-scoring segment pairs, i.e. locally optimal pairs)
•Find additional words within threshold
•Merge within distance A

3. Select significant HSPs, use DP in banded region
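
The extension of a word hit into an HSP can be illustrated with a simple ungapped extension that stops once the running score falls a fixed amount below the best score seen so far. The sketch is an assumption-laden simplification: it uses match/mismatch scores instead of a substitution matrix such as PAM120, the `x_drop` parameter and function names are hypothetical, and it covers only the extension, not the threshold or merging rules.

```python
def _extend(query, target, q, t, step, match, mismatch, x_drop):
    """Walk from (q, t) in direction step; return (best gain, symbols kept)."""
    gain = best_gain = 0
    moved = kept = 0
    while 0 <= q < len(query) and 0 <= t < len(target):
        gain += match if query[q] == target[t] else mismatch
        moved += 1
        if gain > best_gain:
            best_gain, kept = gain, moved
        elif best_gain - gain >= x_drop:
            break  # score dropped too far below the best; stop extending
        q += step
        t += step
    return best_gain, kept


def extend_hit(query, target, q_pos, t_pos, k, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit of length k without gaps.

    Returns (score, query start, target start, length) of the extended pair.
    """
    seed = sum(match if query[q_pos + d] == target[t_pos + d] else mismatch
               for d in range(k))
    right_gain, right_len = _extend(query, target, q_pos + k, t_pos + k, 1,
                                    match, mismatch, x_drop)
    left_gain, left_len = _extend(query, target, q_pos - 1, t_pos - 1, -1,
                                  match, mismatch, x_drop)
    return (seed + left_gain + right_gain,
            q_pos - left_len, t_pos - left_len, left_len + k + right_len)


if __name__ == "__main__":
    query, target = "XXABCDEFYY", "ZZZABCDEFWW"
    score, q_start, t_start, length = extend_hit(query, target, 4, 5, k=2)
    print(score, query[q_start:q_start + length], target[t_start:t_start + length])
```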

Mini Presentations

1. Previous BLAST
2. Major concepts in BLAST
3. Statistical issue
4. Gapped local alignment – Gapped
5. Position-specific scoring matrix (PSSM) – overall idea, architecture, multiple-alignment construction
6. PSSM – target frequency estimation, application to BLAST

(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)

Multiple Sequence Alignment

•Motivation

•What is MSA?

•How do we extend knowledge of pair-wise alignment?

•An example: AGAC, AC, AG

  AGAC    AGAC    AC
  --AC    AG--    AG

Some possibilities

  AG--
  --AC
  AGAC

•Fix pair-wise alignment and then add?
•Evaluate all the possible alignments of N sequences?

Scoring MSA

•Sum of pairs (SP) scoring method
Given an alignment of N sequences, each of length L (an L x N alignment): take the pair-wise sum for each column, then sum over all columns.

•Example (c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0)

  AQPILLLV
  ALR-LL--
  AK-ILLL-
  CPPVLILV

  $SP_4 = SP(I, -, I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7$
  $SP = SP_1 + SP_2 + \dots + SP_8$

•SP tends to overweight a single mutation: $SP(A, A, A, C) = 0$, $SP(A, A, A, A) = 6$
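
The sum-of-pairs computation is short enough to write out. The sketch below uses the costs from the example (match 1, mismatch -1, gap -2, gap against gap 0); the function names are illustrative.

```python
from itertools import combinations


def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one pair of symbols in a column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum the pair costs over all pairs of symbols in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """Sum the column scores over all columns of the alignment."""
    return sum(sp_column(column) for column in zip(*alignment))


if __name__ == "__main__":
    alignment = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
    print("column 4:", sp_column("I-IV"))
    print("total SP:", sp_score(alignment))
```

The column scores reproduce the values in the example: -7 for (I, -, I, V), 0 for (A, A, A, C) and 6 for (A, A, A, A). A single changed residue in a column of four removes three matches and adds three mismatches, which is why SP overweights a single mutation.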

Extension of DP for N sequences

•Extend $F(i,j)$ to N dimensions

•DP in N dimensions using SP
  Time: on the order of $L^N (2^N - 1) N^2 \sim O((2L)^N N^2)$

STAR method

•DP provides an optimal solution but is costly

•Heuristic methods: STAR, CLUSTALW, …
•Progressive alignment

•STAR
  - pair-wise alignment
  - build similarity matrix
  - find a “star” sequence
  - use the “star” to align the other sequences
  - once a gap, always a gap

STAR method

•Example
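
The first STAR steps (pair-wise scores, similarity matrix, choice of the “star”) can be sketched as follows. This is an illustration under assumptions: the star is taken to be the sequence with the largest total pair-wise score, the scores are match 1, mismatch 0, gap -1, and the four sequences are hypothetical. The final step of aligning every other sequence to the star is not shown.

```python
def global_score(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment score only, keeping one row of the DP table at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        cur = [i * gap]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def pick_star(seqs):
    """Return (index of the star sequence, pair-wise similarity matrix)."""
    n = len(seqs)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = global_score(seqs[i], seqs[j])
    totals = [sum(row) for row in sim]
    star = max(range(n), key=totals.__getitem__)
    return star, sim


if __name__ == "__main__":
    seqs = ["ACGA", "TCGT", "ACGT", "AGGT"]  # hypothetical input
    star, sim = pick_star(seqs)
    for row in sim:
        print(row)
    print("star sequence:", seqs[star])
```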

CLUSTAL family

•Build similarity tree: “clustering”
•Alignment starts at the most similar sequences

•What are the disadvantages of the STAR method?

1. Pair-wise alignment --> distance matrix
•Fast approximate approach or DP

CLUSTALW

2. Construct similarity tree, “the guide tree”
•UPGMA (unweighted pair-group method using arithmetic average)

3. Progressive alignment
•Start with most similar sequences
•Align group with group using pair-wise alignment
•e.g.
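
A guide tree can be built from a distance matrix with a few lines of UPGMA. The sketch below is a minimal illustration, not CLUSTALW's own code: it repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. The function name, the nested-tuple tree format and the sample distances are assumptions.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {frozenset((i, j)): dist[i][j]
         for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        a, b = sorted(pair)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            # Arithmetic average over all members of the merged cluster.
            d[frozenset((next_id, c))] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del d[pair]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    names = ["A", "B", "C", "D"]  # hypothetical sequences
    dist = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(names, dist))
```

Progressive alignment then follows the tree from the leaves upward: the closest pair is aligned first, and each later step aligns a sequence or group with the group built so far.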