sequence alignment iii

18
Sequence Alignment III CIS 667 February 10, 2004

Upload: anastasia-howe

Post on 30-Dec-2015

31 views

Category:

Documents


3 download

DESCRIPTION

Sequence Alignment III. CIS 667 February 10, 2004. Extensions to the Basic Algorithm. We have seen that the basic dynamic programming algorithm can be used to solve Global alignment Semi-global alignment Local alignment - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Sequence Alignment III

Sequence Alignment III

CIS 667 February 10, 2004

Page 2: Sequence Alignment III

Extensions to the Basic Algorithm

• We have seen that the basic dynamic programming algorithm can be used to solve Global alignment Semi-global alignment Local alignment

• We can extend the algorithm to more accurately reflect the cost of gap penalties

Page 3: Sequence Alignment III

General Gap Penalty Functions

• Gaps are caused by mutations It is more likely to have a single large

gap than several smaller ones Gap penalties should reflect that Let w(k) denote a gap penalty function

(the cost of a gap of k spaces) We have been using w(k) = bk - a linear

function

Page 4: Sequence Alignment III

General Gap Penalty Functions

• We can modify the basic algorithm to compute a score with a general gap penalty w (i.e. any function)

• The modified algorithm is slower, however - O(n3) The new algorithm scoring scheme is no

longer additive The space required is also larger

Page 5: Sequence Alignment III

Affine Gap Penalty Functions

• Can we do better than O(n3) and still have a reasonable function? Yes. We need to have w(k) kw(1) An affine function - w(k) = h + gk with

w(0) = 0 and h, g > 0 works Think of h as the cost of opening a gap

and g as the cost of extending a gap We can develop an algorithm with time

complexity O(n2)

Page 6: Sequence Alignment III

Gap Penalties - Overview

Imagine we want to align: CAGTCCAAGGTTCAGT

Bad alignment: C-A-G-T-----CCAAGGTTCAGT

Better alignment: --------CAGTCCAAGGTTCAGT

Gap cost with linear gap penalty (-2)

-16

-16

Gap cost with affine gap penalty (h = -2, k = -1)

-12

-9

Page 7: Sequence Alignment III

Multiple Sequence Alignment

• Once a protein sequence is newly determined, an important goal is to assign possible functions to it First search for similar sequences in the

DNA and protein sequence databases If more than one similar sequence is

found, the next step is to multiply align all of the sequences

Page 8: Sequence Alignment III

Multiple Sequence Alignment

• Multiple alignments are key starting point for Protein secondary structure prediction Residue accessibility Function

• Also provide the basis for the most sensitive sequence searching algorithms

Page 9: Sequence Alignment III

Multiple Sequence Alignment

• A multiple sequence alignment is simply an alignment that contains more than two sequences

MPQILLLMLR-LL-MK-ILLLMPPVLIL

Page 10: Sequence Alignment III

Multiple Sequence Alignment

• We must decide how to score a multiple alignment

• One possibility is the sum-of-pairs function Simply add up the pairwise scores of all

pairs in a column to get the score of the column Note that in multiple sequence alignment we

may have two spaces in a column - the score of (-,-) then is usually set to 0

Page 11: Sequence Alignment III

Multiple Sequence Alignment

• A straightforward dynamic programming approach to multiple sequence alignment results in an exponential algorithm Heuristics can be used to reduce the

complexity in most cases

Page 12: Sequence Alignment III

Multiple Sequence Alignment

• Automatic alignment programs such as CLUSTAL W can be used to produce multiple alignments

• The PSI-BLAST program uses multiple sequence alignments to make more sensitive searches of protein sequence databases than is possible with a single sequence

Page 13: Sequence Alignment III

PAM Matrices

• When comparing protein sequences, we need a more complex scoring scheme A mismatch with two amino acids with

similar biochemical properties should score higher than one with two dissimilar ones Evolution is more likely to result in a similar

amino acid (e.g. same size, both hydrophobic, etc.) replacing another

Page 14: Sequence Alignment III

PAM Matrices

• PAM - Point Accepted Mutations or Percent of Accepted Mutations 1-PAM matrix reflects an amount of

evolution producing on average one mutation per hundred amino acids

250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart Works well for long, weakly similar sequences Small values good for short, similar sequences

Page 15: Sequence Alignment III

PAM-250 Matrix

Page 16: Sequence Alignment III

BLOSUM Matrices

• Another widely used set of matrices is BLOSUM - Blocks Substitution Matrix BLOSUM is often better for highly

divergent sequences PAM better for more highly similar

sequences

Page 17: Sequence Alignment III

BLAST

• BLAST - Basic Local Alignment Search Tool is a family of sequence similarity tools Can be used to search sequence

databases worldwide Can be run locally, or via web-based

interface on a server Given a query sequence, BLAST returns

all matches above a user-defined threshold

Page 18: Sequence Alignment III

BLAST

• BLAST uses a heuristic technique Compile list of high-scoring words (use

PAM matrix to score words w characters long)

Search for matches in the database (use a hash table to speed up search) - call a match a seed

Extend the seeds in both directions until the score of the extension falls below a limit