9/14/07 BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment 1 BCB 444/544 Lecture 11 First BLAST vs FASTA Plus some Gene Jargon Multiple Sequence Alignment (MSA) #11_Sept14


Page 1: BCB 444/544

9/14/07BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment 1

BCB 444/544

Lecture 11

First: BLAST vs FASTA, plus some Gene Jargon

Multiple Sequence Alignment (MSA)

#11_Sept14

Page 2: BCB 444/544

Required Reading (before lecture)

√ Mon Sept 10 - for Lecture 9/10: BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62

√ Wed Sept 12 - for Lecture 11 & Lab 4: Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74

Fri Sept 14 - for Lecture 12: Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)

• Good additional resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

Page 3: BCB 444/544

Assignments & Announcements - #1

Revised Grading Policy has been sent via email. Please review!

√ Mon Sept 10 - Lab 3 Exercise due 5 PM: to: [email protected]

? Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4

Fri Sept 14 - HW#2 due by 5 PM (106 MBB)

Study Guide for Exam 1 will be posted by 5 PM

Page 4: BCB 444/544

Review: Gene Jargon #1 (for HW2, 1c)

Exons = "protein-encoding" (or "kept") parts of eukaryotic genes

vs

Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons

• Introns are transcribed into pre-RNA
• but are later removed by RNA processing
• & do not appear in mature mRNA
• so are not translated into protein

Page 5: BCB 444/544

Assignments & Announcements - #2

Mon Sept 17 - Answers to HW#2 will be posted by 5 PM

Thu Sept 20 - Lab = Optional Review Session for Exam

Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
  Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming

Page 6: BCB 444/544

Chp 4- Database Similarity Searching

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 4

Database Similarity Searching

• √ Unique Requirements of Database Searching
• √ Heuristic Database Searching
• √ Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Page 7: BCB 444/544

Why search a database?

• Given a newly discovered gene,
  • Does it occur in other species?
  • Is its function known in another species?

• Given a newly sequenced genome, which regions align with genomes of other organisms?
  • Identification of potential genes
  • Identification of other functional parts of chromosomes

• Find members of a multigene family

Page 8: BCB 444/544

FASTA and BLAST

• FASTA
  • User defines value for k = word length
  • Slower, but more sensitive than BLAST at lower values of k (preferred for searches involving a very short query sequence)

• BLAST family
  • Family of different algorithms optimized for particular types of queries, such as searching for distantly related sequence matches
  • BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy

• Both FASTA and BLAST are based on heuristics
• Tradeoff: Sensitivity vs Speed
• DP is slower, but more sensitive

Page 9: BCB 444/544

BLAST algorithms can generate both "global" and "local" alignments

Global alignment

Local alignment

Page 10: BCB 444/544

BLAST - a Family of Programs: Different BLAST "flavors"

• BLASTP - protein sequence query against protein DB

• BLASTN - DNA/RNA seq query against DNA DB (GenBank)

• BLASTX - 6-frame translated DNA seq query against protein DB

• TBLASTN - protein query against 6-frame DNA translation

• TBLASTX - 6-frame DNA query to 6-frame DNA translation

• PSI-BLAST - protein "profile" query against protein DB

• PHI-BLAST - protein pattern against protein DB

• Newest: MEGA-BLAST - optimized for highly similar sequences

http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

Which tool should you use?

Page 11: BCB 444/544

Detailed Steps in BLAST algorithm

1. Remove low-complexity regions (LCRs)

2. Make a list (dictionary): all words of length 3aa or 11 nt

3. Augment list to include similar words

4. Store list in a search tree (data structure)

5. Scan database for occurrences of words in search tree

6. Connect nearby occurrences

7. Extend matches (words) in both directions

8. Prune list of matches using a score threshold

9. Evaluate significance of each remaining match

10. Perform Smith-Waterman to get alignment

Page 12: BCB 444/544

1: Filter low-complexity regions (LCRs)

$$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$$

where:
L = window length (usually 12)
N = alphabet size (4 or 20)
n_i = frequency of ith letter in the window

• Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology.

• Low complexity sequences can yield false positives.

• Screen them out of your query sequences! When appropriate!

K = computational complexity; varies from 0 (very low complexity) to 1 (high complexity)

e.g., for GGGG:
L! = 4! = 4x3x2x1 = 24
nG = 4, nT = nA = nC = 0
∏ ni! = 4! x 0! x 0! x 0! = 24
K = 1/4 log4(24/24) = 0

For CGTA:
K = 1/4 log4(24/1) = 0.57
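The two worked values above can be checked with a short script. This is a minimal sketch of the complexity formula as written on the slide; the function name `complexity` is an illustrative choice, not part of BLAST.

```python
import math
from collections import Counter


def complexity(window: str, alphabet_size: int) -> float:
    """K = (1/L) * log_N( L! / prod(n_i!) ) for a single window."""
    length = len(window)
    counts = Counter(window)  # n_i for every letter present in the window
    denominator = math.prod(math.factorial(n) for n in counts.values())
    multinomial = math.factorial(length) // denominator
    return math.log(multinomial, alphabet_size) / length


print(round(complexity("GGGG", 4), 2))  # 0.0
print(round(complexity("CGTA", 4), 2))  # 0.57
```

Letters that do not occur in the window contribute 0! = 1 to the product, so they can be left out of the count.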

This slide has been changed!

Page 13: BCB 444/544

2: List all words in query

YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ

YGG GGF GFM FMT MTS TSE SEK …
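Listing the words amounts to sliding a window of length w along the query. A minimal sketch, assuming w = 3 for protein and 1-based positions; the helper name `query_words` is hypothetical:

```python
def query_words(query: str, w: int = 3) -> list[tuple[int, str]]:
    """Return (1-based position, word) for every overlapping word of length w."""
    return [(i + 1, query[i:i + w]) for i in range(len(query) - w + 1)]


query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
words = query_words(query)
print(words[:7])
```

A query of length n yields n - w + 1 overlapping words.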

Page 14: BCB 444/544

3: Augment word list

YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ

YGG GGF GFM FMT MTS TSE SEK …

AAA AAB AAC … YYY

20^3 = 8000 possible matches

Page 15: BCB 444/544

3: Augment word list

BLOSUM62 scores

G G F
A A A
0 + 0 + -2 = -2    Non-match

G G F
G G Y
6 + 6 + 3 = 15    Match

A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches

Page 16: BCB 444/544

3: Augment word list

YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ

YGG GGF GFM FMT MTS TSE SEK …

GGI GGL GGM GGF GGW GGY …

Page 17: BCB 444/544

3: Augment word list

Observation:

Selecting only words with score > T greatly reduces number of possible matches

otherwise, 20^3 = 8000 for 3-letter words from amino acid sequences!

Page 18: BCB 444/544

Example

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Find all words that match EAM with a score greater than or equal to 11

EAM 5 + 4 + 5 = 14
DAM 2 + 4 + 5 = 11
QAM 2 + 4 + 5 = 11
ESM 5 + 1 + 5 = 11
EAL 5 + 4 + 2 = 11
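The EAM example can be reproduced by brute force: score every one of the 20^3 candidate words against the query word and keep those at or above the threshold. The sketch below is an illustration only. The matrix is transcribed from the table on this slide, the names `word_score` and `neighborhood` are hypothetical, and real BLAST implementations build the list more efficiently.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows follow the order of AMINO_ACIDS (transcribed from the slide's table).
BLOSUM62_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Map every (row letter, column letter) pair to its score.
BLOSUM62 = {}
for a, row in zip(AMINO_ACIDS, BLOSUM62_ROWS.strip().splitlines()):
    for b, value in zip(AMINO_ACIDS, row.split()):
        BLOSUM62[a, b] = int(value)


def word_score(word: str, candidate: str) -> int:
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(BLOSUM62[a, b] for a, b in zip(word, candidate))


def neighborhood(word: str, threshold: int) -> dict[str, int]:
    """All words of the same length scoring >= threshold against `word`."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


for candidate, score in sorted(neighborhood("EAM", 11).items(), key=lambda kv: -kv[1]):
    print(candidate, score)
```

With the threshold at 11 this returns exactly the five words listed above. Lowering it to 10 gives the twelve-word list used in the later search tree example.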

Page 19: BCB 444/544

4: Store words in search tree

Search tree

Augmented list of query words

“Does this query contain GGF?”

“Yes, at position 2.”

Page 20: BCB 444/544

Search tree

[Figure: search tree with branch G, then G, then leaves F, L, M, W, Y]

Words stored: GGF GGL GGM GGW GGY

Page 21: BCB 444/544

Example

Put this word list into a search tree

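DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV

[Figure: search tree for the word list. First level: D, Q, E, K. Second level: A under D, Q and K; A, C, G, S, T, V under E. Third level: M under every second-level node, plus I, L, V under E-A.]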
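One way to realize such a search tree is a trie built from nested dictionaries. This is a minimal sketch under that assumption; the slides do not say which data structure BLAST uses internally, and the names `build_trie` and `contains` are hypothetical.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word
    return root


def contains(trie, word):
    """Walk the trie letter by letter; True only if the whole word was stored."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


WORDS = "DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV".split()
trie = build_trie(WORDS)

print(sorted(trie))                  # first level
print(sorted(trie["E"]))             # second level under E
print(sorted(trie["E"]["A"]))        # third level under E-A
print(contains(trie, "EAL"), contains(trie, "EAK"))
```

Words that share a prefix share a path, so the twelve words need only four first-level branches.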

Page 22: BCB 444/544

5: Scan the database sequences

[Figure: dot plot; x-axis: Database sequence; y-axis: Query sequence]

Page 23: BCB 444/544

Example

Scan this "database" for occurrences of your words

MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENAEAMPQLSVDAM
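The scan itself can be sketched as a single pass over the database string, testing each 3-letter window against the word list. Two assumptions are made here: a Python set stands in for the search tree, and the database string is treated as one continuous sequence (the stray spaces in the transcript are taken to be extraction artifacts). The name `scan` is hypothetical.

```python
def scan(database: str, words: set[str], w: int = 3) -> list[tuple[int, str]]:
    """Return (0-based offset, word) for every window of the database found in `words`."""
    hits = []
    for i in range(len(database) - w + 1):
        window = database[i:i + w]
        if window in words:
            hits.append((i, window))
    return hits


WORDS = set("DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV".split())
DATABASE = (
    "MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHG"
    "SNVNIHRLVEGNVKAMENAEAMPQLSVDAM"
)

for offset, word in scan(DATABASE, WORDS):
    print(offset, word)
```

Each reported offset is a seed for the later connection and extension steps.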

Page 24: BCB 444/544

6: Connect nearby occurrences (diagonal matches in Gapped BLAST)

[Figure: dot plot of word hits; x-axis: Database sequence; y-axis: Query sequence]

Two dots are connected if and only if they are less than A letters apart and lie on the same diagonal
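The connection rule can be sketched by grouping hits by diagonal (database position minus query position) and pairing neighbours that are less than A apart. The hit coordinates and the value A = 10 below are hypothetical illustration data, and `connect_hits` is not a BLAST function name.

```python
from collections import defaultdict


def connect_hits(hits, max_distance):
    """Pair up consecutive hits on the same diagonal that are < max_distance apart.

    Each hit is a (query_position, database_position) tuple.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append((q, d))  # same d - q means same diagonal

    pairs = []
    for diagonal_hits in by_diagonal.values():
        diagonal_hits.sort()
        for first, second in zip(diagonal_hits, diagonal_hits[1:]):
            if second[0] - first[0] < max_distance:
                pairs.append((first, second))
    return pairs


# Hypothetical hits: three on diagonal 8, one on diagonal 35.
hits = [(2, 10), (6, 14), (30, 38), (5, 40)]
print(connect_hits(hits, max_distance=10))
```

Only the first two hits are connected: they share a diagonal and sit 4 letters apart, while the third is 24 letters further along and the fourth lies on a different diagonal.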

Page 25: BCB 444/544

7: Extend matches in both directions

[Figure: DB scan, with matches extended in both directions]

Page 26: BCB 444/544

7: Extend matches, calculating score at each step

• Each match is extended to left & right until a negative BLOSUM62 score is encountered
• Extension step typically accounts for > 90% of execution time

            L  P  P  Q  G  L  L    Query sequence
            M  P  P  E  G  L  L    Database sequence
<word>            7  2  6          BLOSUM62 scores; word score = 15
<--- --->   2  7  7  2  6  4  4    HSP SCORE = 32 (High Scoring Pair)
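The extension rule stated on this slide can be sketched as follows. This is a simplified illustration: it stops at the first negative pair score or at a sequence end, exactly as the slide describes, whereas production BLAST uses a score drop-off criterion. Only the BLOSUM62 entries needed for this example are included, and the names `pair_score` and `extend` are hypothetical.

```python
# Subset of BLOSUM62 sufficient for the slide's example (assumption: symmetric lookup).
SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score in either order."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def extend(query: str, db: str, q_start: int, d_start: int, w: int = 3):
    """Extend a word hit in both directions; return (score, query_start, query_end)."""
    score = sum(pair_score(query[q_start + k], db[d_start + k]) for k in range(w))

    # Extend to the left.
    q, d = q_start - 1, d_start - 1
    while q >= 0 and d >= 0 and pair_score(query[q], db[d]) >= 0:
        score += pair_score(query[q], db[d])
        q, d = q - 1, d - 1
    left = q + 1

    # Extend to the right.
    q, d = q_start + w, d_start + w
    while q < len(query) and d < len(db) and pair_score(query[q], db[d]) >= 0:
        score += pair_score(query[q], db[d])
        q, d = q + 1, d + 1
    right = q  # exclusive end

    return score, left, right


# The word PQG/PEG starts at offset 2 in both sequences.
print(extend("LPPQGLL", "MPPEGLL", 2, 2))  # (32, 0, 7)
```

The word alone scores 15, and the extension adds 9 on the left and 8 on the right, giving the HSP score of 32.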

Page 27: BCB 444/544

8: Prune matches

• Discard all matches that score below defined threshold

Page 28: BCB 444/544

9: Evaluate significance

• BLAST uses an analytical statistical significance calculation

RECALL:

1. E-value: E = m x n x P

m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random chance

Lower E-value: less likely to result from random chance, thus higher significance

2. Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n). Note (below) that bit score is linearly related to raw alignment score, so: higher S' means alignment has higher significance

S' = (λ x S - ln K) / ln 2

where:
λ = Gumbel distribution constant
S = raw alignment score
K = constant associated with scoring matrix

For more details - see text & BLAST tutorial

This slide has been changed!
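A short numeric sketch may help connect the two quantities. It uses the bit score formula from this slide together with the standard Karlin-Altschul relation E = m x n x 2^(-S'). The values of λ, K, m and n below are illustrative assumptions, not values taken from the lecture.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance HSPs with at least this bit score: m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative (assumed) parameters.
LAMBDA, K = 0.267, 0.041
m, n = 1_000_000_000, 300  # database residues, query residues

for raw in (40, 60, 80):
    bits = bit_score(raw, LAMBDA, K)
    print(raw, round(bits, 1), f"{e_value(bits, m, n):.2e}")
```

Because S' is linear in S, each additional bit halves the expected number of chance hits for a fixed search space.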

Page 29: BCB 444/544

10: Use Smith-Waterman algorithm (DP) to generate alignment

• ONLY significant matches are re-analyzed using Smith-Waterman DP algorithm.
• Alignments reported by BLAST are produced by dynamic programming

Page 30: BCB 444/544

BLAST: What is a "Hit"?

• A hit is a w-length word in database that aligns with a word from query sequence with score > T

• BLAST looks for hits instead of exact matches
  • Allows word size to be kept larger for speed, without sacrificing sensitivity

• Typically, w = 3-5 for amino acids, w = 11-12 for DNA

• T is the most critical parameter:
  • ↑T ↓ "background" hits (faster)
  • ↓T ↑ ability to detect more distant relationships (at cost of increased noise)

Page 31: BCB 444/544

Tips for BLAST Similarity Searches

• If you don't know, use default parameters first
• Try several programs & several parameter settings
• If possible, search on protein sequence level

• Scoring matrices:
  PAM1 / BLOSUM80: if expect/want less divergent proteins
  PAM120 / BLOSUM62: "average" proteins
  PAM250 / BLOSUM45: if need to find more divergent proteins

• Proteins:
  >25-30% identity (and >100aa) -> likely related
  15-25% identity -> twilight zone
  <15% identity -> likely unrelated

Page 32: BCB 444/544

Practical Issues

Searching on DNA or protein level? In general, protein-encoding DNA should be translated!

• DNA yields more random matches:
  • 25% for DNA vs. 5% for proteins
• DNA databases are larger and grow faster
• Selection (generally) acts on protein level
  • Synonymous mutations are usually neutral
  • DNA sequence similarity decays faster

Page 33: BCB 444/544

BLAST vs FASTA

• Seeding:
  • BLAST integrates scoring matrix into first phase
  • FASTA requires exact matches (uses hashing)
  • BLAST increases search speed by finding fewer, but better, words during initial screening phase
  • FASTA uses shorter word sizes - so can be more sensitive

• Results:
  • BLAST can return multiple best scoring alignments
  • FASTA returns only one final alignment

Page 34: BCB 444/544

BLAST & FASTA References

• FASTA - developed first
  • Pearson & Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448

• BLAST
  • Altschul, Gish, Miller, Myers, Lipman (1990) J. Mol. Biol. 215
  • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-402

Page 35: BCB 444/544

BLAST Notes - & DP Alternatives

• BLAST uses heuristics: it may miss some good matches
• But, it's fast: 50 - 100X faster than Smith-Waterman (SW) DP
• Large impact:
  • NCBI's BLAST server handles more than 100,000 queries/day
  • Most used bioinformatics program in the world!

But - Xiong says: "It has been estimated that for some families of protein sequences BLAST can miss 30% of truly significant matches."

• Increased availability of parallel processing has made DP-based approaches feasible:
• 2 DP-based web servers: both more sensitive than BLAST
  • Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html
    Implements modified SW optimized for parallel processing
  • ParAlign www.paralign.org - parallel SW or heuristics

Page 36: BCB 444/544

NCBI - BLAST Programs: Glossary & Tutorials

• http://www.ncbi.nlm.nih.gov/BLAST/

• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html

• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

BLAST

Page 37: BCB 444/544

Chp 5- Multiple Sequence Alignment

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 5

Multiple Sequence Alignment

• Scoring Function
• Exhaustive Algorithms
• Heuristic Algorithms
• Practical Issues

Page 38: BCB 444/544

Multiple Sequence Alignments

Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Page 39: BCB 444/544

Overview

1. What is a multiple sequence alignment (MSA)?

2. Where/why do we need MSA?

3. What is a good MSA?

4. Algorithms to compute a MSA

Page 40: BCB 444/544

Multiple Sequence Alignment

• Generalize pairwise alignment of sequences to include > 2 homologous sequences

• Analyzing more than 2 sequences gives us much more information:

• Which amino acids are required? Correlated? • Evolutionary/phylogenetic relationships

• Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

Page 41: BCB 444/544

What is a MSA?

ATTTG-
ATTTGC
AT-TGC
MSA

ATTTG
ATTTGC
ATT-GC
Not a MSA

ATTT-G-
ATTT-GC
AT-T-GC
Not a MSA

Why?

Page 42: BCB 444/544

Definition: MSA

Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that

• resulting sequences have same length
• no column contains only gaps

ATTTG-
ATTTGC
AT-TGC
YES

ATTTG
ATTTGC
ATT-GC
NO (rows have different lengths)

ATTT-G-
ATTT-GC
AT-T-GC
NO (column 5 contains only gaps)
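The two conditions in the definition translate directly into a small check. A minimal sketch, with the hypothetical helper name `is_msa`, applied to the three examples from the slide:

```python
def is_msa(rows: list[str]) -> bool:
    """True if all rows have equal length and no column consists only of gaps."""
    if len({len(row) for row in rows}) != 1:
        return False
    # zip(*rows) iterates over the alignment column by column.
    return not any(all(char == "-" for char in column) for column in zip(*rows))


print(is_msa(["ATTTG-", "ATTTGC", "AT-TGC"]))     # True
print(is_msa(["ATTTG", "ATTTGC", "ATT-GC"]))      # False: unequal lengths
print(is_msa(["ATTT-G-", "ATTT-GC", "AT-T-GC"]))  # False: an all-gap column
```

The check says nothing about whether the alignment is a good one, only whether it is well formed.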

Page 43: BCB 444/544

Displaying MSAs: using CLUSTAL W

* entirely conserved column

: all residues have ~ same size AND hydropathy

. all residues have ~ same size OR hydropathy

RED: AVFPMILW (small)

BLUE: DE (acidic, negative chg)

MAGENTA: RHK (basic, positive chg)

GREEN: STYHCNGQ (hydroxyl + amine + basic)

Page 44: BCB 444/544

What is a Consensus Sequence?

A single sequence that represents the most common residue of each column in a MSA

Example:

FGGHL-GF
F-GHLPGF
FGGHP-FG

FGGHL-GF   (consensus)

Steiner consensus sequence: Given sequences s1,…, sk, find a sequence s* that maximizes Σi S(s*,si)
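The majority-rule consensus from the example can be computed column by column. This sketch covers only the simple "most common residue" definition, not the Steiner consensus; the name `consensus` is illustrative, and ties are broken by whichever residue appears first in the column.

```python
from collections import Counter


def consensus(rows: list[str]) -> str:
    """Most common character (residue or gap) in each alignment column."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


alignment = ["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]
print(consensus(alignment))  # FGGHL-GF
```

In this example the consensus happens to equal the first row, but in general it need not match any single input sequence.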

Page 45: BCB 444/544

Applications of MSA

• Building phylogenetic trees
• Finding conserved patterns, e.g.:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Page 46: BCB 444/544

Application: Recover Phylogenetic Tree

[Figure: phylogenetic tree relating the sequences NYLS, NYLS and NFLS]

What was series of events that led to current species?

Page 47: BCB 444/544

Application: Discover Conserved Patterns

Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent

TATA box = transcriptional promoter element

Is there a conserved cis-acting regulatory sequence?

Page 48: BCB 444/544

Goal: Characterize Protein Families

Which parts of globin sequences are most highly conserved?