an introduction to bioinformatics comparing biological sequences (3): database searching and...

70
An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

Post on 20-Dec-2015

232 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Comparing biological sequences (3): Database searching and Multiple alignment

Page 2: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

• Goal: find similar (homologous) sequences of a query sequence in a sequence of database

• Input: query sequence & database

• Output: hits (pairwise alignments)

Database searching

Page 3: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Database searching

• Core: pair-wise alignment algorithm• Speed (fast sequence comparison)• Relevance of the search results (statistical

tests)• Recovering all information of interest

• The results depend of the search parameters like gap penalty, scoring matrix.

• Sometimes searches with more than one matrix should be preformed

Page 4: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

What program to use for searching?

1) BLAST is fastest and easily accessed on the Web• limited sets of databases• nice translation tools (BLASTX, TBLASTN)

2) FASTA• precise choice of databases• more sensitive for DNA-DNA comparisons• FASTX and TFASTX can find similarities in sequences with

frameshifts

3) Smith-Waterman is slower, but more sensitive • known as a “rigorous” or “exhaustive” search• SSEARCH in GCG and standalone FASTA

Page 5: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

FASTA1) Derived from logic of the dot plot

• compute best diagonals from all frames of alignment

2) Word method looks for exact matches between words in query and test sequence• hash tables (fast computer technique)• DNA words are usually 6 bases• protein words are 1 or 2 amino acids• only searches for diagonals in region of word

matches = faster searching

Page 6: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

FASTA Algorithm

Page 7: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Makes Longest Diagonal

3) after all diagonals found, tries to join diagonals by adding gaps

4) computes alignments in regions of best diagonals

Page 8: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

FASTA Alignments

Page 9: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to BioinformaticsFASTA Results - Histogram !!SEQUENCE_LIST 1.0(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6 Searching with both strands of the query. Scoring matrix: GenRunData:fastadna.cmp Constant pamfactor used Gap creation penalty: 16 Gap extension penalty: 4

Histogram Key: Each histogram symbol represents 4 search set sequences Each inset symbol represents 1 search set sequences z-scores computed from opt scoresz-score obs exp (=) (*)< 20 0 0: 22 0 0: 24 3 0:= 26 2 0:= 28 5 0:== 30 11 3:*== 32 19 11:==*== 34 38 30:=======*== 36 58 61:===============* 38 79 100:==================== * 40 134 140:==================================* 42 167 171:==========================================* 44 205 189:===============================================*==== 46 209 192:===============================================*===== 48 177 184:=============================================*

Page 10: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

FASTA Results - List

The best scores are: init1 initn opt z-sc E(1018780)..

SW:PPI1_HUMAN Begin: 1 End: 269

! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117

SW:PPI1_RABIT Begin: 1 End: 269

! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116

SW:PPI1_RAT Begin: 1 End: 270

! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116

SW:PPI1_MOUSE Begin: 1 End: 270

! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116

SW:PPI2_HUMAN Begin: 1 End: 270

! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96

SPTREMBL_NEW:BAC25830 Begin: 1 End: 270

! Bac25830 mus musculus (mouse). 10, ... 1488 1488 1522 1847.6 4.2e-95

SP_TREMBL:Q8N5W1 Begin: 1 End: 268

! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95

SW:PPI2_RAT Begin: 1 End: 269

! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94

Page 11: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

FASTA Results - AlignmentSCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58

>>GB_IN3:DMU09374 (2038 nt) initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58 66.2% identity in 875 nt overlap (83-957:151-1022)

60 70 80 90 100 110 u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC || ||| | ||||| | ||| |||||DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC 130 140 150 160 170 180

120 130 140 150 160 170 u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA ||||||||| || ||| | | || ||| | || || ||||| || DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC 190 200 210 220 230 240

180 190 200 210 220 230 u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC ||| | ||||| || ||| |||| | || | |||||||| || ||| ||DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC 250 260 270 280 290 300

240 250 260 270 280 290 u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC |||||||||| ||||| | |||||| |||| ||| || ||| || | DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT 310 320 330 340 350 360

Page 12: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

FASTA on the Web

Many websites offer FASTA searches• Various databases and various other services• Be sure to use FASTA 3

• Each server has its limits• Be aware that you are depending on the

kindness of strangers.

Page 13: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Institut de Génétique Humaine, Montpellier France, GeneStream server

http://www2.igh.cnrs.fr/bin/fasta-guess.cgiOak Ridge National Laboratory GenQuest server

http://avalon.epm.ornl.gov/European Bioinformatics Institute, Cambridge, UK

http://www.ebi.ac.uk/htbin/fasta.py?requestEMBL, Heidelberg, Germany

http://www.embl-heidelberg.de/cgi/fasta-wrapper-freeMunich Information Center for Protein Sequences (MIPS)at Max-Planck-Institut, Germany

http://speedy.mips.biochem.mpg.de/mips/programs/fasta.htmlInstitute of Biology and Chemistry of Proteins Lyon, France

http://www.ibcp.fr/serv_main.htmlInstitute Pasteur, France

http://central.pasteur.fr/seqanal/interfaces/fasta.htmlGenQuest at The Johns Hopkins University

http://www.bis.med.jhmi.edu/Dan/gq/gq.form.htmlNational Cancer Center of Japan

http://bioinfo.ncc.go.jp

Page 14: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST Searches GenBank[BLAST= Basic Local Alignment Search Tool]

The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank:

• nr = non-redundant (main sections)• month = new sequences from the past few weeks• ESTs• human, drososphila, yeast, or E.coli genomes• proteins (by automatic translation)

• This is a VERY fast and powerful computer.

Page 15: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST• Uses word matching like FASTA• Similarity matching of words (3 aa’s, 11 bases)

• does not require identical words.

• If no words are similar, then no alignment• won’t find matches for very short sequences

• Does not handle gaps well• New “gapped BLAST” (BLAST 2) is better

Page 16: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST Algorithm

Page 17: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST Word MatchingMEAAVKEEISVEDEAVDKNI

MEA EAA AAV AVK VKE KEE EEI EIS ISV

...

Break query into words:

Break database sequences

into words:

Page 18: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Database Sequence Word Lists RTT AAQ

SDG KSSSRW LLNQEL RWYVKI GKGDKI NISLFC WDVAAV KVRPFR DEI… …

Compare Word Lists

Query Word List:

MEAEAAAAVAVKVKLKEEEEIEISISV

?

Compare word lists by Hashing

(allow near matches)

Page 19: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT

TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY

IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH

MEAEAAAAVAVKKLVKEEEEIEISISV

Find locations of matching words in database sequences

Page 20: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Extend hits one base at a time

Page 21: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA

QSVFDYIYYGCYCGWGLG_GK__PRDA

•Use two word matches as anchors to build an alignment between the query and a database sequence.

•Then score the alignment.

Query:

Seq_XYZ:

E-val=10-13

Page 22: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

HSPs are Aligned Regions

• The results of the word matching and attempts to extend the alignment are segments

- called HSPs (High-scoring Segment Pairs) • BLAST often produces several short HSPs

rather than a single aligned region

Page 23: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST 2 algorithm• The NCBI’s BLAST website now both use

BLAST 2 (also known as “gapped BLAST”)

• This algorithm is more complex than the original BLAST

• It requires two word matches close to each other on a pair of sequences (i.e. with a gap) before it creates an alignment

Page 24: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Statistical tests

• Evaluate the probability of an event taking place by chance (at random).

• P-value• Randomized data• Distribution under the same setup

• Z-score• Chebyshev Inequality

Page 25: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST Statistics• E value is equivalent to standard P value

(based on Karlin-Altschul theorem)

• Significant if E < 0.05 (smaller numbers are more significant) • The E-value represents the likelihood that the

observed alignment is due to chance alone. A value of 1 indicates that an alignment this good would happen by chance with any random sequence searched against this database.

Page 26: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Program Query Database Comparison Common use

blastn DNA DNA DNA level Seek identical DNAsequences andsplicing patterns

blastp Protein Protein Protein level Find homologousproteins

blastx DNA Protein Protein level Analyze new DNAto find genes andseek homologousproteins

tblastn Protein DNA Protein level Search for genes inunannotated DNA

tblastx DNA DNA Protein level Discover genestructure

BLAST variants for different searchesa

(after S. Brenner, Trends Guide to Bioinformatics, 1998)

Page 27: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

BLAST is Approximate• BLAST makes similarity searches very

quickly because it takes shortcuts.• looks for short, nearly identical “words” (11 bases)

• It also makes errors • misses some important similarities• makes many incorrect matches

• easily fooled by repeats or skewed composition

Page 28: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Interpretation of output• very low E values (e-100) are homologs or

identical genes

• moderate E values are related genes

• long list of gradually declining of E values indicates a large gene family

• long regions of moderate similarity are more significant than short regions of high identity

Page 29: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Biological Relevance• It is up to you, the biologist to scrutinize these

alignments and determine if they are significant.

• Were you looking for a short region of nearly identical sequence or a larger region of general similarity?

• Are the mismatches conservative ones?

• Are the matching regions important structural components of the genes or just introns and flanking regions?

Page 30: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Borderline similarity• What to do with matches with E() values in

the 0.5 -1.0 range?

• this is the “Twilight Zone”

• retest these sequences and look for related hits (not just your original query sequence)

• similarity is transitive:

if A~B and B~C, then A~C

Page 31: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Position Specific Iterated BLAST• Collect all database sequence segments that

have been aligned with query sequence with E-value below set threshold (default 0.01)

• Construct position specific scoring matrix for collected sequences. Rough idea:• Align all sequences to the query sequence as the

template.• Assign weights to the sequences • Construct position specific scoring matrix

• Iterate

Page 32: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Motif finding• Observation : Some regions have been better

conserved than others during evolution

• Idea: By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain (motifs)

Page 33: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

PROSITE patterns

• PROSITE fingerprints are described by regular grammars• There is a number of programs that allow to search databases

for PROSITE patterns (example GCG package)

• Example [EDQH]-x-K-x-[DN]-G-x-R-[GACV]• Rules:

• Each position is separated by a hyphen• One character denotes residuum at a given position• […] denoted a set of allowed residues• (n) denotes repeat of n• (n,m) denoted repeat between n and m inclusive

Ex. ATP/GTP binding motive [SG]=X(4)-G-K-[DT]

Page 34: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple sequence alignment

Page 35: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Generalizing the Notion of Pairwise Alignment

• Alignment of 2 sequences is represented as a 2-row matrix• In a similar way, we represent alignment of 3

sequences as a 3-row matrix

A T _ G C G _ A _ C G T _ A A T C A C _ A

• Score: more conserved columns, better alignment

Page 36: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Alignments = Paths• Align 3 sequences: ATGC, AATC,ATGC

A A T -- C

A -- T G C

-- A T G C

Page 37: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Alignment Paths

0 1 1 2 3 4

A A T -- C

A -- T G C

-- A T G C

x coordinate

Page 38: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Alignment Paths• Align the 3 sequences: ATGC, AATC,ATGC

0 1 1 2 3 4

0 1 2 3 3 4

A A T -- C

A -- T G C

-- A T G C

x coordinate

y coordinate

Page 39: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Alignment Paths

0 1 1 2 3 4

0 1 2 3 3 4

A A T -- C

A -- T G C

0 0 1 2 3 4

-- A T G C

• Resulting path in (x,y,z) space:

(0,0,0)(1,1,0)(1,2,1) (2,3,2) (3,3,3) (4,4,4)

x coordinate

y coordinate

z coordinate

Page 40: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Aligning Three Sequences• Same strategy as

aligning two sequences• Use a 3-D “Manhattan

Cube”, with each axis representing a sequence to align

• For global alignments, go from source to sink

source

sink

Page 41: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

2-D vs 3-D Alignment Grid

V

W

2-D edit graph

3-D edit graph

Page 42: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

2-D cell versus 2-D Alignment Cell

In 3-D, 7 edges in each unit cube

In 2-D, 3 edges in each unit square

Page 43: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Architecture of 3-D Alignment Cell(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k) (i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

Page 44: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignment: Dynamic Programming

• si,j,k = max

(x, y, z) is an entry in the 3-D scoring matrix

si-1,j-1,k-1 + (vi, wj, uk)

si-1,j-1,k + (vi, wj, _ )

si-1,j,k-1 + (vi, _, uk)

si,j-1,k-1 + (_, wj, uk)

si-1,j,k + (vi, _ , _)

si,j-1,k + (_, wj, _)

si,j,k-1 + (_, _, uk)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

Page 45: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignment: Running Time

• For 3 sequences of length n, the run time is 7n3; O(n3)

• For k sequences, build a k-dimensional Manhattan, with run time (2k-1)(nk); O(2knk)

• Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time.

Page 46: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Profile Representation of Multiple Alignment

- A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G

A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4

Page 47: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Profile Representation of Multiple Alignment

In the past we were aligning a sequence against a sequence

Can we align a sequence against a profile?

Can we align a profile against a profile?

- A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G

A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4

Page 48: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Aligning alignments

• Given two alignments, can we align them?

x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2v GGACCT-----

Page 49: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Aligning alignments• Given two alignments, can we align them? • Hint: use alignment of corresponding profiles

x GGGCACTGCATy GGTTACGTC-- Combined Alignment z GGGAACTGCAG

w GGACGTACC-- v GGACCT-----

Page 50: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignment: Greedy Approach• Choose most similar pair of strings and combine into a

profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat

• This is a heuristic greedy method

u1= ACGTACGTACGT…

u2 = TTAATTAATTAA…

u3 = ACTACTACTACT…

uk = CCGGCCGGCCGG

u1= ACg/tTACg/tTACg/cT…

u2 = TTAATTAATTAA…

uk = CCGGCCGGCCGG…

k

k-1

Page 51: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Greedy Approach: Example

• Consider these 4 sequences

s1 GATTCAs2 GTCTGAs3 GATATTs4 GTCAGC

Page 52: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Greedy Approach: Example (cont’d)• There are = 6 possible alignments

2

4

s2 GTCTGAs4 GTCAGC (score = 2)

s1 GAT-TCAs2 G-TCTGA (score = 1)

s1 GAT-TCAs3 GATAT-T (score = 1)

s1 GATTCA--s4 G—T-CAGC(score = 0)

s2 G-TCTGAs3 GATAT-T (score = -1)

s3 GAT-ATTs4 G-TCAGC (score = -1)

Page 53: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Greedy Approach: Example (cont’d)

s2 and s4 are closest; combine:

s2 GTCTGAs4 GTCAGC

s2,4 GTCt/aGa/cA (profile)

s1 GATTCAs3 GATATTs2,4 GTCt/aGa/c

new set of 3 sequences:

Page 54: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Progressive Alignment

• Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.

• Progressive alignment works well for close sequences, but deteriorates for distant sequences• Gaps in consensus string are permanent• Use profiles to compare sequences

Page 55: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

ClustalW

• Popular multiple alignment tool today• ‘W’ stands for ‘weighted’ (different parts of

alignment are weighted differently).• Three-step process

1.) Construct pairwise alignments

2.) Build Guide Tree

3.) Progressive Alignment guided by the tree

Page 56: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Step 1: Pairwise Alignment

• Aligns each sequence again each other giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

v1 v2 v3 v4

v1 -v2 .17 -v3 .87 .28 -v4 .59 .33 .62 -

(.17 means 17 % identical)

Page 57: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Step 2: Guide Tree

•Create Guide Tree using the similarity matrix

•ClustalW uses the neighbor-joining method

•Guide tree roughly reflects evolutionary relations

Page 58: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Step 2: Guide Tree (cont’d)

v1

v3

v4 v2

Calculate:v1,3 = alignment (v1, v3)v1,3,4 = alignment((v1,3),v4)v1,2,3,4 = alignment((v1,3,4),v2)

v1 v2 v3 v4

v1 -v2 .17 -v3 .87 .28 -v4 .59 .33 .62 -

Page 59: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Step 3: Progressive Alignment• Start by aligning the two most similar sequences

• Following the guide tree, add in the next sequences, aligning to the existing alignment

• Insert gaps as necessary

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

Page 60: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignments: Scoring

• Number of matches (multiple longest common subsequence score)

• Entropy score

• Sum of pairs (SP-Score)

Page 61: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple LCS Score

• A column is a “match” if all the letters in the column are the same

• Only good for very similar sequences

AAAAAAAATATC

Page 62: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Entropy

• Define frequencies for the occurrence of each letter in each column of multiple alignment• pA = 1, pT=pG=pC=0 (1st column)

• pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)

• pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)

• Compute entropy of each column

CGTAX

XX pp,,,

log

AAAAAAAATATC

Page 63: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Entropy: Example

0

A

A

A

A

entropy

2)24

1(4

4

1log

4

1

C

G

T

A

entropy

Best case

Worst case

Page 64: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

over all columns X=A,T,G,C pX logpX

Page 65: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Entropy of an Alignment: Example

column entropy: -( pAlogpA + pClogpC + pGlogpG + pTlogpT)

•Column 1 = -[1*log(1) + 0*log0 + 0*log0 +0*log0] = 0

•Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811

•Column 3 = -[(1/4)*log(1/4)+(1/4)*log(1/4)+(1/4)*log(1/4) +(1/4)*log(1/4)] = 4* -[(1/4)*(-2)] = +2.0

•Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

A A A

A C C

A C G

A C T

Page 66: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Page 67: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Sum of Pairs Score(SP-Score)

• Consider pairwise alignment of sequences

ai and aj

imposed by a multiple alignment of k sequences

• Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as

s*(ai, aj)

• Sum up the pairwise scores for a multiple alignment:

s(a1,…,ak) = Σi,j s*(ai, aj)

Page 68: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Computing SP-Score

Aligning 4 sequences: 6 pairwise alignments

Given a1,a2,a3,a4:

s(a1…a4) = s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)

Page 69: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

SP-Score: Examplea1

.ak

ATG-C-AATA-G-CATATATCCCATTT

ji

jik aaSaaS,

*1 ),()...(

2

nPairs of Sequences

A

A A

11

1

G

C G

1

Score=3 Score = 1 –

Column 1 Column 3

s s*(

To calculate each column:

Page 70: An Introduction to Bioinformatics Comparing biological sequences (3): Database searching and Multiple alignment

An Introduction to Bioinformatics

Multiple Alignment: History

1975 SankoffFormulated multiple alignment problem and gave dynamic programming solution

1988 Carrillo-LipmanBranch and Bound approach for MSA

1990 Feng-DoolittleProgressive alignment

1994 Thompson-Higgins-Gibson-ClustalWMost popular multiple alignment program

1998 Morgenstern et al.-DIALIGNSegment-based multiple alignment

2000 Notredame-Higgins-Heringa-T-coffeeUsing the library of pairwise alignments

2004 MUSCLE