from pairwise to multiple alignment
![Page 1: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/1.jpg)
From Pairwise to Multiple Alignment
![Page 2: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/2.jpg)
WHAT'S TODAY?
• Multiple Sequence Alignment - CLUSTAL
• MOTIF search
![Page 3: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/3.jpg)
Multiple Sequence Alignment
MSA
![Page 4: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/4.jpg)
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment, BUT compares n sequences instead of 2
Rows represent individual sequences
Columns represent ‘same’ position
Gaps allowed in all sequences
![Page 5: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/5.jpg)
How to find the best MSA
Without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5
Score = 8

With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5
Score = 13.25

Column score (fraction of sequences that agree): 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
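The column scoring above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the rule that gap characters never count as agreeing residues is an assumption, chosen because it reproduces the 13.25 quoted for the gapped alignment.

```python
from collections import Counter

def column_score(column):
    """Fraction of rows sharing the most common residue; a lone residue scores 0."""
    residues = [c for c in column if c != "-"]  # assumption: gaps never count as a match
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return top / len(column) if top > 1 else 0.0

def alignment_score(rows):
    """Sum of column scores for an alignment whose rows all have the same length."""
    return sum(column_score(col) for col in zip(*rows))

gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(alignment_score(gapped))
```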
![Page 6: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/6.jpg)
Alignment of 3 sequences:
Complexity: length A × length B × length C
Aligning 100 proteins, 1000 amino acids each
Complexity: 10^300 table cells
Calculation time: beyond the big bang!
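A quick check of that arithmetic, as a sketch: the table has one axis per sequence, so the cell count is the product of the lengths. The helper name is hypothetical, and the extra boundary row and column per axis are ignored, as on the slide.

```python
def dp_table_cells(lengths):
    """Number of cells in a dynamic-programming table with one axis per sequence."""
    cells = 1
    for n in lengths:
        cells *= n
    return cells

three_sequences = dp_table_cells([1000, 1000, 1000])   # 10^9 cells
hundred_sequences = dp_table_cells([1000] * 100)       # 1000^100 = 10^300 cells
print(len(str(hundred_sequences)) - 1)                 # number of zeros after the leading 1
```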
![Page 7: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/7.jpg)
Feasible Approach
• Based on pairwise alignment scores
– Build n by n table of pairwise scores
• Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
Progressive alignment (Feng & Doolittle).
![Page 8: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/8.jpg)
– For n sequences, there are n(n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
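A minimal sketch of the first progressive step: score every pair, then pick the most similar pair to align first. The scoring scheme (match +1, mismatch -1, linear gap -1) and all names are assumptions for illustration; CLUSTAL itself uses its own scoring and a guide tree.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

seqs = [
    "GTCGTAGTCGGCTCGAC",
    "GTCTAGCGAGCGTGAT",
    "GCGAAGAGGCGAGC",
    "GCCGTCGCGTCGTAAC",
]

# One entry per unordered pair: n(n-1)/2 = 6 entries for n = 4.
pair_scores = {(i, j): nw_score(seqs[i], seqs[j])
               for i, j in combinations(range(len(seqs)), 2)}
first_pair = max(pair_scores, key=pair_scores.get)  # most similar pair, aligned first
print(pair_scores, first_pair)
```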
![Page 9: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/9.jpg)
1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C
![Page 10: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/10.jpg)
CLUSTAL method
• Higgins and Sharp 1988
– ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
Applies Progressive Sequence Alignment
![Page 11: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/11.jpg)
Treating Gaps in CLUSTAL
• Penalty for opening gaps and additional penalty for extending the gap
• Gaps found in initial alignment remain fixed
• New gaps are introduced as more sequences are added (decreased penalty if gap exists)
![Page 12: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/12.jpg)
Other MSA Approaches
• Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
• Iterative approach: Repeatedly realign subsets of sequences. MultAlin, DiAlign.
• Statistical Methods: Hidden Markov Models (only for proteins). SAM2K, MUSCLE
• Genetic algorithm: SAGA
![Page 13: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/13.jpg)
Links to commonly used MSA tools
• CLUSTALW http://www.ebi.ac.uk/Tools/clustalw2/
• T-COFFEE http://www.ebi.ac.uk/t-coffee/
• MUSCLE http://www.ebi.ac.uk/muscle/
• MAFFT http://www.ebi.ac.uk/mafft/
• Kalign http://www.ebi.ac.uk/kalign/
![Page 14: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/14.jpg)
CAUTION !!! Different tools may give different results
![Page 15: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/15.jpg)
Example: 7 different alignment tools produced 6 different estimated evolution trees
Wong et al., Science 319, January 2008
![Page 16: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/16.jpg)
Motifs
• Motifs represent a short common sequence
– Regulatory motifs (TF binding sites)
– Functional site in proteins (DNA binding motif)
![Page 17: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/17.jpg)
DNA Regulatory Motifs
• Transcription Factors bind to regulatory motifs
– TF binding motifs are usually 6 – 20 nucleotides long
– Usually located near target gene, mostly upstream
[Figure: MCM1 and SBF bound to the MCM1 motif and the SBF motif, upstream of the Transcription Start Site of Gene X]
![Page 18: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/18.jpg)
E. coli promoter sequences
![Page 19: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/19.jpg)
Challenges
• How to recognize a regulatory motif?
• Can we identify new occurrences of known motifs in genome sequences?
• Can we discover new motifs within upstream sequences of genes?
![Page 20: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/20.jpg)
1. Motif Representation
• Exact motif: CGGATATA
• Consensus: represent only deterministic nucleotides.
– Example: HAP1 binding sites in 5 sequences.
– Consensus motif: CGGNNNTANCGG
– N stands for any nucleotide.
• Representing only consensus loses information. How can this be avoided?
CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
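The consensus line above can be derived mechanically: keep a nucleotide where all sites agree, otherwise write N. A small sketch (the function name is illustrative only):

```python
def consensus(sites):
    """Keep a nucleotide where every aligned site agrees; write N elsewhere."""
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*sites))

hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Note what is lost: position 9 is A in three of the five sites, but the consensus records only N.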
![Page 21: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/21.jpg)
Representing the motif as a profile
-35 box: TTGACA; -10 box: TATAAT (positions relative to the transcription start site)
Based on ~450 known promoters

-35 box
1 2 3 4 5 6
A 0.1 0.1 0.1 0.5 0.2 0.5
T 0.7 0.7 0.2 0.2 0.2 0.2
G 0.1 0.1 0.5 0.1 0.1 0.2
C 0.1 0.1 0.2 0.2 0.5 0.1

-10 box
1 2 3 4 5 6
A 0.1 0.7 0.2 0.6 0.5 0.1
T 0.7 0.1 0.5 0.2 0.2 0.8
G 0.1 0.1 0.1 0.1 0.1 0.0
C 0.1 0.1 0.2 0.1 0.1 0.1
![Page 22: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/22.jpg)
1 2 3 4 5
A 10 25 5 70 60
C 30 25 80 10 15
T 50 25 5 10 5
G 10 25 10 10 20
PSPM – Position Specific Probability Matrix
• Represents a motif of length k (5)
• Count the number of occurrences of each nucleotide in each position
![Page 23: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/23.jpg)
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
PSPM – Position Specific Probability Matrix
• Defines Pi(n), n ∈ {A,C,G,T}, for i = 1, ..., k.
– Pi(A) – frequency of nucleotide A in position i.
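Going from the count matrix to the PSPM is a per-column normalisation. A minimal sketch using the counts from the earlier slide (the dictionary layout and function name are assumptions made for this example):

```python
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def counts_to_pspm(counts):
    """Divide each count by its column total so that every column sums to 1."""
    totals = [sum(col) for col in zip(*counts.values())]
    return {n: [c / t for c, t in zip(row, totals)] for n, row in counts.items()}

pspm = counts_to_pspm(counts)
print(pspm["A"])  # frequencies of A at positions 1..5
```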
![Page 24: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/24.jpg)
Identification of Known Motifs within Genomic Sequences
• Motivation:
– Identification of new genes controlled by the same TF.
– Infer the function of these genes.
– Enable better understanding of the regulation mechanism.
![Page 25: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/25.jpg)
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
PSPM – Position Specific Probability Matrix
• Each k-mer is assigned a probability.
– Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014
![Page 26: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/26.jpg)
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
Detecting a Known Motif within a Sequence using PSPM
• The PSPM is moved along the query sequence.
• At each position the sub-sequence is scored for a match to the PSPM.
• Example: sequence = ATGCAAGTCT…
![Page 28: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/28.jpg)
• The PSPM is moved along the query sequence.
• At each position the sub-sequence is scored for a match to the PSPM.
• Example: sequence = ATGCAAGTCT…
• Position 1: ATGCA
0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
• Position 2: TGCAA
0.5*0.25*0.8*0.7*0.6 = 0.042
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
Detecting a Known Motif within a Sequence using PSPM
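The sliding-window scan can be sketched as follows. Only the ten nucleotides visible on the slide are scanned (the sequence continues beyond them), and the function names are illustrative, not part of any tool.

```python
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def kmer_probability(kmer, pspm):
    """Product of the per-position probabilities of the k-mer's nucleotides."""
    p = 1.0
    for i, n in enumerate(kmer):
        p *= pspm[n][i]
    return p

def scan(sequence, pspm):
    """Score every window of motif length; positions are 1-based as on the slides."""
    k = len(next(iter(pspm.values())))
    return [(i + 1, sequence[i:i + k], kmer_probability(sequence[i:i + k], pspm))
            for i in range(len(sequence) - k + 1)]

hits = scan("ATGCAAGTCT", PSPM)        # visible prefix of the example sequence only
best = max(hits, key=lambda h: h[2])   # window with the highest probability
print(best)
```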
![Page 29: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/29.jpg)
Detecting a Known Motif within a Sequence using PSSM
Is it a random match, or is it indeed an occurrence of the motif?
PSPM -> PSSM (Position Specific Scoring Matrix)
– odds score: Oi(n), where n ∈ {A,C,G,T}, for i = 1, ..., k
– defined as Pi(n)/P(n), where P(n) is the background frequency of nucleotide n.
Oi(n) increases => higher odds that n at position i is part of a real motif.
![Page 30: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/30.jpg)
PSSM as Odds Score Matrix
• Assumption: the background frequency of each nucleotide is 0.25.
1. Original PSPM (Pi):
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
2. Odds Matrix (Oi):
1 2 3 4 5
A 0.4 1 0.2 2.8 2.4
3. Going to log scale we get an additive score, Log odds Matrix (log2 Oi):
1 2 3 4 5
A -1.322 0 -2.322 1.485 1.263
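The three steps (probability, odds, log odds) can be written as one small conversion. This is a sketch under the slide's assumption of a uniform background of 0.25; the function name is made up, and a probability of exactly 0 would need special handling because log2(0) is undefined.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def pspm_to_pssm(pspm, background=0.25):
    """Log-odds matrix: log2 of (position probability / background frequency)."""
    return {n: [math.log2(p / background) for p in row] for n, row in pspm.items()}

PSSM = pspm_to_pssm(PSPM)
print([round(x, 3) for x in PSSM["A"]])
```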
![Page 31: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/31.jpg)
1 2 3 4 5
A -1.32 0 -2.32 1.48 1.26
C 0.26 0 1.68 -1.32 -0.74
T 1 0 -2.32 -1.32 -2.32
G -1.32 0 -1.32 -1.32 -0.32
Calculating using Log Odds Matrix
• Log odds = 0 implies random match; log odds > 0 implies real match (?).
• Example: sequence = ATGCAAGTCT…
• Position 1: ATGCA
-1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7; odds = 2^-2.7 = 0.15
• Position 2: TGCAA
1 + 0 + 1.68 + 1.48 + 1.26 = 5.42; odds = 2^5.42 = 42.8
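Because the scores are logarithms, scoring a window becomes a sum instead of a product. A sketch using the rounded log-odds values from the matrix above (function names are illustrative):

```python
LOG_ODDS = {
    "A": [-1.32, 0, -2.32, 1.48, 1.26],
    "C": [0.26, 0, 1.68, -1.32, -0.74],
    "T": [1, 0, -2.32, -1.32, -2.32],
    "G": [-1.32, 0, -1.32, -1.32, -0.32],
}

def log_odds_score(kmer, pssm):
    """Additive score: sum of the log-odds entries for the k-mer's nucleotides."""
    return sum(pssm[n][i] for i, n in enumerate(kmer))

def odds(kmer, pssm):
    """Convert the additive log2 score back to an odds ratio."""
    return 2 ** log_odds_score(kmer, pssm)

for kmer in ("ATGCA", "TGCAA", "GCAAG"):
    print(kmer, round(log_odds_score(kmer, LOG_ODDS), 2), round(odds(kmer, LOG_ODDS), 2))
```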
![Page 35: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/35.jpg)
Calculating the probability of a match
ATGCAAG
• Position 1 ATGCA = 0.15
• Position 2 TGCAA = 42.8
• Position 3 GCAAG = 0.18
P(i) = S(i) / ∑ S, where S(i) is the odds score at position i.
Example: 0.15 / (0.15 + 42.8 + 0.18) = 0.003
P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
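The normalisation step divides each odds score by the sum over all positions, so the results add up to 1. A minimal sketch with the three odds scores for ATGCAAG (the function name is hypothetical):

```python
def normalise(scores):
    """Turn per-position odds scores into probabilities that sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

odds_scores = [0.15, 42.8, 0.18]      # positions 1..3 of ATGCAAG
match_probs = normalise(odds_scores)
most_likely = match_probs.index(max(match_probs)) + 1   # 1-based position
print(most_likely, match_probs)
```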
![Page 36: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/36.jpg)
Building a PSSM
• Collect all known sequences that bind a certain TF.
• Align all sequences (using multiple sequence alignment).
• Compute the frequency of each nucleotide in each position (PSPM).
• Incorporate background frequency for each nucleotide (PSSM).
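The four steps can be sketched end to end, starting from sites that are already aligned. The HAP1 sites are reused as toy input. The pseudocount of 1 is an addition not on the slides, assumed here so that a nucleotide never seen at a position does not produce log2(0); all names are illustrative.

```python
import math

def build_pssm(aligned_sites, background=0.25, pseudocount=1.0):
    """Return (PSPM, PSSM) computed column by column from equal-length aligned sites."""
    alphabet = "ACGT"
    pspm = {n: [] for n in alphabet}
    pssm = {n: [] for n in alphabet}
    for column in zip(*aligned_sites):
        total = len(column) + pseudocount * len(alphabet)
        for n in alphabet:
            p = (column.count(n) + pseudocount) / total   # smoothed frequency
            pspm[n].append(p)
            pssm[n].append(math.log2(p / background))     # log odds vs. background
    return pspm, pssm

sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
site_pspm, site_pssm = build_pssm(sites)
print(round(site_pssm["C"][0], 2), round(site_pssm["A"][0], 2))
```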
![Page 37: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/37.jpg)
Graphical Representation – Sequence Logo
• Horizontal axis: position of the base in the sequence.
• Vertical axis: amount of information (bits).
• Letter stack: order indicates importance.
• Letter height: indicates frequency.
• Consensus can be read across the top of the letter columns.
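The information plotted on the vertical axis can be estimated per position as 2 bits minus the entropy of that column. This sketch assumes the standard definition for DNA without any small-sample correction; the function name and the example PSPM are only for illustration.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def information_bits(column_probs):
    """Information content of one DNA position: 2 bits minus its Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column_probs if p > 0)
    return 2.0 - entropy

info = [information_bits(col) for col in zip(*PSPM.values())]
print([round(x, 3) for x in info])
```

A column with four equally likely nucleotides carries 0 bits, so it would be nearly invisible in the logo.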
![Page 39: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/39.jpg)
WebLogo - Output
[Figures: sequence logos for genes and for proteins]
![Page 40: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/40.jpg)
Finding new Motifs
• We are given a group of genes, which presumably contain a common regulatory motif.
• We know nothing of the TF that binds to the putative motif.
• The problem: discover the motif.
![Page 41: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/41.jpg)
Example
Predicting the cAMP Receptor Protein (CRP) binding site motif
![Page 42: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/42.jpg)
Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA
![Page 43: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/43.jpg)
Create a Multiple Sequence Alignment
GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA
![Page 44: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/44.jpg)
XXXXXTGTGAXXXXAXTCACAXXXXXXXXXXXXACACTXXXXTXGATGTXXXXXXX
Generate a PSSM
![Page 45: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/45.jpg)
PROBLEMS…
• When searching for a motif in a genome using PSSM or other methods – the motif is usually found all over the place
->The motif is considered real if found in the vicinity of a gene.
• Checking experimentally for the binding sites of a specific TF (location analysis) – the sites that bind the motif are in some cases similar to the PSSM and sometimes not!
![Page 46: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/46.jpg)
Computational Methods
• This problem has received a lot of attention from CS people.
• Methods include:
– Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
– Enumeration methods – problematic for inexact motifs of length k>10. …
• Current status: Problem is still open.
![Page 47: From Pairwise to Multiple Alignment](https://reader036.vdocuments.site/reader036/viewer/2022062518/56814453550346895db0ef96/html5/thumbnails/47.jpg)
Tools on the Web
• MEME – Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/
• metaMEME – Uses HMM method. http://meme.sdsc.edu/meme
• MAST – Motif Alignment and Search Tool. http://meme.sdsc.edu/meme
• TRANSFAC – database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/
• eMotif – scans, makes and searches for motifs at the protein level. http://motif.stanford.edu/emotif/