From Pairwise to Multiple Alignment

WHAT'S TODAY?

• Multiple Sequence Alignment: CLUSTAL

• Motif search

Multiple Sequence Alignment

MSA

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Like pairwise alignment BUT compare n sequences instead of 2

Rows represent individual sequences

Columns represent ‘same’ position

Gaps allowed in all sequences

How to find the best MSA

Without gaps:

GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC

1*1 + 2*0.75 + 11*0.5
Score = 8

With gaps:

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

4*1 + 11*0.75 + 2*0.5
Score = 13.25

Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
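The column scoring rule can be sketched in Python. This is a minimal sketch, not the lecture's own code: it assumes that a gap never counts as the most common base, and that unaligned sequences are scored only over the columns every sequence reaches. Both assumptions are chosen because they reproduce the two scores on the slide; the function names are illustrative.

```python
from collections import Counter


def column_score(column):
    """Fraction of rows sharing the most common base: 4/4 -> 1, 3/4 -> 0.75,
    2/4 -> 0.5, and a base seen only once -> 0."""
    # Assumption: '-' is not a base, so a column dominated by gaps scores 0.
    counts = Counter(base for base in column if base != "-")
    top = max(counts.values(), default=0)
    return top / len(column) if top >= 2 else 0.0


def alignment_score(rows):
    # zip() stops at the shortest row, so ragged (unaligned) input is scored
    # only over the columns that every sequence reaches.
    return sum(column_score(col) for col in zip(*rows))


ungapped = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
            "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
gapped = ["GTCGTAGTCG-GC-TCGAC", "GTC-TAG-CGAGCGT-GAT",
          "GC-GAAG-AG-GCG-AG-C", "GCCGTCG-CG-TCGTA-AC"]

print(alignment_score(ungapped))  # 8.0
print(alignment_score(gapped))    # 13.25
```

Inserting gaps raises the score because more columns end up with three or four identical bases.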

Alignment of 3 sequences:

Complexity: length A × length B × length C

Aligning 100 proteins, 1000 amino acids each:
Complexity: 10^300 table cells
Calculation time: beyond the big bang!
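The growth of the dynamic programming table can be checked with a few lines of Python. This is only a sketch of the arithmetic: it uses the slide's approximation of one table dimension per sequence (ignoring the extra boundary row in each dimension), and the three example lengths are made up.

```python
import math


def dp_table_cells(lengths):
    # One table dimension per sequence, so the cell count is the product
    # of the sequence lengths.
    return math.prod(lengths)


print(dp_table_cells([100, 120, 90]))   # 1080000 cells for three short sequences
cells = dp_table_cells([1000] * 100)    # 100 proteins of 1000 residues each
print(cells == 10 ** 300)               # True
```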

Feasible Approach

• Based on pairwise alignment scores
– Build n by n table of pairwise scores
– For n sequences, there are n(n-1)/2 pairs

• Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences

Progressive alignment (Feng & Doolittle).

Pairs aligned first (1 with 2, 3 with 4):

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

Aligned pairs merged:

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C
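The first step of the progressive approach, building the table of pairwise scores and picking the most similar pair, can be sketched as follows. This is a minimal illustration, not Feng and Doolittle's actual scoring: the pairwise score is a plain Needleman-Wunsch global alignment score, and the match, mismatch and gap values are assumptions.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, keeping one table row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    # Upper triangle of the n-by-n table: n(n-1)/2 entries.
    return {(i, j): nw_score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


seqs = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
        "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
scores = pairwise_scores(seqs)
first_pair = max(scores, key=scores.get)   # the pair to align first
print(len(scores), first_pair)
```

For the four sequences here the table has 4*3/2 = 6 entries. A full progressive aligner would then treat the aligned pair as a single sequence and repeat.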

CLUSTAL method

• Higgins and Sharp 1988
– ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
[Medline] http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=3243435&dopt=Abstract

An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one

Applies Progressive Sequence Alignment

Treating Gaps in CLUSTAL

• Penalty for opening gaps and additional penalty for extending the gap

• Gaps found in initial alignment remain fixed

• New gaps are introduced as more sequences are added (decreased penalty if gap exists)
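The opening and extension penalties can be illustrated with a small Python sketch. The numeric values and the convention that the first gap column costs only the opening penalty are assumptions for illustration, not CLUSTAL's actual defaults, and the reduced penalty at positions where a gap already exists is not modelled.

```python
import re


def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty: pay once to open, then a smaller amount per extra column."""
    if length <= 0:
        return 0.0
    # Assumed convention: the first gap column costs gap_open only.
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned_row, **kwargs):
    # Sum the penalty over every run of '-' in one aligned sequence.
    return sum(gap_penalty(len(run), **kwargs)
               for run in re.findall(r"-+", aligned_row))


print(gap_penalty(1), gap_penalty(4))    # 10.0 11.5
print(total_gap_penalty("GTC--TA-G"))    # 20.5
```

With these values one gap of length 4 (11.5) is far cheaper than four separate gaps of length 1 (40.0), which is why affine penalties favour a few long gaps over many short ones.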

Other MSA Approaches

• Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE

• Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign.

• Statistical methods: Hidden Markov Models (only for proteins). SAM2K, MUSCLE

• Genetic algorithm: SAGA

Links to commonly used MSA tools

CLUSTALW http://www.ebi.ac.uk/Tools/clustalw2/
T-COFFEE http://www.ebi.ac.uk/t-coffee/
MUSCLE http://www.ebi.ac.uk/muscle/
MAFFT http://www.ebi.ac.uk/mafft/
Kalign http://www.ebi.ac.uk/kalign/

CAUTION !!! Different tools may give different results

Example: 7 different alignment tools produced 6 different estimated evolutionary trees

Wong et al., Science 319, January 2008

Motifs

• Motifs represent a short common sequence
– Regulatory motifs (TF binding sites)
– Functional site in proteins (DNA binding motif)

DNA Regulatory Motifs

• Transcription Factors bind to regulatory motifs
– TF binding motifs are usually 6 – 20 nucleotides long
– Usually located near target gene, mostly upstream

[Figure: SBF and MCM1 motifs upstream of the Transcription Start Site of Gene X]

E. coli promoter sequences

Challenges

• How to recognize a regulatory motif?

• Can we identify new occurrences of known motifs in genome sequences?

• Can we discover new motifs within upstream sequences of genes?

1. Motif Representation

• Exact motif: CGGATATA
• Consensus: represent only deterministic nucleotides.
– Example: HAP1 binding sites in 5 sequences.
• consensus motif: CGGNNNTANCGG
• N stands for any nucleotide.

• Representing only consensus loses information. How can this be avoided?

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
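Deriving the consensus from the five HAP1 sites can be sketched in a few lines of Python. The sketch assumes the strict rule used in the example: a position keeps its base only if all sites agree, otherwise it becomes N. The function name is illustrative.

```python
def consensus(sites):
    # Keep a base only where every aligned site agrees; otherwise write N.
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*sites))


hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Position 9 shows the information loss: A occurs in three of the five sites, but the consensus records only N.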

[Figure: E. coli promoter with the -35 box (TTGACA) and the -10 box (TATAAT) upstream of the transcription start site]

Representing the motif as a profile

Based on ~450 known promoters

-35 box:

     1    2    3    4    5    6
A   0.1  0.1  0.1  0.5  0.2  0.5
T   0.7  0.7  0.2  0.2  0.2  0.2
G   0.1  0.1  0.5  0.1  0.1  0.2
C   0.1  0.1  0.2  0.2  0.5  0.1

-10 box:

     1    2    3    4    5    6
A   0.1  0.7  0.2  0.6  0.5  0.1
T   0.7  0.1  0.5  0.2  0.2  0.8
G   0.1  0.1  0.1  0.1  0.1  0.0
C   0.1  0.1  0.2  0.1  0.1  0.1

PSPM – Position Specific Probability Matrix

• Represents a motif of length k (5)
• Count the number of occurrences of each nucleotide in each position

     1     2     3     4     5
A   10    25     5    70    60
C   30    25    80    10    15
T   50    25     5    10     5
G   10    25    10    10    20

     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2

• Defines Pi{A,C,G,T} for i={1,..,k}.
– Pi (A) – frequency of nucleotide A in position i.
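Turning the count table into a PSPM is a per-column normalisation. A minimal Python sketch, using the counts from the slide and an illustrative function name:

```python
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}


def counts_to_pspm(counts):
    k = len(next(iter(counts.values())))
    # Total number of observations in each motif position (column).
    totals = [sum(row[i] for row in counts.values()) for i in range(k)]
    return {base: [row[i] / totals[i] for i in range(k)]
            for base, row in counts.items()}


pspm = counts_to_pspm(counts)
print(pspm["A"])  # [0.1, 0.25, 0.05, 0.7, 0.6]
```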

Identification of Known Motifs within Genomic Sequences

• Motivation:
– identification of new genes controlled by the same TF.
– Infer the function of these genes.
– enable better understanding of the regulation mechanism.

     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2

• Each k-mer is assigned a probability.
– Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014

Detecting a Known Motif within a Sequence using PSPM

• The PSPM is moved along the query sequence.
• At each position the sub-sequence is scored for a match to the PSPM.
• Example: sequence = ATGCAAGTCT…
• Position 1: ATGCA
0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
• Position 2: TGCAA
0.5*0.25*0.8*0.7*0.6 = 0.042

     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2
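Sliding the PSPM along a sequence can be sketched in Python as below. This is a minimal sketch with illustrative function names; the matrix is the one from the slide and positions are reported 1-based to match the example.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}


def kmer_probability(kmer, pspm=PSPM):
    # Product of the per-position probabilities of the observed bases.
    return math.prod(pspm[base][i] for i, base in enumerate(kmer))


def scan(sequence, pspm=PSPM):
    k = len(pspm["A"])
    # One (1-based position, k-mer, probability) tuple per window.
    return [(pos + 1, sequence[pos:pos + k],
             kmer_probability(sequence[pos:pos + k], pspm))
            for pos in range(len(sequence) - k + 1)]


for position, kmer, p in scan("ATGCAAGTCT"):
    print(position, kmer, f"{p:.2e}")
# The first two lines printed are:
# 1 ATGCA 1.50e-04
# 2 TGCAA 4.20e-02
```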


Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?

PSPM -> PSSM (Position Specific Scoring Matrix)
– odds score: Oi(n), where n ∈ {A,C,G,T} for i={1,..,k}
– defined as Pi(n)/P(n), where P(n) is background frequency.

Oi(n) increases => higher odds that n at position i is part of a real motif.

PSSM as Odds Score Matrix

• Assumption: the background frequency of each nucleotide is 0.25.

1. Original PSPM (Pi):

      1       2      3       4       5
A    0.1     0.25   0.05    0.7     0.6

2. Odds Matrix (Oi):

      1       2      3       4       5
A    0.4     1      0.2     2.8     2.4

3. Going to log scale we get an additive score, Log odds Matrix (log2 Oi):

      1       2      3       4       5
A   -1.322   0     -2.322   1.485   1.263

Calculating using Log Odds Matrix

      1      2      3      4      5
A   -1.32   0     -2.32   1.48   1.26
C    0.26   0      1.68  -1.32  -0.74
T    1      0     -2.32  -1.32  -2.32
G   -1.32   0     -1.32  -1.32  -0.32

• Log odds = 0 implies random match; log odds > 0 implies real match (?).
• Example: sequence = ATGCAAGTCT…
• Position 1: ATGCA
-1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7
odds = 2^-2.7 = 0.15
• Position 2: TGCAA
1 + 0 + 1.68 + 1.48 + 1.26 = 5.42
odds = 2^5.42 = 42.8
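The conversion from PSPM to a log-odds matrix and the additive scoring can be sketched in Python. This is a minimal sketch assuming the uniform background of 0.25 stated on the slide; the function names are illustrative. It works here because the matrix has no zero entries, for which log2 would be undefined.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}
BACKGROUND = 0.25  # assumed identical for all four nucleotides


def log_odds_matrix(pspm, background=BACKGROUND):
    # log2 of the odds Pi(n) / P(n) for every base and position.
    return {base: [math.log2(p / background) for p in row]
            for base, row in pspm.items()}


def log_odds_score(kmer, pssm):
    # On the log scale the per-position scores simply add up.
    return sum(pssm[base][i] for i, base in enumerate(kmer))


PSSM = log_odds_matrix(PSPM)
for kmer in ("ATGCA", "TGCAA"):
    s = log_odds_score(kmer, PSSM)
    print(kmer, round(s, 2), round(2 ** s, 2))
```

Because the code sums unrounded matrix entries, its results differ slightly from the hand calculation, which adds values already rounded to two decimals: for TGCAA it gives about 5.43 and odds of about 43.0 rather than 5.42 and 42.8.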

Calculating the probability of a match

ATGCAAG

• Position 1 ATGCA = 0.15
• Position 2 TGCAA = 42.8
• Position 3 GCAAG = 0.18

P(i) = S(i) / ∑ S
Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003

P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
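The normalisation P(i) = S / ∑S amounts to dividing each odds score by the sum over all positions. A minimal Python sketch using the three odds values from the example (the function name is illustrative):

```python
def normalize(scores):
    # Divide every score by the total so that the results sum to 1.
    total = sum(scores)
    return [s / total for s in scores]


odds = [0.15, 42.8, 0.18]   # positions 1 to 3 of ATGCAAG
print([round(p, 3) for p in normalize(odds)])  # [0.003, 0.992, 0.004]
```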

Building a PSSM

• Collect all known sequences that bind a certain TF.

• Align all sequences (using multiple sequence alignment).

• Compute the frequency of each nucleotide in each position (PSPM).

• Incorporate background frequency for each nucleotide (PSSM).
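The last two steps can be sketched end to end in Python, starting from sites that are already aligned. This is a sketch under stated assumptions: the alignment step is taken as done, the background is uniform (0.25), and a pseudocount of 1 is added to every count so that bases never seen at a position do not produce log2(0). The pseudocount is my addition and is not part of the slide. The HAP1 sites from earlier in the lecture serve as input.

```python
import math
from collections import Counter


def build_pssm(aligned_sites, background=None, pseudocount=1.0):
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(aligned_sites)
    pssm = {b: [] for b in "ACGT"}
    for column in zip(*aligned_sites):
        counts = Counter(column)
        for b in "ACGT":
            # Frequency with pseudocount (assumption), then log2 odds
            # against the background frequency.
            p = (counts[b] + pseudocount) / (n + 4 * pseudocount)
            pssm[b].append(math.log2(p / background[b]))
    return pssm


sites = ["CGGATATACCGG", "CGGTGATAGCGG", "CGGTACTAACGG",
         "CGGCGGTAACGG", "CGGCCCTAACGG"]
pssm = build_pssm(sites)
print(round(pssm["C"][0], 2))  # 1.42
```

Position 1 is C in all five sites, so with the pseudocount its frequency is 6/9 and its log-odds score is log2(8/3), about 1.42.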

Graphical Representation – Sequence Logo

• Horizontal axis: position of the base in the sequence.

• Vertical axis: amount of information (bits).

• Letter stack: order indicates importance.

• Letter height: indicates frequency.

• Consensus can be read across the top of the letter columns.
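The quantities drawn in a logo can be sketched in Python. This sketch assumes the usual definition for DNA, where the information at a position is 2 bits minus the entropy of its base frequencies and each letter's height is its frequency times that information. It omits the small-sample correction that logo tools typically apply, and the function names are illustrative.

```python
import math


def column_information(probabilities):
    """Bits of information at one DNA position: 2 - entropy."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 - entropy


def letter_heights(column):
    # Each letter takes a share of the stack proportional to its frequency.
    info = column_information(column.values())
    return {base: p * info for base, p in column.items()}


print(column_information([0.25, 0.25, 0.25, 0.25]))  # 0.0
print(column_information([1.0, 0.0, 0.0, 0.0]))      # 2.0
print(letter_heights({"A": 0.05, "C": 0.8, "T": 0.05, "G": 0.1}))
```

A position where all four bases are equally likely carries no information and disappears from the logo, while a fully conserved position reaches the maximum of 2 bits.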

• http://weblogo.berkeley.edu

WebLogo - Input

http://meme.sdsc.edu/

WebLogo - Output

[Figures: example sequence logos for genes and for proteins]

Finding new Motifs

• We are given a group of genes, which presumably contain a common regulatory motif.

• We know nothing of the TF that binds to the putative motif.

• The problem: discover the motif.

Example

Predicting the cAMP Receptor Protein (CRP) binding site motif

Extract experimentally defined CRP Binding Sites

GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA

Create a Multiple Sequence Alignment

GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA

XXXXXTGTGAXXXXAXTCACAXXXXXXXXXXXXACACTXXXXTXGATGTXXXXXXX

Generate a PSSM

PROBLEMS…

• When searching for a motif in a genome using PSSM or other methods – the motif is usually found all over the place

->The motif is considered real if found in the vicinity of a gene.

• Checking experimentally for the binding sites of a specific TF (location analysis) – the sites that bind the motif are in some cases similar to the PSSM and sometimes not!

Computational Methods

• This problem has received a lot of attention from CS people.
• Methods include:
– Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
– Enumeration methods – problematic for inexact motifs of length k>10.
…

• Current status: Problem is still open.
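The enumeration idea can be illustrated with a deliberately simple Python sketch: list every exact k-mer and count how many of the input sequences contain it. This is not any published tool's algorithm, it handles only exact matches, and the three upstream sequences are made-up toy data with TGTGA planted in each.

```python
from collections import Counter


def most_shared_kmers(sequences, k, top=3):
    # For each k-mer, count how many sequences contain it at least once.
    support = Counter()
    for seq in sequences:
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support.most_common(top)


upstream = [
    "ACGTTGTGACCA",
    "TTGTGAGGCATC",
    "GGCATGTGATTA",
]
print(most_shared_kmers(upstream, k=5, top=1))  # [('TGTGA', 3)]
```

Allowing mismatches means checking every k-mer against its whole neighbourhood of similar k-mers. With 4^k candidates the search grows quickly, which is why enumeration becomes problematic for inexact motifs of length k>10.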

Tools on the Web

• MEME – Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/

• metaMEME - Uses HMM method. http://meme.sdsc.edu/meme

• MAST - Motif Alignment and Search Tool. http://meme.sdsc.edu/meme

• TRANSFAC - database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/

• eMotif - allows scanning, making and searching for motifs at the protein level. http://motif.stanford.edu/emotif/
