Sequence information, logos and hidden Markov models
DESCRIPTION
Sequence information, logos and Hidden Markov Models. Morten Nielsen, CBS, BioCentrum, DTU. (Transcript of a PowerPoint presentation.)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
Sequence information, logos and Hidden Markov Models
Morten Nielsen, CBS, BioCentrum, DTU
Information content
• Information and entropy
  – Conserved amino acid regions contain a high degree of information (high order == low entropy)
  – Variable amino acid regions contain a low degree of information (low order == high entropy)
• Shannon information: D = log2(N) + Σi pi log2(pi) (for proteins N = 20, for DNA N = 4)
• Conserved residue: pA = 1, pi≠A = 0, so D = log2(N) (= 4.3 bits for proteins)
• Variable region: pA = 0.05, pC = 0.05, …, so D = 0
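The two limiting cases above can be checked directly; a minimal sketch in Python (the function name is ours, not from the slides):

```python
import math

def shannon_information(p, n=20):
    """Shannon information D = log2(N) + sum_i p_i * log2(p_i)
    for one alignment position (N = 20 for proteins, 4 for DNA).
    Terms with p_i = 0 contribute nothing (0 * log 0 == 0)."""
    entropy_term = sum(pi * math.log2(pi) for pi in p if pi > 0)
    return math.log2(n) + entropy_term

# Conserved position: pA = 1, all other pi = 0
print(shannon_information([1.0]))        # log2(20) ~ 4.3 bits

# Fully variable position: all 20 residues at p = 0.05
print(shannon_information([0.05] * 20))  # ~ 0 bits
```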
Sequence logo
• Height of a column equals the information content D
• Relative height of a letter equals its frequency pA
• Highly useful tool to visualize sequence motifs
[Figure: MHC class II logo built from 10 sequences; high-information positions stand out]
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
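Putting the two bullets together: in a logo column, the height of letter a is its frequency times the column information, pa × D. A small sketch with hypothetical counts:

```python
import math

def column_heights(counts, n=20):
    """Letter heights for one logo column: height_a = p_a * D,
    where D = log2(N) + sum_a p_a * log2(p_a) is the column's
    information content."""
    total = sum(counts.values())
    p = {a: c / total for a, c in counts.items()}
    d = math.log2(n) + sum(pi * math.log2(pi) for pi in p.values() if pi > 0)
    return {a: pi * d for a, pi in p.items()}

# A position where A dominates (hypothetical counts, 8 A's and 2 G's)
print(column_heights({"A": 8, "G": 2}))
```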
Sequence information
• Description of binding motif
• Example:
  pA = 6/10
  pG = 2/10
  pT = pK = 1/10
  pC = pD = … = pV = 0
• Problems
  – Few data
  – Data redundancy/duplication

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
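The example frequencies correspond to position 1 of the ten peptides listed above; raw counting reproduces them:

```python
from collections import Counter

peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT",
            "ALAKAAAAV", "GMNERPILT", "GILGFVFTM", "TLNAWVKVV",
            "KLNEPVLLL", "AVVPFIVSV"]

# Raw counting at the first motif position
counts = Counter(p[0] for p in peptides)
freqs = {a: c / len(peptides) for a, c in counts.items()}
print(freqs)   # {'A': 0.6, 'G': 0.2, 'T': 0.1, 'K': 0.1}
```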
Sequence information: raw sequence counting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Sequence weighting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
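The slides do not spell out the weighting scheme; one standard choice is Henikoff & Henikoff position-based weighting, sketched below (not necessarily what pep2mat implements):

```python
from collections import Counter

def henikoff_weights(seqs):
    """Henikoff & Henikoff (1994) position-based sequence weights:
    at each alignment column, a residue occurring s times among r
    distinct residues contributes 1/(r*s) to each sequence carrying
    it, so redundant sequences end up sharing weight."""
    weights = [0.0] * len(seqs)
    for column in zip(*seqs):
        counts = Counter(column)
        r = len(counts)
        for k, aa in enumerate(column):
            weights[k] += 1.0 / (r * counts[aa])
    total = sum(weights)
    return [w / total for w in weights]  # normalised to sum to 1

seqs = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT",
        "ALAKAAAAV", "GMNERPILT", "GILGFVFTM", "TLNAWVKVV",
        "KLNEPVLLL", "AVVPFIVSV"]
w = henikoff_weights(seqs)
# Each of the five near-identical ALAKAAAA* peptides gets less
# weight than the unrelated GMNERPILT
print(w[0] < w[5])   # True
```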
Pseudo counts
• Sequence weighting and pseudo counts
• Motif found on more data

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
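A common way to blend observed counts with pseudo counts is pa = (α·fa + β·ga)/(α + β). The sketch below uses a flat prior ga = 1/20 purely for illustration; real tools typically derive ga from BLOSUM substitution probabilities, and the exact scheme used by pep2mat is not shown here:

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_count_freqs(counts, alpha, beta=50.0):
    """Effective frequencies p_a = (alpha*f_a + beta*g_a)/(alpha+beta).
    alpha weights the observed frequencies f_a (often the effective
    number of sequences minus 1); g_a is the pseudo-count frequency,
    here simply a flat prior 1/20 for illustration."""
    total = sum(counts.values())
    g = 1.0 / len(ALPHABET)
    return {a: (alpha * counts.get(a, 0) / total + beta * g) / (alpha + beta)
            for a in ALPHABET}

# Position with raw counts A:6, G:2, T:1, K:1 from ten peptides
p = pseudo_count_freqs({"A": 6, "G": 2, "T": 1, "K": 1}, alpha=9)
print(round(p["A"], 3))   # pulled down from 0.60 toward the prior
print(round(p["C"], 3))   # no longer exactly zero
```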
…and now you
• cp files from /usr/opt/www/pub/CBS/researchgroups/immunology/intro/HMM/exercise
• Make weight matrix and logos using
  – pep2mat -swt 2 -wlc 0 data > mat
  – mat2logo mat
  – ghostview logo.ps
• Include sequence weighting
  – pep2mat -swt 0 -wlc 0 data > mat
  – make and view logo
  – Try the other sequence weighting scheme (clustering), -swt 1. What difference does this make?
• Include pseudo counts
  – pep2mat data > mat
  – make and view logo
Weight matrices
• Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts
• Construct a weight matrix as
  Wij = log(pij/qj)
• Here i is a position in the motif and j an amino acid; qj is the prior (background) frequency of amino acid j
• W is an L × 20 matrix, where L is the motif length
• Score a sequence against the weight matrix by looking up and adding L values from the matrix
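The construction and scoring steps above can be sketched directly from the definitions; a toy 4-letter alphabet is used for brevity (the probabilities are made up):

```python
import math

def make_weight_matrix(p, q):
    """W[i][a] = log(p_ia / q_a): log-odds of seeing letter a at
    motif position i versus its background frequency q_a."""
    return [{a: math.log(p_i[a] / q[a]) for a in p_i} for p_i in p]

def score(seq, w):
    """Score a sequence against an L x |alphabet| weight matrix by
    looking up and summing one value per position."""
    return sum(w[i][aa] for i, aa in enumerate(seq))

# Toy 2-position motif over a 4-letter alphabet (illustration only)
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
w = make_weight_matrix(p, q)
print(score("AT", w))   # consensus: high score
print(score("CA", w))   # off-motif: low score
```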
Weight matrix predictions
• Use the program seq2hmm to evaluate the prediction accuracy of your weight matrix
  – seq2hmm -hmm mat -xs eval.set | grep -v # | args 2,3 | xycorr
  – What is going on here?
• By leaving out the -xs option you can generate the scores at each position in the sequence. This is often useful for neural network training.
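The final xycorr step presumably reports the (Pearson) correlation between the two columns piped into it, i.e. predicted scores versus target values; a minimal version:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two score columns,
    as a tool like xycorr would presumably compute it."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0: perfect agreement
```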
MHC class II prediction
• Complexity of the problem
  – Peptides of different length
  – Weak motif signal
• Alignment is crucial
• Gibbs Monte Carlo sampler

DRB1*0401 peptides:
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE
Gibbs sampler algorithm

Alignment by Gibbs sampler:
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTI

E = Σi,j pij · log(pij/qj)

Maximize E using Monte Carlo:
• Random change in offset
• Random shift of box position
• Always accept moves to higher E
• Accept moves to lower E with decreasing probability
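The acceptance rule in the last two bullets is the Metropolis criterion used in simulated-annealing style Monte Carlo; a minimal sketch (function name and temperature parameter are ours):

```python
import math
import random

def accept_move(delta_e, t):
    """Metropolis criterion: moves that increase E are always
    accepted; moves that decrease E are accepted with probability
    exp(delta_E / T), which shrinks as the temperature T is lowered
    during the run."""
    return delta_e > 0 or random.random() < math.exp(delta_e / t)

print(accept_move(0.5, t=1.0))    # uphill: always True
print(accept_move(-2.0, t=1.0))   # downhill: True with prob. exp(-2) ~ 0.14
```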
Gibbs sampler exercise
• The file classII.fsa is a FASTA file containing 50 class II epitopes
• gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat classII.fsa
  – The options -iw and -w 1,0,0,1,0,1,0,0,1 increase the matrix weight on important anchor positions in the binding motif
  – Make and view logo
• Use the matrix to predict class II epitopes
  – cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr
  – Do you understand what is going on in this command?
Hidden Markov Models
• Weight matrices do not deal with insertions and deletions
• In alignments, this is handled in an ad hoc manner by optimizing two gap penalties: one for opening a gap and one for extending it
• An HMM is a natural framework in which insertions and deletions are dealt with explicitly
HMM (a simple example)

Core of alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

• Example from A. Krogh
• The core region defines the number of states in the HMM (shown in red on the original slide)
• Insertion and deletion statistics are derived from the non-core part of the alignment (black)
[Figure: HMM state diagram for the example, with match states emitting over {A, C, G, T} and an insert state; the transition probabilities (0.8, 0.2, 0.4, 0.6, 1.0) and emission probabilities shown in the figure are the values used in the calculations below]
HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches: A, 2×C, T, G
• 5 transitions in the gap region
  – C out, G out
  – A→C, C→T, T out
  – Out transition: 3/5
  – Stay transition: 2/5
ACA---ATG: 0.8 × 1 × 0.8 × 1 × 0.8 × 0.4 × 1 × 1 × 0.8 × 1 × 0.2 = 3.3 × 10⁻²
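Multiplying the transition and emission probabilities along the path confirms the number on the slide:

```python
from functools import reduce
from operator import mul

# Transition and emission probabilities along the path of ACA---ATG
# through the model, as listed on the slide
factors = [0.8, 1, 0.8, 1, 0.8, 0.4, 1, 1, 0.8, 1, 0.2]
p = reduce(mul, factors, 1.0)
print(p)   # ~ 0.032768, i.e. 3.3 x 10^-2
```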
Align sequence to HMM
ACA---ATG: 0.8 × 1 × 0.8 × 1 × 0.8 × 0.4 × 1 × 1 × 0.8 × 1 × 0.2 = 3.3 × 10⁻²
TCAACTATC: 0.2 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.2 × 0.4 × 0.4 × 0.4 × 0.2 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 = 0.0075 × 10⁻²
ACAC--AGC = 1.2 × 10⁻²
AGA---ATC = 3.3 × 10⁻²
ACCG--ATC = 0.59 × 10⁻²
Consensus:
ACAC--ATC = 4.7 × 10⁻², ACA---ATC = 13.1 × 10⁻²
Exceptional:
TGCT--AGG = 0.0023 × 10⁻²
Align sequence to HMM - Null model
• The score depends strongly on length
• The null model is a random model: a sequence of length L has probability 0.25^L
• Log-odds score for a sequence S: log(P(S)/0.25^L)
• A positive score means the sequence is more likely under the model than under the null model

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
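The log-odds scores can be reproduced from the path probabilities; matching the numbers on the slide requires taking the natural log and using the ungapped sequence length for L (an assumption inferred from the values, not stated on the slide):

```python
import math

def log_odds(p_model, length):
    """Log-odds of a sequence versus the null model, which gives any
    DNA sequence of length L probability 0.25**L. Natural log, with
    L the ungapped sequence length (assumptions that reproduce the
    scores on the slide)."""
    return math.log(p_model / 0.25 ** length)

# ACA---ATG: model probability 3.3e-2, six emitted nucleotides
print(round(log_odds(0.032768, 6), 1))   # 4.9
```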