Sequence information, logos and hidden Markov models
DESCRIPTION
Sequence information, logos and Hidden Markov Models. Morten Nielsen, CBS, BioCentrum, DTU. (Transcript of a PowerPoint presentation.)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
Sequence information, logos and Hidden Markov Models
Morten Nielsen, CBS, BioCentrum, DTU
Information content
• Information and entropy
  – Conserved amino acid regions contain a high degree of information (high order == low entropy)
  – Variable amino acid regions contain a low degree of information (low order == high entropy)
• Shannon information: D = log2(N) + Σi pi log2(pi) (for proteins N = 20, for DNA N = 4)
• Conserved residue: pA = 1, pi≠A = 0, so D = log2(N) (= 4.3 bits for proteins)
• Variable region: pA = 0.05, pC = 0.05, …, so D = 0
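The two limiting cases above can be checked directly; a minimal sketch in Python (the function name is ours, not from the slides):

```python
import math

def shannon_information(p, n=20):
    """Shannon information D = log2(N) + sum_i p_i * log2(p_i)
    for one alignment position (N = 20 for proteins, 4 for DNA).
    Terms with p_i = 0 contribute nothing (0 * log 0 == 0)."""
    entropy_term = sum(pi * math.log2(pi) for pi in p if pi > 0)
    return math.log2(n) + entropy_term

# Conserved position: pA = 1, all other pi = 0
print(shannon_information([1.0]))        # log2(20) ~ 4.3 bits

# Fully variable position: all 20 residues at p = 0.05
print(shannon_information([0.05] * 20))  # ~ 0 bits
```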
Sequence logo
• Height of a column equals the information content D
• Relative height of a letter equals its frequency pA
• Highly useful tool to visualize sequence motifs
[Figure: MHC class II logo built from 10 sequences; high-information positions stand out]
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
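Putting the two bullets together: in a logo column, the height of letter a is its frequency times the column information, pa × D. A small sketch with hypothetical counts:

```python
import math

def column_heights(counts, n=20):
    """Letter heights for one logo column: height_a = p_a * D,
    where D = log2(N) + sum_a p_a * log2(p_a) is the column's
    information content."""
    total = sum(counts.values())
    p = {a: c / total for a, c in counts.items()}
    d = math.log2(n) + sum(pi * math.log2(pi) for pi in p.values() if pi > 0)
    return {a: pi * d for a, pi in p.items()}

# A position where A dominates (hypothetical counts, 8 A's and 2 G's)
print(column_heights({"A": 8, "G": 2}))
```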
Sequence information
• Description of binding motif
• Example:
  pA = 6/10
  pG = 2/10
  pT = pK = 1/10
  pC = pD = … = pV = 0
• Problems
  – Few data
  – Data redundancy/duplication

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
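The example frequencies correspond to position 1 of the ten peptides listed above; raw counting reproduces them:

```python
from collections import Counter

peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT",
            "ALAKAAAAV", "GMNERPILT", "GILGFVFTM", "TLNAWVKVV",
            "KLNEPVLLL", "AVVPFIVSV"]

# Raw counting at the first motif position
counts = Counter(p[0] for p in peptides)
freqs = {a: c / len(peptides) for a, c in counts.items()}
print(freqs)   # {'A': 0.6, 'G': 0.2, 'T': 0.1, 'K': 0.1}
```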
Sequence information: raw sequence counting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Sequence weighting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
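The slides do not spell out the weighting scheme; one standard choice is Henikoff & Henikoff position-based weighting, sketched below (not necessarily what pep2mat implements):

```python
from collections import Counter

def henikoff_weights(seqs):
    """Henikoff & Henikoff (1994) position-based sequence weights:
    at each alignment column, a residue occurring s times among r
    distinct residues contributes 1/(r*s) to each sequence carrying
    it, so redundant sequences end up sharing weight."""
    weights = [0.0] * len(seqs)
    for column in zip(*seqs):
        counts = Counter(column)
        r = len(counts)
        for k, aa in enumerate(column):
            weights[k] += 1.0 / (r * counts[aa])
    total = sum(weights)
    return [w / total for w in weights]  # normalised to sum to 1

seqs = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT",
        "ALAKAAAAV", "GMNERPILT", "GILGFVFTM", "TLNAWVKVV",
        "KLNEPVLLL", "AVVPFIVSV"]
w = henikoff_weights(seqs)
# Each of the five near-identical ALAKAAAA* peptides gets less
# weight than the unrelated GMNERPILT
print(w[0] < w[5])   # True
```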
Pseudo counts
• Sequence weighting and pseudo counts
• Motif found on more data

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
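A common way to blend observed counts with pseudo counts is pa = (α·fa + β·ga)/(α + β). The sketch below uses a flat prior ga = 1/20 purely for illustration; real tools typically derive ga from BLOSUM substitution probabilities, and the exact scheme used by pep2mat is not shown here:

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_count_freqs(counts, alpha, beta=50.0):
    """Effective frequencies p_a = (alpha*f_a + beta*g_a)/(alpha+beta).
    alpha weights the observed frequencies f_a (often the effective
    number of sequences minus 1); g_a is the pseudo-count frequency,
    here simply a flat prior 1/20 for illustration."""
    total = sum(counts.values())
    g = 1.0 / len(ALPHABET)
    return {a: (alpha * counts.get(a, 0) / total + beta * g) / (alpha + beta)
            for a in ALPHABET}

# Position with raw counts A:6, G:2, T:1, K:1 from ten peptides
p = pseudo_count_freqs({"A": 6, "G": 2, "T": 1, "K": 1}, alpha=9)
print(round(p["A"], 3))   # pulled down from 0.60 toward the prior
print(round(p["C"], 3))   # no longer exactly zero
```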
…and now you
• cp files from /usr/opt/www/pub/CBS/researchgroups/immunology/intro/HMM/exercise
• Make weight matrix and logos using
  – pep2mat -swt 2 -wlc 0 data > mat
  – mat2logo mat
  – ghostview logo.ps
• Include sequence weighting
  – pep2mat -swt 0 -wlc 0 data > mat
  – make and view logo
  – Try the other sequence weighting scheme (clustering), -swt 1. What difference does this make?
• Include pseudo counts
  – pep2mat data > mat
  – make and view logo
Weight matrices
• Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts
• Construct a weight matrix as
  Wij = log(pij/qj)
• Here i is a position in the motif and j an amino acid; qj is the prior (background) frequency of amino acid j
• W is an L × 20 matrix, where L is the motif length
• Score a sequence against the weight matrix by looking up and adding L values from the matrix
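The construction and scoring steps above can be sketched directly from the definitions; a toy 4-letter alphabet is used for brevity (the probabilities are made up):

```python
import math

def make_weight_matrix(p, q):
    """W[i][a] = log(p_ia / q_a): log-odds of seeing letter a at
    motif position i versus its background frequency q_a."""
    return [{a: math.log(p_i[a] / q[a]) for a in p_i} for p_i in p]

def score(seq, w):
    """Score a sequence against an L x |alphabet| weight matrix by
    looking up and summing one value per position."""
    return sum(w[i][aa] for i, aa in enumerate(seq))

# Toy 2-position motif over a 4-letter alphabet (illustration only)
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
w = make_weight_matrix(p, q)
print(score("AT", w))   # consensus: high score
print(score("CA", w))   # off-motif: low score
```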
Weight matrix predictions
• Use the program seq2hmm to evaluate the prediction accuracy of your weight matrix
  – seq2hmm -hmm mat -xs eval.set | grep -v # | args 2,3 | xycorr
  – What is going on here?
• By leaving out the -xs option you can generate the scores at each position in the sequence. This is often useful for neural network training.
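The final xycorr step presumably reports the (Pearson) correlation between the two columns piped into it, i.e. predicted scores versus target values; a minimal version:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two score columns,
    as a tool like xycorr would presumably compute it."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0: perfect agreement
```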
MHC class II prediction
• Complexity of the problem
  – Peptides of different length
  – Weak motif signal
• Alignment is crucial
• Gibbs Monte Carlo sampler

DRB1*0401 peptides:
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE
Gibbs sampler algorithm

Alignment by Gibbs sampler:
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTI

E = Σi,j pij · log(pij/qj)

Maximize E using Monte Carlo:
• Random change in offset
• Random shift of box position
• Always accept moves to higher E
• Accept moves to lower E with decreasing probability
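The acceptance rule in the last two bullets is the Metropolis criterion used in simulated-annealing style Monte Carlo; a minimal sketch (function name and temperature parameter are ours):

```python
import math
import random

def accept_move(delta_e, t):
    """Metropolis criterion: moves that increase E are always
    accepted; moves that decrease E are accepted with probability
    exp(delta_E / T), which shrinks as the temperature T is lowered
    during the run."""
    return delta_e > 0 or random.random() < math.exp(delta_e / t)

print(accept_move(0.5, t=1.0))    # uphill: always True
print(accept_move(-2.0, t=1.0))   # downhill: True with prob. exp(-2) ~ 0.14
```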
Gibbs sampler exercise
• The file classII.fsa is a FASTA file containing 50 class II epitopes
• gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat classII.fsa
  – The options -iw and -w 1,0,0,1,0,1,0,0,1 increase the matrix weight on important anchor positions in the binding motif
  – Make and view logo
• Use the matrix to predict class II epitopes
  – cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr
  – Do you understand what is going on in this command?
Hidden Markov Models
• Weight matrices do not deal with insertions and deletions
• In alignments, this is handled in an ad hoc manner by optimizing two gap penalties: one for opening a gap and one for extending it
• An HMM is a natural framework in which insertions and deletions are dealt with explicitly
HMM (a simple example)

Core of alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

• Example from A. Krogh
• The core region defines the number of states in the HMM (shown in red on the original slide)
• Insertion and deletion statistics are derived from the non-core part of the alignment (black)
[Figure: HMM state diagram for the example, with match states emitting over {A, C, G, T} and an insert state; the transition probabilities (0.8, 0.2, 0.4, 0.6, 1.0) and emission probabilities shown in the figure are the values used in the calculations below]
HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches: A, 2×C, T, G
• 5 transitions in the gap region
  – C out, G out
  – A→C, C→T, T out
  – Out transition: 3/5
  – Stay transition: 2/5
ACA---ATG: 0.8 × 1 × 0.8 × 1 × 0.8 × 0.4 × 1 × 1 × 0.8 × 1 × 0.2 = 3.3 × 10⁻²
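Multiplying the transition and emission probabilities along the path confirms the number on the slide:

```python
from functools import reduce
from operator import mul

# Transition and emission probabilities along the path of ACA---ATG
# through the model, as listed on the slide
factors = [0.8, 1, 0.8, 1, 0.8, 0.4, 1, 1, 0.8, 1, 0.2]
p = reduce(mul, factors, 1.0)
print(p)   # ~ 0.032768, i.e. 3.3 x 10^-2
```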
Align sequence to HMM
ACA---ATG: 0.8 × 1 × 0.8 × 1 × 0.8 × 0.4 × 1 × 1 × 0.8 × 1 × 0.2 = 3.3 × 10⁻²
TCAACTATC: 0.2 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.2 × 0.4 × 0.4 × 0.4 × 0.2 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 = 0.0075 × 10⁻²
ACAC--AGC = 1.2 × 10⁻²
AGA---ATC = 3.3 × 10⁻²
ACCG--ATC = 0.59 × 10⁻²
Consensus:
ACAC--ATC = 4.7 × 10⁻², ACA---ATC = 13.1 × 10⁻²
Exceptional:
TGCT--AGG = 0.0023 × 10⁻²
Align sequence to HMM - Null model
• The score depends strongly on length
• The null model is a random model: a sequence of length L has probability 0.25^L
• Log-odds score for a sequence S: log(P(S)/0.25^L)
• A positive score means the sequence is more likely under the model than under the null model

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
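The log-odds scores can be reproduced from the path probabilities; matching the numbers on the slide requires taking the natural log and using the ungapped sequence length for L (an assumption inferred from the values, not stated on the slide):

```python
import math

def log_odds(p_model, length):
    """Log-odds of a sequence versus the null model, which gives any
    DNA sequence of length L probability 0.25**L. Natural log, with
    L the ungapped sequence length (assumptions that reproduce the
    scores on the slide)."""
    return math.log(p_model / 0.25 ** length)

# ACA---ATG: model probability 3.3e-2, six emitted nucleotides
print(round(log_odds(0.032768, 6), 1))   # 4.9
```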