Sequence information, logos and Hidden Markov Models
Morten Nielsen, CBS, BioCentrum, DTU
Center for Biological Sequence Analysis (CBS), Technical University of Denmark (DTU)


Page 1: Sequence information, logos and Hidden Markov Models


Sequence information, logos and Hidden Markov Models
Morten Nielsen, CBS, BioCentrum, DTU

Page 2: Sequence information, logos and Hidden Markov Models


Information content

• Information and entropy
  – Conserved amino acid regions contain a high degree of information (high order == low entropy)
  – Variable amino acid regions contain a low degree of information (low order == high entropy)
• Shannon information: D = log2(N) + Σi pi log2(pi) (for proteins N = 20, for DNA N = 4)
• Conserved residue: pA = 1, pi≠A = 0, so D = log2(N) (≈ 4.3 for proteins)
• Variable region: pA = 0.05, pC = 0.05, …, so D = 0
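The two limiting cases above can be checked directly from the slide's formula. A minimal sketch (the helper name is made up for illustration):

```python
import math

def shannon_information(p, n=20):
    # D = log2(N) + sum_i p_i * log2(p_i); terms with p_i = 0 contribute nothing
    return math.log2(n) + sum(pi * math.log2(pi) for pi in p if pi > 0)

print(shannon_information([1.0]))        # conserved column: log2(20) ~ 4.32 bits
print(shannon_information([0.05] * 20))  # fully variable column: ~ 0 bits
```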

Page 3: Sequence information, logos and Hidden Markov Models


Sequence logo

• The height of a column equals D
• The relative height of letter i within the column is pi
• A highly useful tool to visualize sequence motifs

High-information positions

MHC class II logo from 10 sequences

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
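In a logo, letter i in a column is thus drawn with height pi · D. A minimal sketch for a single column, using the position-1 frequencies of the 10 example peptides that appear later in these slides (the helper name is illustrative):

```python
import math

def column_letter_heights(freqs, n=20):
    # D = log2(N) + sum_i p_i log2(p_i); each letter i gets height p_i * D
    d = math.log2(n) + sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return {aa: p * d for aa, p in freqs.items()}

# Position-1 frequencies of the 10 example peptides (A: 6/10, G: 2/10, T, K: 1/10)
heights = column_letter_heights({"A": 0.6, "G": 0.2, "T": 0.1, "K": 0.1})
for aa, h in sorted(heights.items(), key=lambda x: -x[1]):
    print(aa, round(h, 2))  # A is the tallest letter; column height ~ 2.75 bits
```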

Page 4: Sequence information, logos and Hidden Markov Models


Sequence information

• Description of binding motif
• Example (position 1 of the peptides below):
  PA = 6/10
  PG = 2/10
  PT = PK = 1/10
  PC = PD = … = PV = 0
• Problems
  – Few data
  – Data redundancy/duplication

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
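The raw frequencies quoted above come from straight counting over the 10 nine-mers. A minimal sketch reproducing them:

```python
from collections import Counter

peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

def position_frequencies(peps, pos):
    # Fraction of sequences carrying each amino acid at position `pos` (0-based)
    counts = Counter(p[pos] for p in peps)
    return {aa: c / len(peps) for aa, c in counts.items()}

print(position_frequencies(peptides, 0))  # {'A': 0.6, 'G': 0.2, 'T': 0.1, 'K': 0.1}
```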

Page 5: Sequence information, logos and Hidden Markov Models


Sequence information: raw sequence counting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Page 6: Sequence information, logos and Hidden Markov Models


Sequence weighting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
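The slides do not spell out the weighting scheme, but a common way to implement the clustering variant (hinted at by -swt 1 in the exercise slide) is to give each sequence weight 1/(size of its cluster), so the five near-identical ALAKAAAA* peptides together count as roughly one observation. A sketch under that assumption; the similarity threshold and helper name are illustrative, not taken from the CBS tools:

```python
def cluster_weights(peptides, threshold=0.63):
    # Greedy clustering: a peptide joins the first cluster whose representative
    # shares more than `threshold` identical positions; weight = 1 / cluster size.
    clusters = []
    for p in peptides:
        for cl in clusters:
            if sum(a == b for a, b in zip(p, cl[0])) / len(p) > threshold:
                cl.append(p)
                break
        else:
            clusters.append([p])
    return {p: 1 / len(cl) for cl in clusters for p in cl}

peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
w = cluster_weights(peptides)
print(w["ALAKAAAAM"])  # 0.2 -- the five redundant peptides share one unit of weight
```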

Page 7: Sequence information, logos and Hidden Markov Models


Pseudo counts

• Sequence weighting and pseudo counts
• The motif now looks as if it were found on more data

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
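The slide does not give the pseudo-count formula. A standard choice is to blend the observed frequency f with a background frequency g using two weights, p = (α·f + β·g)/(α + β), so that amino acids never observed in the data still get a small non-zero probability. A minimal sketch under that assumption (the α and β values below are illustrative, not taken from pep2mat):

```python
def pseudo_corrected(f, g, alpha, beta):
    # Blend observed frequency f with background g:
    #   p = (alpha*f + beta*g) / (alpha + beta)
    # alpha weights the real observations, beta the pseudo counts.
    return (alpha * f + beta * g) / (alpha + beta)

# Observed p_C = 0 at a position, background g_C = 0.05,
# alpha = 9 (10 sequences - 1) and beta = 200 are illustrative values
print(round(pseudo_corrected(0.0, 0.05, 9, 200), 4))  # small but non-zero
```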

Page 8: Sequence information, logos and Hidden Markov Models


…and now you

• cp files from /usr/opt/www/pub/CBS/researchgroups/immunology/intro/HMM/exercise
• Make a weight matrix and logos using
  – pep2mat -swt 2 -wlc 0 data > mat
  – mat2logo mat
  – ghostview logo.ps
• Include sequence weighting
  – pep2mat -swt 0 -wlc 0 data > mat
  – make and view the logo
  – Try the other sequence weighting scheme (clustering), -swt 1. What difference does this make?
• Include pseudo counts
  – pep2mat data > mat
  – make and view the logo

Page 9: Sequence information, logos and Hidden Markov Models


Weight matrices

• Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts
• Construct a weight matrix as
  Wij = log(pij/qj)
• Here i is a position in the motif and j an amino acid; qj is the prior (background) frequency of amino acid j
• W is an L x 20 matrix, where L is the motif length
• Score a sequence against the weight matrix by looking up and adding L values from the matrix
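Building such a matrix and scoring a peptide against it takes only a few lines. A sketch (the log base and the uniform background here are illustrative choices; zero frequencies would need pseudo counts first):

```python
import math

def make_weight_matrix(p, q):
    # p: list over motif positions of {aa: frequency}; q: {aa: background frequency}
    # W_ij = log(p_ij / q_j), with i a motif position and j an amino acid
    return [{aa: math.log(p_i[aa] / q[aa]) for aa in p_i} for p_i in p]

def score(matrix, peptide):
    # Look up and add L values from the matrix, one per motif position;
    # unobserved amino acids score -inf in this pseudo-count-free sketch
    return sum(row.get(aa, float("-inf")) for row, aa in zip(matrix, peptide))

# Tiny 2-position example with a uniform background of 1/20
q = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
p = [{"A": 0.6, "G": 0.2, "T": 0.1, "K": 0.1}, {"L": 0.8, "V": 0.2}]
W = make_weight_matrix(p, q)
print(round(score(W, "AL"), 2))  # log(0.6/0.05) + log(0.8/0.05)
```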

Page 10: Sequence information, logos and Hidden Markov Models


Weight matrix predictions

• Use the program seq2hmm to evaluate the prediction accuracy of your weight matrix
  – seq2hmm -hmm mat -xs eval.set | grep -v # | args 2,3 | xycorr
  – What is going on here?
• By leaving out the -xs option you can generate the score at each position in the sequence. This is often useful for neural network training.

Page 11: Sequence information, logos and Hidden Markov Models


MHC class II prediction

• Complexity of the problem
  – Peptides of different length
  – Weak motif signal
• Alignment is crucial
• Gibbs Monte Carlo sampler

RFFGGDRGAPKRGYLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPAGSLFVYNITTNKYKAFLDKQSALLSSDITASVNCAKPKYVHQNTLKLATGFKGEQGPKGEPDVFKELKVHHANENISRYWAIRTRSGGITYSTNEIDLQLSQEDGQTIE

DRB1*0401 peptides

Page 12: Sequence information, logos and Hidden Markov Models

CEN

TER

FO

R B

IOLO

GIC

AL S

EQ

UEN

CE A

NA

LY

SIS

TEC

HN

ICA

L U

NIV

ER

SIT

Y O

F D

EN

MA

RK

DTU

Gibbs sampling algorithm

RFFGGDRGAPKRG YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI

Alignment by Gibbs sampler

E = Σi,j pij · log(p′ij/qj)

Maximize E using Monte Carlo:
• Random change in offset
• Random shift of the box position
• Always accept moves to higher E
• Accept moves to lower E with decreasing probability
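The acceptance rule in the last two bullets is a Metropolis-style criterion. A sketch of just that step (the function name and temperature values are illustrative, not taken from the gibbss_mc tool):

```python
import math
import random

def accept_move(e_new, e_old, temperature):
    # Always accept moves to higher E; accept moves to lower E with
    # probability exp((E_new - E_old) / T), which shrinks as T is lowered
    if e_new >= e_old:
        return True
    return random.random() < math.exp((e_new - e_old) / temperature)

random.seed(0)
print(accept_move(2.0, 1.0, 1.0))   # uphill move: always accepted
print(accept_move(1.0, 2.0, 0.01))  # steep downhill at low T: almost surely rejected
```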

Page 13: Sequence information, logos and Hidden Markov Models


Gibbs sampler exercise

• The file classII.fsa is a FASTA file containing 50 class II epitopes
• gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat classII.fsa
  – The options -iw and -w 1,0,0,1,0,1,0,0,1 increase the matrix weight on important anchor positions in the binding motif
  – Make and view the logo
• Use the matrix to predict class II epitopes
  – cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr
  – Do you understand what is going on in this command?

Page 14: Sequence information, logos and Hidden Markov Models


Hidden Markov Models

• Weight matrices do not deal with insertions and deletions
• In alignments, this is handled in an ad hoc manner by optimizing two gap penalties, one for opening a gap and one for extending it
• An HMM is a natural framework in which insertions and deletions are dealt with explicitly

Page 15: Sequence information, logos and Hidden Markov Models


HMM (a simple example)

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

• Example from A. Krogh

• Core region defines the number of states in the HMM (red)

• Insertion and deletion statistics are derived from the non-core part of the alignment (black)

Core of alignment

Page 16: Sequence information, logos and Hidden Markov Models


[State diagram of the HMM built from the alignment below: match states emitting A/C/G/T, an insert state with a 0.4 self-transition, and transition probabilities (0.8, 0.2, 0.6, 1.0) as used in the calculation that follows.]

HMM construction

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

• 5 matches: A, 2xC, T, G
• 5 transitions in the gap region
  – C out, G out
  – A-C, C-T, T out
  – Out transition: 3/5
  – Stay transition: 2/5

ACA---ATG: 0.8 × 1 × 0.8 × 1 × 0.8 × 0.4 × 1 × 1 × 0.8 × 1 × 0.2 = 3.3 × 10^-2
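The probability of a sequence along its path through the model is simply the product of the emission and transition probabilities listed on the slide. Multiplying the factors confirms the number:

```python
from functools import reduce

# Emission and transition factors for ACA---ATG along its path, as on the slide
factors = [0.8, 1, 0.8, 1, 0.8, 0.4, 1, 1, 0.8, 1, 0.2]
p = reduce(lambda a, b: a * b, factors)
print(f"{p:.1e}")  # 3.3e-02
```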

Page 17: Sequence information, logos and Hidden Markov Models


Align sequence to HMM

ACA---ATG: 0.8 × 1 × 0.8 × 1 × 0.8 × 0.4 × 1 × 0.8 × 1 × 0.2 = 3.3 × 10^-2
TCAACTATC: 0.2 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.2 × 0.4 × 0.4 × 0.4 × 0.2 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 = 0.0075 × 10^-2
ACAC--AGC = 1.2 × 10^-2
AGA---ATC = 3.3 × 10^-2
ACCG--ATC = 0.59 × 10^-2

Consensus:
ACAC--ATC = 4.7 × 10^-2
ACA---ATC = 13.1 × 10^-2

Exceptional:
TGCT--AGG = 0.0023 × 10^-2

Page 18: Sequence information, logos and Hidden Markov Models


Align sequence to HMM - Null model

• The score depends strongly on sequence length
• The null model is a random model; for a sequence of length L its score is 0.25^L
• Log-odds score for a sequence S: log(P(S)/0.25^L)
• A positive score means the sequence is more likely under the HMM than under the null model

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6

Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3

Exceptional:
TGCT--AGG = -0.97
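The log-odds numbers on this slide can be reproduced directly; they match when the logarithm is the natural log and L counts only emitted symbols (gaps excluded). A minimal sketch:

```python
import math

def log_odds(p_seq, length):
    # log( P(S) / 0.25^L ), natural log; length = number of emitted symbols
    return math.log(p_seq / 0.25 ** length)

print(round(log_odds(3.3e-2, 6), 1))     # ACA---ATG emits 6 symbols -> 4.9
print(round(log_odds(0.0075e-2, 9), 1))  # TCAACTATC emits 9 symbols -> 3.0
```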
