T cell epitope predictions using bioinformatics (neural networks and hidden Markov models)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models) Morten Nielsen, CBS, BioCentrum, DTU

Upload: deana

Post on 16-Jan-2016


TRANSCRIPT

Page 1: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)

T cell epitope predictions using bioinformatics (neural networks and hidden Markov models)

Morten Nielsen, CBS, BioCentrum, DTU

Page 2: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Processing of intracellular proteins 

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

MHC binding

Page 3: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

What makes a peptide a potential and effective epitope?

• Part of a pathogen protein
• Successful processing
– Proteasome cleavage
– TAP binding
• Binds to MHC molecule
• Protein function
– Early in replication
• Sequence conservation in evolution

SARS virus

Page 4: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

From proteins to immunogens

Lauemøller et al., 2000

20% processed, 0.5% bind MHC, 50% CTL response

=> 1/2000 peptides are immunogenic
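The 1/2000 figure is the product of the three fractions quoted on the slide. A minimal check, assuming the three steps act as independent, successive filters (that independence is an assumption made here, not something the slide states); the variable names are illustrative:

```python
from fractions import Fraction

# Fractions quoted on the slide (Lauemøller et al., 2000).
processed = Fraction(20, 100)     # peptides that survive processing
bind_mhc = Fraction(5, 1000)      # 0.5% of peptides bind MHC
ctl_response = Fraction(50, 100)  # 50% of binders give a CTL response

# Assumption: the steps are independent, so the fractions multiply.
immunogenic = processed * bind_mhc * ctl_response
print(immunogenic)
```

Exact fractions are used instead of floats so that the result can be compared directly with 1/2000.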

Page 5: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Location of class I epitopes

GP1200 protein, structure (1GM9)

Page 6: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Page 7: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

MHC class I with peptide

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

Anchor positions

Page 8: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Prediction of HLA binding specificity

• Simple motifs
– Allowed/non-allowed amino acids
• Extended motifs
– Amino acid preferences (SYFPEITHI)
– Anchor/preferred/other amino acids
• Hidden Markov models
– Peptide statistics from sequence alignment
• Neural networks
– Can take sequence correlations into account
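The simplest method in the list above, an allowed/non-allowed motif, can be sketched in a few lines. This is only an illustration: the anchor positions (P2 and P9) and the residue sets in `ALLOWED` are assumptions chosen for the example, not a motif given on the slide.

```python
# Hypothetical simple motif for 9-mers: allowed residues at two anchor
# positions. Positions and residue sets are illustrative assumptions.
ALLOWED = {2: set("LM"), 9: set("VLI")}

def matches_simple_motif(peptide, allowed=ALLOWED):
    """Return True if every constrained position holds an allowed residue."""
    if len(peptide) != 9:
        return False
    return all(peptide[pos - 1] in residues for pos, residues in allowed.items())

for pep in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(pep, matches_simple_motif(pep))
```

A motif like this gives only a yes/no answer; the extended motifs, hidden Markov models and neural networks in the list replace that hard rule with graded scores.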

Page 9: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Where to get data?

• SYFPEITHI database
– 3500 peptides known to bind to MHC class I and II
– Only published data
• MHCpep
– 13000 peptides known to bind to MHC class I and II
– Published data and direct submission
– No update since 1998
• Binding affinity assays
– Quantitative data. How strongly does a peptide bind to the MHC molecule?
– Costly, and people do not publish negative results

Page 10: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Databases and web resources

• HLA Informatics Group, ANRI (HLA sequence database)
• IMGT/HLA Database (HLA sequence database)
• SYFPEITHI (Database of HLA Class I and II peptides)
• MHCPEP (Database of HLA Class I and II peptides)
• BIMAS (HLA Class I predictor)
• SYFPEITHI (HLA Class I predictor)
• NetMHC (HLA Class I prediction)

Page 11: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Sequence information

SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV
LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL
HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI
ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL
LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV
PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV
ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM
KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV
KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV
SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV
ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV
TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA
GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA
KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV
AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV
GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV
VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC
ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA
YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI
NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV
VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ
GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL
EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV
YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL
FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL
AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI
AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Page 12: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Sequence logo

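• Height of a column equals its information content, $I = \log_2 20 + \sum_a p_a \log_2 p_a$, where $p_a$ is the frequency of amino acid $a$ in the column
• Relative height of a letter is p (its frequency in the column)
• Highly useful tool to visualize sequence motifs

High information positions

MHC class I HLA-A0201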

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
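The column height used in a sequence logo can be computed directly from one alignment column. A minimal sketch, reading the slide's "log 20 + p log p" as a base-2 logarithm with a sum over the amino acids in the column (both readings are assumptions); the function names are illustrative:

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content in bits: log2(20) + sum_a p_a * log2(p_a)."""
    counts = Counter(column)
    total = sum(counts.values())
    plogp = sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) + plogp

def letter_heights(column, alphabet_size=20):
    """Height of each letter: its frequency p times the column height."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {aa: info * n / total for aa, n in Counter(column).items()}

print(column_information("LLLL"))                   # fully conserved column
print(column_information("ARNDCQEGHILKMFPSTWYV"))  # every amino acid once
```

A fully conserved column reaches the maximum of log2(20) bits, while a column where all 20 amino acids are equally frequent has height zero, which is why anchor positions stand out in the logo.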

Page 13: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Characterizing a binding motif

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

10 peptides known to bind MHC. What can we learn?

1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?

Page 14: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Sequence information

• Description of binding motif
• Example (amino acid frequencies at P1)
– PA = 6/10
– PG = 2/10
– PT = PK = 1/10
– PC = PD = … = PV = 0
• Problems
– Few data
– Data redundancy/duplication

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
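The example frequencies on this slide can be reproduced by counting the residues at position 1 of the ten peptides. A minimal sketch; the function name `position_frequencies` is illustrative:

```python
from collections import Counter

# The ten example peptides from the slide.
PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(peptides, position):
    """Raw frequency of each amino acid at a 1-based position."""
    counts = Counter(p[position - 1] for p in peptides)
    return {aa: n / len(peptides) for aa, n in counts.items()}

p1 = position_frequencies(PEPTIDES, 1)
print(p1)
```

Amino acids that never occur at a position are simply absent from the result, which is the zero-frequency problem that pseudo counts address.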

Page 15: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Sequence information

Raw sequence counting

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Page 16: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Pseudo-count and sequence weighting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

(The first five peptides are similar sequences: weight 1/5 each.)

• Poor or biased sampling of sequence space
• I is not found at position P9. Does this mean that I is forbidden?
• No! Use a BLOSUM substitution matrix to estimate the pseudo frequency of I at P9
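One common way to turn a substitution matrix into pseudo frequencies is to spread each observed frequency over the amino acids it tends to be substituted by, and then mix the result with the observed frequencies. The sketch below assumes that scheme, g_a = Σ_b f_b · q(a|b) followed by p_a = (α·f_a + β·g_a)/(α + β); the slide does not spell out the formula. The conditional probabilities in `TOY_CONDITIONAL` are invented for a three-letter toy alphabet and are not BLOSUM values.

```python
# Made-up substitution probabilities q(a|b) for a toy alphabet {V, L, I}.
# Each row sums to 1. These numbers are NOT taken from a BLOSUM matrix.
TOY_CONDITIONAL = {
    "V": {"V": 0.6, "L": 0.2, "I": 0.2},
    "L": {"V": 0.2, "L": 0.6, "I": 0.2},
    "I": {"V": 0.2, "L": 0.2, "I": 0.6},
}

def pseudo_frequencies(observed, conditional):
    """g[a] = sum over b of observed[b] * q(a|b)."""
    return {a: sum(observed.get(b, 0.0) * conditional[b][a] for b in conditional)
            for a in conditional}

def mix(observed, pseudo, alpha, beta):
    """Weighted average of observed and pseudo frequencies."""
    return {a: (alpha * observed.get(a, 0.0) + beta * pseudo[a]) / (alpha + beta)
            for a in pseudo}

observed = {"V": 0.5, "L": 0.5}  # I was never observed at this position
pseudo = pseudo_frequencies(observed, TOY_CONDITIONAL)
estimate = mix(observed, pseudo, alpha=1.0, beta=1.0)
print(estimate)
```

Even though I is never observed in this toy column, its estimated frequency is non-zero because it is a likely substitute for V and L. The weights `alpha` and `beta` are free parameters here; how they were chosen in the lecture is not stated.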

Page 17: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

The BLOSUM matrix

Page 18: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Sequence weighting

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
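Sequence weighting can be sketched by grouping near-identical peptides and giving each member of a group of size n a weight of 1/n. The greedy clustering and the 0.8 identity threshold below are assumptions made for illustration; the lecture may have used a different weighting scheme.

```python
from collections import defaultdict

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identity(a, b):
    """Fraction of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(peptides, min_identity=0.8):
    """Greedy clustering; each member of a cluster of size n gets weight 1/n."""
    clusters = []  # lists of indices; the first index is the representative
    for i, pep in enumerate(peptides):
        for cluster in clusters:
            if identity(pep, peptides[cluster[0]]) >= min_identity:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    weights = [0.0] * len(peptides)
    for cluster in clusters:
        for i in cluster:
            weights[i] = 1.0 / len(cluster)
    return weights

def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    totals = defaultdict(float)
    for pep, w in zip(peptides, weights):
        totals[pep[position - 1]] += w
    norm = sum(weights)
    return {aa: t / norm for aa, t in totals.items()}

weights = cluster_weights(PEPTIDES)
p1_weighted = weighted_frequencies(PEPTIDES, weights, 1)
print(weights)
print(p1_weighted)
```

With the five ALAKAAAA* peptides down-weighted to 1/5 each, A at P1 drops from a raw frequency of 6/10 to 2/6, the same as G.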

Page 19: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Pseudo counts

• Sequence weighting and pseudo count
– Prediction accuracy 0.60
• Motif found on all data (485)
– Prediction accuracy 0.79

Page 20: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Weight matrices

• Estimate amino acid frequencies from alignment
• Now a weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$
– Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
• W is an L x 20 matrix, where L is the motif length
• Score sequences to the weight matrix by looking up and adding L values from the matrix
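The construction above can be sketched as follows. The uniform background q_j = 1/20, the natural logarithm, and the toy two-position frequencies are assumptions made for the example; a real matrix would use measured background frequencies and pseudo-count-corrected p_ij so that no entry is log(0).

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def weight_matrix(frequencies, background=None):
    """W[i][j] = log(p_ij / q_j); frequencies is one dict per motif position."""
    if background is None:
        # Assumption: uniform background frequency q_j = 1/20.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [{aa: math.log(p[aa] / background[aa]) for aa in AMINO_ACIDS}
            for p in frequencies]

def score(peptide, matrix):
    """Look up one value per position and add them up."""
    return sum(row[aa] for row, aa in zip(matrix, peptide))

# Toy motif of length 2: position 1 strongly prefers L, position 2 is uniform.
pos1 = {aa: 0.02 for aa in AMINO_ACIDS}
pos1["L"] = 0.62
pos2 = {aa: 0.05 for aa in AMINO_ACIDS}
W = weight_matrix([pos1, pos2])

print(score("LA", W), score("AA", W))
```

Positive entries mark amino acids that are more frequent than background at that position, negative entries those that are less frequent, so the summed score is a log-odds score for the whole peptide.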

Page 21: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Scoring sequences to a weight matrix

      A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
1   0.6   0.4  -3.5  -2.4  -0.4  -1.9  -2.7   0.3  -1.1   1.0   0.3   0.0   1.4   1.2  -2.7   1.4  -1.2  -2.0   1.1   0.7
2  -1.6  -6.6  -6.5  -5.4  -2.5  -4.0  -4.7  -3.7  -6.3   1.0   5.1  -3.7   3.1  -4.2  -4.3  -4.2  -0.2  -5.9  -3.8   0.4
3   0.2  -1.3   0.1   1.5   0.0  -1.8  -3.3   0.4   0.5  -1.0   0.3  -2.5   1.2   1.0  -0.1  -0.3  -0.5   3.4   1.6   0.0
4  -0.1  -0.1  -2.0   2.0  -1.6   0.5   0.8   2.0  -3.3   0.1  -1.7  -1.0  -2.2  -1.6   1.7  -0.6  -0.2   1.3  -6.8  -0.7
5  -1.6  -0.1   0.1  -2.2  -1.2   0.4  -0.5   1.9   1.2  -2.2  -0.5  -1.3  -2.2   1.7   1.2  -2.5  -0.1   1.7   1.5   1.0
6  -0.7  -1.4  -1.0  -2.3   1.1  -1.3  -1.4  -0.2  -1.0   1.8   0.8  -1.9   0.2   1.0  -0.4  -0.6   0.4  -0.5  -0.0   2.1
7   1.1  -3.8  -0.2  -1.3   1.3  -0.3  -1.3  -1.4   2.1   0.6   0.7  -5.0   1.1   0.9   1.3  -0.5  -0.9   2.9  -0.4   0.5
8  -2.2   1.0  -0.8  -2.9  -1.4   0.4   0.1  -0.4   0.2  -0.0   1.1  -0.5  -0.5   0.7  -0.3   0.8   0.8  -0.7   1.3  -1.1
9  -0.2  -3.5  -6.1  -4.5   0.7  -0.8  -2.5  -4.0  -2.6   0.9   2.8  -3.0  -1.8  -1.4  -6.2  -1.9  -1.6  -4.9  -1.6   4.5

Which peptide is most likely to bind? Which peptide second?

ILYQVPFSV   15.0
ALPYWNFAT   -3.4
MTAQWWLDA    0.8
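The three scores can be reproduced by looking up one matrix value per position and summing. A minimal sketch using the matrix values exactly as printed on the slide (one row per peptide position, columns in the order A R N D C Q E G H I L K M F P S T W Y V):

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Matrix rows as printed on the slide (rounded to one decimal).
ROWS = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0, 0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0, 5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0, 0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1, -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2, -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8, 0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6, 0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0, 1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9, 2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]
MATRIX = [dict(zip(AMINO_ACIDS, row)) for row in ROWS]

def score(peptide):
    """Sum the matrix entry for the residue found at each position."""
    return sum(MATRIX[i][aa] for i, aa in enumerate(peptide))

for pep in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(pep, round(score(pep), 1))
```

With these rounded entries the first two peptides sum to the slide's 15.0 and -3.4, while the third sums to 0.7 rather than 0.8, presumably because the slide's total was computed from unrounded matrix values. The ranking is unchanged: ILYQVPFSV first, MTAQWWLDA second.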

Page 22: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

How to predict

• The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
– Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
• Artificial neural networks (ANN) are ideally suited to take such correlations into account

Page 23: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Neural networks

• Neural networks can learn higher order correlations!
– What does this mean?

A A => 0
A C => 1
C A => 1
C C => 0

No linear function can learn this pattern
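The claim can be illustrated, though not proven, by searching a grid of linear threshold functions. In the sketch below A is encoded as 0 and C as 1 (an encoding chosen here), and the grid of weights is arbitrary. No setting reproduces the pattern above, whereas a simple OR pattern is found immediately.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # A A, A C, C A, C C with A=0, C=1

def find_linear_rule(targets, grid):
    """Return (w1, w2, b) with (w1*x1 + w2*x2 + b > 0) == target, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in zip(INPUTS, targets)):
            return (w1, w2, b)
    return None

GRID = [i * 0.5 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0

xor_rule = find_linear_rule([0, 1, 1, 0], GRID)  # the pattern on the slide
or_rule = find_linear_rule([0, 1, 1, 1], GRID)   # a linearly separable pattern
print(xor_rule, or_rule)
```

The search failing on a finite grid is only suggestive. The general result follows because the four constraints are contradictory: the two "1" cases require w1 + b > 0 and w2 + b > 0, so w1 + w2 + 2b > 0, while the two "0" cases require b <= 0 and w1 + w2 + b <= 0, which add up to w1 + w2 + 2b <= 0.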

Page 24: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Neural networks

[Figure: neural network diagram with weights w11, w12, w21, w22, v1 and v2]

XOR(x1,x2) = (x1 + x2) − 2 ⋅ x1 ⋅ x2 = y − z

y = x1 + x2

z = 2 ⋅ x1 ⋅ x2
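The identity above, and the fact that a network with one hidden layer can represent it, can both be checked directly. In the sketch below the step activation and the hand-picked weights are assumptions chosen for illustration; they are not the w and v values from the slide's diagram.

```python
def xor_formula(x1, x2):
    """XOR written as y - z with y = x1 + x2 and z = 2 * x1 * x2."""
    y = x1 + x2
    z = 2 * x1 * x2
    return y - z

def step(t):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if t > 0 else 0

def xor_network(x1, x2):
    """Two hidden units and one output unit with hand-picked weights."""
    h1 = step(x1 + x2 - 0.5)    # fires if at least one input is on (OR)
    h2 = step(x1 + x2 - 1.5)    # fires only if both inputs are on (AND)
    return step(h1 - h2 - 0.5)  # OR but not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_formula(x1, x2), xor_network(x1, x2))
```

The hidden layer is what makes the difference: each hidden unit is itself a linear threshold function, but the output unit combines them into a pattern that no single linear function can produce.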

Page 25: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

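Evaluation of prediction accuracy

• ROC curves
– True positive proportion = TP/AP
– False positive proportion = FP/AN
– AP = actual positives, AN = actual negatives
– [Figure: ROC curves with Aroc = 0.5 and Aroc = 0.8; regions TP, FP, AP and AN labelled]
• Pearson correlation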
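Both measures can be computed from a list of prediction scores. A minimal sketch in plain Python; the function names are illustrative, and the area under the ROC curve is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative) rather than by integrating the curve.

```python
import math

def roc_point(scores, labels, threshold):
    """(TP/AP, FP/AN) when everything scoring >= threshold is called positive."""
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
    fp = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
    ap = sum(1 for l in labels if l)
    an = len(labels) - ap
    return tp / ap, fp / an

def roc_auc(scores, labels):
    """Aroc: fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

An Aroc of 0.5 corresponds to random ranking and 1.0 to perfect separation. Aroc needs a binder/non-binder label for each peptide, whereas the Pearson correlation compares predicted and measured values directly.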

Page 26: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Epitope predictions: sequence motif and HMMs

Sequence motif: cc = 0.76, Aroc = 0.92

HMM: cc = 0.80, Aroc = 0.95

Page 27: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Epitope prediction: neural networks

cc = 0.91, Aroc = 0.98

Page 28: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Evaluation of prediction accuracy

[Figure: bar chart comparing Motif, HMM and ANN by Pearson correlation (Pear) and Aroc; vertical axis from 0 to 1]

Page 29: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Hepatitis C virus. Epitope predictions 

Page 30: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Proteasomal cleavage

• NetChop (http://www.cbs.dtu.dk/services/NetChop-3.0/)
– Epitopes have strong C terminal cleavage
– Epitopes can have strong internal cleavage sites
• Selection strategy
– High binding peptides
– High cleavage probability at C terminal

NMVPFFPPV
..S.....S
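The selection strategy can be sketched as a filter over candidate peptides. The sketch assumes that the mark string under the peptide flags predicted cleavage sites with `S`, one character per residue, as in the example above; the binding scores, the threshold, and the second candidate are invented for illustration.

```python
def cleavage_sites(marks, site_char="S"):
    """1-based positions flagged as predicted cleavage sites."""
    return [i + 1 for i, m in enumerate(marks) if m == site_char]

def select_candidates(candidates, min_binding):
    """Keep peptides that bind well and are cleaved at their C terminal.

    candidates: iterable of (peptide, marks, binding_score) tuples.
    """
    selected = []
    for peptide, marks, binding in candidates:
        if binding >= min_binding and marks[-1] == "S":
            selected.append(peptide)
    return selected

# Hypothetical binding scores; only the first mark string comes from the slide.
candidates = [
    ("NMVPFFPPV", "..S.....S", 0.9),
    ("AAAAAAAAA", ".........", 0.9),
]
print(cleavage_sites("..S.....S"))
print(select_candidates(candidates, min_binding=0.5))
```

Internal sites such as position 3 in the example are not used to reject a peptide here, in line with the slide's remark that epitopes can have strong internal cleavage sites.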

Page 31: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

What now?

• 29 March: Introduction to hidden Markov models and weight matrices
• 5 April: Introduction to neural networks
• 12 April: Introduction to the project
• 10 May: Hand in the project