
Upload: nguyencong

Post on 08-May-2018


Intro Protein structure Motifs Motif databases End

Last time

• Probability-based methods
• How to find a good root?
• Reliability
• Reconciliation analysis


Today

• Intro to protein structure
• Motifs and domains


"First dogma of Bioinformatics"

Sequence → structure → function

• Want to avoid determining structure
  • Expensive
  • Difficult
  • Sometimes impossible?
• Bioinfo dream: Structure from sequence!
  • "How does the protein fold?"


Ab initio folding?

• Folding from sequence seems out of reach

• But...:


What to do in silico?

1. Compromise and use what you’ve got. "Recycle" structures.
2. Find and understand protein building blocks: motifs and domains.
3. Identify certain protein types: transmembrane proteins.
4. "Why bother? Sequences are informative!"


Example 1: Motifs and domains

(Bjarnadottir et al, 2004)

Some typical G-protein coupled receptors
Small circles: glycosylation sites
Other symbols: domains


Example 2: Domains and structure


Our goals

Motifs: Representation and use
Domains: Definitions, hidden Markov models (HMM), applications, databases
PSI-Blast: Sensitive search tool
Secondary structure: In general and the TM special case


Motifs

• Short subsequences, DNA or AA, 5–20 positions long.
• Foremost application: binding sites
• Motifs grouped in families. Confused terminology.
• Fingerprints: Combinations of motifs


Motif representation

MTWDNRLAAFAQNYANQRA
MTWDNRLAAYAQNYANQRI
MTWDNRLAAYAQNYANQRI
MTWDDGLAAYAQNYANQRA
VSWSTKLQAYAQSYANQRI
LTWDDQVAAYAQNYASQLA
LTWDDQVAAYAQNYASQLA
LTWDDQVAAYAQNYASQLA
VSWSTKLQGFAQSYANQRI
MSWDANLASRAQNYANSRA
VSWSTKLQAFAQNYANQRI
LRWDEKVAAYARNYANQRK
LRWDEKVAAYARNYANQRK
VSWSTKLQAFAQNYANQRI
LVWNDELAQIAQVWANQCN
LVWNDELAQIAQVWANQCN
LTWDDEVAAYAQNYVSQLA
LTWDDQVAAYAQNYASQLA
VSWSTKLQAFAQNYANQRI
LVWSDELAYIAQVWANQCQ
LVWNDELAYVAQVWANQCQ
...
(shortened)

Motif "V5TPXLIKE": 95 seqs, width 19.

• Multialignment
• Pattern notation, e.g.: [LMV]-[RSTV]-W-[DSN]-...
• Profiles and PSSM, PWM
• Visualize with sequence logo
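
The pattern notation maps closely onto a regular expression. The following is a minimal sketch, assuming PROSITE-style syntax in which "-" separates positions and "x" stands for any residue. The helper name pattern_to_regex is hypothetical, and only the four positions shown on the slide are used because the rest of the pattern is elided.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a regular expression.

    Only bracketed residue sets, single residues and 'x' (any residue) are
    handled; repeat counts and exclusion sets are left out of this sketch.
    """
    parts = []
    for element in pattern.split("-"):
        parts.append("." if element == "x" else element)
    return "".join(parts)


# Only the first four positions of the slide's pattern are known.
regex = re.compile(pattern_to_regex("[LMV]-[RSTV]-W-[DSN]"))

# Three of the aligned sequences from the slide.
sequences = ["MTWDNRLAAFAQNYANQRA", "VSWSTKLQAYAQSYANQRI", "LRWDEKVAAYARNYANQRK"]
matches = [bool(regex.match(s)) for s in sequences]
print(matches)
```

A pattern either matches or it does not, which is why the slides move on to profiles and scoring matrices that can grade a near miss.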


Sequence logo

PSSM of PR00837A (V5TPXLIKE), 95 sequences.

[Figure: sequence logo over positions 1–19; vertical axis in bits, 0–4]

• Stack height indicates conservation
  (Too many details: Height is the Kullback-Leibler distance to the uniform distribution)
• Symbol height proportional to frequency
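
The two height rules can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular logo tool: the function names are hypothetical, the alphabet is taken to be the 20 amino acids, and small-sample corrections that real tools may apply are omitted. The example columns come from the 21 sequences shown on the slide, not all 95, so the values need not match the figure.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Stack height in bits: KL divergence of the column from the uniform distribution."""
    total = len(column)
    bits = 0.0
    for count in Counter(column).values():
        p = count / total
        # p * log2(p / (1 / alphabet_size)), summed over the observed symbols
        bits += p * math.log2(p * alphabet_size)
    return bits


def symbol_heights(column: str, alphabet_size: int = 20) -> dict:
    """Height of each symbol: its frequency times the stack height."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {r: info * count / total for r, count in Counter(column).items()}


# Columns 1 and 3 of the 21 sequences shown on the slide.
column_1 = "MMMMVLLLVMVLLVLLLLVLL"
column_3 = "W" * 21

print(column_information(column_1), symbol_heights(column_1))
print(column_information(column_3))
```

A fully conserved column reaches the maximum of log2(20) bits, which is why the vertical axis of a protein logo stops just above 4.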


Start of translation

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html


Phosphorylation site, PKA

(Blom et al, 1998)


Profiles

• Multialignments convenient
• Patterns sparse with information
• Logos are pretty pictures!
• Profile: Matrix F with frequency information

$F_{r,c}$ is the fraction of r in position c

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

[Figure: sequence logo of the profile (WebLogo 3.0b14); vertical axis in bits, 0.0–2.0]


Profiles

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

• $F_{r,c} = n_{r,c}/n$, where $n_{r,c}$ is the number of r in position c, and $n$ is the sequence count.
• For A in position 1: $n_{A,1} = 12$ and $n = 20$, so $F_{A,1} = 12/20 = 0.6$
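
Counting residues per column is all that is needed to build such a profile. The following is a minimal sketch, assuming an ungapped alignment of equal-length sequences. The function name build_profile and the four-sequence toy alignment are hypothetical, since the 20 sequences behind the slide's profile are not shown.

```python
from collections import Counter


def build_profile(alignment, alphabet="ACGT"):
    """Return F[r][c] = n_{r,c} / n for an ungapped alignment."""
    n = len(alignment)          # sequence count
    width = len(alignment[0])   # number of positions
    profile = {r: [0.0] * width for r in alphabet}
    for c in range(width):
        counts = Counter(seq[c] for seq in alignment)
        for r in alphabet:
            profile[r][c] = counts[r] / n
    return profile


# Hypothetical toy alignment, not the data behind the slide.
toy_alignment = ["ATC", "AGC", "GTC", "ACC"]
toy_profile = build_profile(toy_alignment)
for residue, row in toy_profile.items():
    print(residue, row)
```

Each column of the result sums to 1, so a column can be read as a probability distribution over residues.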


Profiles

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

• Probability of AACATT being "produced" by profile:

$0.6 \times 0.15 \times 1.0 \times 0.2 \times 0.5 \times 0.45 = 0.00405$

• Is that good? Interpretation?
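
One way to approach the interpretation question is to compare the profile probability with what a background model would give for the same sequence. The sketch below assumes a uniform background of 0.25 per nucleotide; the function names are hypothetical, and the profile values are copied from the table above.

```python
import math

# Profile F from the slide: PROFILE[r][c] is the fraction of r in position c + 1.
PROFILE = {
    "A": [0.6, 0.15, 0.0, 0.2, 0.0, 0.0],
    "C": [0.0, 0.25, 1.0, 0.4, 0.0, 0.55],
    "G": [0.3, 0.25, 0.0, 0.4, 0.5, 0.0],
    "T": [0.1, 0.35, 0.0, 0.0, 0.5, 0.45],
}


def profile_probability(seq, profile=PROFILE):
    """Product of the per-position frequencies of the residues in seq."""
    return math.prod(profile[r][c] for c, r in enumerate(seq))


def background_probability(seq, alphabet_size=4):
    """Probability of seq if every residue is equally likely (an assumption)."""
    return (1 / alphabet_size) ** len(seq)


p_profile = profile_probability("AACATT")
p_background = background_probability("AACATT")
print(p_profile, p_background, p_profile / p_background)
```

The raw probability looks small, but it is several times larger than the background probability of the same six letters, and that ratio is what a log-odds score captures.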


PSSM: Better than profile

• Want a log-odds score!
• PSSM = Position Specific Scoring Matrix
• $M_{r,c} = 10 \log_2(F_{r,c}/\pi_r)$, where $\pi_r$ is the frequency of r in our data.
• Let $\pi_A = \pi_C = \pi_G = \pi_T = 0.25$.
• $M_{A,1} = 10 \log_2(F_{A,1}/0.25) = 10 \log_2(0.6/0.25) = 12.6$
• $M_{C,2} = 10 \log_2(F_{C,2}/0.25) = 10 \log_2(0.25/0.25) = 0$


PSSM M from our profile F

Profile:

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

PSSM:

Pos:  1      2      3      4      5      6
A     12.6   -7.4   -∞     -3.2   -∞     -∞
C     -∞     0.0    20.0   6.8    -∞     11.4
G     2.63   0.0    -∞     6.8    10.0   -∞
T     -13.2  4.9    -∞     -∞     10.0   8.5

Score for AACATT:

12.6 - 7.4 + 20.0 - 3.2 + 10.0 + 8.5 = 40.5
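
The conversion from profile to PSSM and the scoring step can be sketched as follows. This is an illustration, assuming the uniform background of 0.25 used on the slides; the function names build_pssm and score are hypothetical. Entries with frequency 0 are set to negative infinity, as in the table.

```python
import math

# Profile F from the slide.
PROFILE = {
    "A": [0.6, 0.15, 0.0, 0.2, 0.0, 0.0],
    "C": [0.0, 0.25, 1.0, 0.4, 0.0, 0.55],
    "G": [0.3, 0.25, 0.0, 0.4, 0.5, 0.0],
    "T": [0.1, 0.35, 0.0, 0.0, 0.5, 0.45],
}
BACKGROUND = 0.25  # uniform background frequency, as on the slide


def build_pssm(profile, background=BACKGROUND):
    """M[r][c] = 10 * log2(F[r][c] / background); -inf where F[r][c] is 0."""
    return {
        r: [10 * math.log2(f / background) if f > 0 else -math.inf for f in freqs]
        for r, freqs in profile.items()
    }


def score(seq, pssm):
    """Sum of the PSSM entries selected by the residues of seq."""
    return sum(pssm[r][c] for c, r in enumerate(seq))


pssm = build_pssm(PROFILE)
print(round(pssm["A"][0], 1))            # M_{A,1}
print(round(score("AACATT", pssm), 1))   # score for AACATT
```

Because the entries are logarithms, the product of probability ratios becomes a sum, which is cheap to compute and easy to compare across candidate sites.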


Generalizing with PSSM?

How to handle a new variant of a motif?

Pos:  1      2      3      4      5      6
A     12.6   -7.4   -∞     -3.2   -∞     -∞
C     -∞     0.0    20.0   6.8    -∞     11.4
G     2.63   0.0    -∞     6.8    10.0   -∞
T     -13.2  4.9    -∞     -∞     10.0   8.5

Score for ATCTTT?

12.6 + 4.9 + 20.0 - ∞ + 10.0 + 8.5 = 56 - ∞ = -∞


"Pseudo counts" for profiles

• Idea: Pretend you have seen all possible motifs
• Pseudo counts: $\alpha_r$ is the number of "pseudo observations" of r.
• Include in profile calculations:

$F_{r,c} = \frac{n_{r,c} + \alpha_r}{n + \sum_r \alpha_r}$

• Example 1: Let $\alpha_A = \alpha_C = \alpha_G = \alpha_T = 1$. $F_{A,1} = \frac{12+1}{20+4} = 0.54$.
• Example 2: We had $n_{C,1} = 0$. $F_{C,1} = \frac{0+1}{20+4} = 0.042$
• Result: Can use PSSM to find novel motifs
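
The effect of pseudo counts on the ATCTTT score can be checked with a short sketch. The count matrix below is an assumption: it is obtained by multiplying the slide's profile by n = 20, and only $n_{A,1} = 12$ and $n_{C,1} = 0$ are stated on the slides. The function names are hypothetical, and the background is the uniform 0.25 used earlier.

```python
import math

# Assumed counts: slide profile multiplied by n = 20.
COUNTS = {
    "A": [12, 3, 0, 4, 0, 0],
    "C": [0, 5, 20, 8, 0, 11],
    "G": [6, 5, 0, 8, 10, 0],
    "T": [2, 7, 0, 0, 10, 9],
}


def profile_with_pseudocounts(counts, alpha=1.0):
    """F[r][c] = (n_{r,c} + alpha) / (n + alpha * alphabet size)."""
    n = sum(column[0] for column in counts.values())  # sequence count
    total_alpha = alpha * len(counts)
    return {
        r: [(count + alpha) / (n + total_alpha) for count in column]
        for r, column in counts.items()
    }


def pssm_from_profile(profile, background=0.25):
    """Log-odds matrix; every entry is finite because no frequency is 0."""
    return {
        r: [10 * math.log2(f / background) for f in column]
        for r, column in profile.items()
    }


def score(seq, pssm):
    return sum(pssm[r][c] for c, r in enumerate(seq))


F = profile_with_pseudocounts(COUNTS)
M = pssm_from_profile(F)
print(round(F["A"][0], 2), round(F["C"][0], 3))
print(score("ATCTTT", M), score("AACATT", M))
```

With pseudo counts the unseen variant ATCTTT gets a finite score that is penalized at position 4 rather than ruled out, so it can be ranked against other candidates.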


Fast motif searches

• Motifs are small, therefore easy to search with. Fast.
• BLAST variants exist for motifs.
• E-value theory is the same thanks to the log-odds score!
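
To see why a short motif is cheap to search with, here is a naive sliding-window scan. It is only a sketch and not how BLAST variants work: the function name scan, the query sequence and the threshold of 30 are hypothetical, and the PSSM values are the rounded ones from the earlier table.

```python
import math

NEG_INF = -math.inf

# Rounded PSSM from the slide (no pseudo counts).
PSSM = {
    "A": [12.6, -7.4, NEG_INF, -3.2, NEG_INF, NEG_INF],
    "C": [NEG_INF, 0.0, 20.0, 6.8, NEG_INF, 11.4],
    "G": [2.63, 0.0, NEG_INF, 6.8, 10.0, NEG_INF],
    "T": [-13.2, 4.9, NEG_INF, NEG_INF, 10.0, 8.5],
}


def scan(sequence, pssm, threshold):
    """Score every window of motif width; return (start, window, score) hits."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = sum(pssm[r][c] for c, r in enumerate(window))
        if window_score >= threshold:
            hits.append((start, window, window_score))
    return hits


# Hypothetical query sequence containing AACATT at offset 2.
hits = scan("GGAACATTCC", PSSM, threshold=30.0)
print(hits)
```

The work grows with sequence length times motif width, so a motif of 5 to 20 positions adds only a small constant factor per scanned residue.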


Motif databases

PROSITE: Important binding sites
  "What motifs does my protein have?"
  • Profiles
  • Pattern notation
  • Careful documentation

BLOCKS: Origin of BLOSUM.
  Presents multialignments!
  Assembled from the most conserved parts of domains.

PRINTS: "What motif combinations does my protein have?"


Next time

• PSI-Blast
• Protein domains
• Domain databases
• Hidden Markov models?