TRANSCRIPT
Intro Protein structure Motifs Motif databases End
Last time
• Probability-based methods
• How to find a good root?
• Reliability
• Reconciliation analysis
Today
• Intro to protein structure
• Motifs and domains
"First dogma of Bioinformatics"
Sequence → structure → function
• Want to avoid determining structure
  • Expensive
  • Difficult
  • Sometimes impossible?
• Bioinfo dream: Structure from sequence!
  • "How does the protein fold?"
Ab initio folding?
• Folding from sequence seems out of reach
• But...:
What to do in silico?
1. Compromise and use what you've got. "Recycle" structures.
2. Find and understand protein building blocks: motifs and domains.
3. Identify certain protein types: transmembrane proteins.
4. "Why bother? Sequences are informative!"
Example 1: Motifs and domains
(Bjarnadottir et al., 2004)
Some typical G-protein coupled receptors.
Small circles: glycosylation sites.
Other symbols: domains.
Our goals
Motifs: Representation and use
Domains: Definitions, hidden Markov models (HMM), applications, databases
PSI-BLAST: Sensitive search tool
Secondary structure: In general and the transmembrane (TM) special case
Motifs
• Short subsequences, DNA or AA, 5–20 positions long.
• Foremost application: binding sites
• Motifs grouped in families. Confused terminology.
• Fingerprints: Combinations of motifs
Motif representation
Motif "V5TPXLIKE": 95 seqs, width 19.

MTWDNRLAAFAQNYANQRA
MTWDNRLAAYAQNYANQRI
MTWDNRLAAYAQNYANQRI
MTWDDGLAAYAQNYANQRA
VSWSTKLQAYAQSYANQRI
LTWDDQVAAYAQNYASQLA
LTWDDQVAAYAQNYASQLA
LTWDDQVAAYAQNYASQLA
VSWSTKLQGFAQSYANQRI
MSWDANLASRAQNYANSRA
VSWSTKLQAFAQNYANQRI
LRWDEKVAAYARNYANQRK
LRWDEKVAAYARNYANQRK
VSWSTKLQAFAQNYANQRI
LVWNDELAQIAQVWANQCN
LVWNDELAQIAQVWANQCN
LTWDDEVAAYAQNYVSQLA
LTWDDQVAAYAQNYASQLA
VSWSTKLQAFAQNYANQRI
LVWSDELAYIAQVWANQCQ
LVWNDELAYVAQVWANQCQ
...
(shortened)

• Multialignment
• Pattern notation, e.g.: [LMV]-[RSTV]-W-[DSN]-...
• Profiles and PSSM, PWM
• Visualize with a sequence logo
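The pattern notation can be tried out directly. The following is only a minimal sketch: the helper `prosite_to_regex` is a hypothetical name, and it assumes a simplified pattern syntax with just three kinds of elements (a single residue, a bracketed set of allowed residues, and the wildcard `x`). Only the first four pattern positions shown on the slide are used, since the rest of the pattern is elided there.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Assumption of this sketch: elements are separated by '-' and are either
    a single residue (W), a set of allowed residues ([LMV]), or the wildcard x.
    """
    parts = []
    for element in pattern.split("-"):
        if element.lower() == "x":
            parts.append(".")        # wildcard: any residue
        else:
            parts.append(element)    # 'W' and '[LMV]' are already valid regex
    return "".join(parts)

# First four positions of the pattern from the slide.
regex = re.compile(prosite_to_regex("[LMV]-[RSTV]-W-[DSN]"))

# Two sequences from the alignment, plus one altered sequence (hypothetical).
for seq in ["MTWDNRLAAFAQNYANQRA", "LVWNDELAQIAQVWANQCN", "AAWDNRLAAFAQNYANQRA"]:
    print(seq, bool(regex.match(seq)))
```

A pattern either matches or it does not. It gives no graded score, which is one reason to move on to profiles and PSSMs.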
Sequence logo
PSSM of PR00837A (V5TPXLIKE), 95 sequences.
[Figure: sequence logo over motif positions 1–19; vertical axis in bits, 0–4]
• Height indicates conservation
  (Too many details: Height is the Kullback-Leibler distance to the uniform distribution)
• Symbol height is proportional to frequency
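The two height rules can be sketched in a few lines. This is an illustration under stated assumptions, not the code of any logo tool: the function names are hypothetical, the example columns are made up, and the small-sample correction that real logo software may apply is left out.

```python
import math

def information_content(freqs, alphabet_size):
    """Kullback-Leibler divergence (in bits) from the uniform distribution.

    sum_r p_r * log2(p_r / (1/K)) with K = alphabet_size.
    """
    return sum(p * math.log2(p * alphabet_size)
               for p in freqs.values() if p > 0)

def symbol_heights(freqs, alphabet_size):
    """Height of each symbol: its frequency times the column height."""
    total = information_content(freqs, alphabet_size)
    return {r: p * total for r, p in freqs.items() if p > 0}

# Hypothetical example columns.
conserved = {"W": 1.0}                                   # protein column, K = 20
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # DNA column, K = 4
half_half = {"G": 0.5, "T": 0.5}                         # DNA column, K = 4

print(information_content(conserved, 20))  # log2(20), the maximum for proteins
print(information_content(uniform, 4))     # no conservation at all
print(symbol_heights(half_half, 4))        # two symbols sharing the column
```

The maximum column height is log2 of the alphabet size, which is why a protein logo can rise above 4 bits while a DNA logo tops out at 2 bits.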
Start of translation
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
Profiles
• Multialignments convenient
• Patterns sparse with information
• Logos are pretty pictures!
• Profile: Matrix F with frequency information
  $F_{r,c}$ is the fraction of r in position c

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

[Figure: sequence logo of this profile (WebLogo 3.0b14); vertical axis in bits, 0.0–2.0]
• $F_{r,c} = n_{r,c}/n$, where $n_{r,c}$ is the number of occurrences of r in position c, and n is the sequence count.
• For A in position 1: $n_{A,1} = 12$ and $n = 20$, so $F_{A,1} = 12/20 = 0.6$
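Counting a profile from an alignment could look roughly like this. The alignment `toy` below is a made-up four-sequence example, not the 20 sequences behind the slide's profile (those are not shown), and `profile` is a hypothetical helper name.

```python
from collections import Counter

def profile(alignment, alphabet="ACGT"):
    """Return F[r][c] = n_{r,c} / n for an ungapped alignment."""
    n = len(alignment)               # sequence count
    width = len(alignment[0])        # number of positions
    F = {r: [0.0] * width for r in alphabet}
    for c in range(width):
        counts = Counter(seq[c] for seq in alignment)   # n_{r,c} for this column
        for r in alphabet:
            F[r][c] = counts[r] / n
    return F

# Hypothetical toy alignment: 4 sequences, 3 positions.
toy = ["ACC", "ATC", "GGC", "AGC"]
F = profile(toy)
for r in "ACGT":
    print(r, F[r])
```

Each column of the result sums to 1, as in the profile table.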
• Probability of AACATT being "produced" by the profile:
  $0.6 \times 0.15 \times 1.0 \times 0.2 \times 0.5 \times 0.45 = 0.00405$
• Is that good? Interpretation?
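One possible way to approach the interpretation question is to compare against a background model. The sketch below recomputes the product from the profile table and divides it by the probability of the same sequence under a uniform background of 0.25 per base. The uniform background is an assumption of this sketch, and the names `PROFILE` and `profile_probability` are hypothetical.

```python
import math

# Profile F from the slides (rows: residues, columns: positions 1-6).
PROFILE = {
    "A": [0.6, 0.15, 0.0, 0.2, 0.0, 0.0],
    "C": [0.0, 0.25, 1.0, 0.4, 0.0, 0.55],
    "G": [0.3, 0.25, 0.0, 0.4, 0.5, 0.0],
    "T": [0.1, 0.35, 0.0, 0.0, 0.5, 0.45],
}

def profile_probability(seq, F):
    """Product of the per-position frequencies of the residues in seq."""
    return math.prod(F[r][c] for c, r in enumerate(seq))

p = profile_probability("AACATT", PROFILE)
background = 0.25 ** len("AACATT")   # assumed uniform background model
ratio = p / background

print(p)           # probability under the profile
print(background)  # probability under the background
print(ratio)       # how much better the profile explains the sequence
```

On its own, 0.00405 is hard to judge. The ratio (roughly 16.6 here) says how much more likely the sequence is under the profile than by chance.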
PSSM: Better than profile
• Want a log-odds score!
• PSSM = Position Specific Scoring Matrix
• $M_{r,c} = 10 \log_2(F_{r,c}/\pi_r)$, where $\pi_r$ is the frequency of r in our data.
• Let $\pi_A = \pi_C = \pi_G = \pi_T = 0.25$.
• $M_{A,1} = 10 \log_2(F_{A,1}/0.25) = 10 \log_2(0.6/0.25) = 12.6$
• $M_{C,2} = 10 \log_2(F_{C,2}/0.25) = 10 \log_2(0.25/0.25) = 0$
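The two example entries can be checked with a small function. This is a sketch of the formula above; the name `log_odds` is hypothetical, and mapping a zero frequency to minus infinity is the natural limit of the logarithm rather than something the formula states explicitly.

```python
import math

def log_odds(f, background=0.25, scale=10.0):
    """PSSM entry: scale * log2(f / background); -inf when f is zero."""
    if f == 0:
        return -math.inf
    return scale * math.log2(f / background)

print(round(log_odds(0.6), 1))   # M_{A,1}
print(log_odds(0.25))            # M_{C,2}
print(log_odds(0.0))             # a residue never observed in this position
```

A positive entry means the residue is more common in that position than in the background, a negative entry means it is rarer, and zero means no difference.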
PSSM M from our profile F

Profile:
Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

PSSM:
Pos:  1      2     3     4     5     6
A     12.6   -7.4  −∞    -3.2  −∞    −∞
C     −∞     0.0   20.0  6.8   −∞    11.4
G     2.63   0.0   −∞    6.8   10.0  −∞
T     -13.2  4.9   −∞    −∞    10.0  8.5

Score for AACATT:
$12.6 - 7.4 + 20.0 - 3.2 + 10.0 + 8.5 = 40.5$
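The whole conversion and the scoring step can be sketched together. The names `pssm_from_profile` and `score` are hypothetical, and the uniform background of 0.25 is the value used in the slides.

```python
import math

# Profile F from the slides (rows: residues, columns: positions 1-6).
PROFILE = {
    "A": [0.6, 0.15, 0.0, 0.2, 0.0, 0.0],
    "C": [0.0, 0.25, 1.0, 0.4, 0.0, 0.55],
    "G": [0.3, 0.25, 0.0, 0.4, 0.5, 0.0],
    "T": [0.1, 0.35, 0.0, 0.0, 0.5, 0.45],
}

def pssm_from_profile(F, background=0.25, scale=10.0):
    """Apply M[r][c] = scale * log2(F[r][c] / background) to every entry."""
    return {
        r: [scale * math.log2(f / background) if f > 0 else -math.inf
            for f in row]
        for r, row in F.items()
    }

def score(seq, M):
    """Sum of the PSSM entries selected by the residues of seq."""
    return sum(M[r][c] for c, r in enumerate(seq))

M = pssm_from_profile(PROFILE)
s = score("AACATT", M)
print(round(s, 1))
```

Because the entries are logarithms, adding them corresponds to multiplying the probability ratios, so the score is 10 times the log2 of the profile-to-background ratio for the sequence.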
Generalizing with PSSM?
How do we handle a new variant of a motif?

Pos:  1      2     3     4     5     6
A     12.6   -7.4  −∞    -3.2  −∞    −∞
C     −∞     0.0   20.0  6.8   −∞    11.4
G     2.63   0.0   −∞    6.8   10.0  −∞
T     -13.2  4.9   −∞    −∞    10.0  8.5

Score for ATCTTT?
$12.6 + 4.9 + 20.0 - \infty + 10.0 + 8.5 = 56 - \infty = -\infty$
"Pseudo counts" for profiles
• Idea: Pretend you have seen all possible motifs
• Pseudo counts: $\alpha_r$ is the number of "pseudo observations" of r.
• Include in profile calculations:
  $$F_{r,c} = \frac{n_{r,c} + \alpha_r}{n + \sum_{r'} \alpha_{r'}}$$
• Example 1: Let $\alpha_A = \alpha_C = \alpha_G = \alpha_T = 1$. Then $F_{A,1} = \frac{12+1}{20+4} = 0.54$.
• Example 2: We had $n_{C,1} = 0$. Now $F_{C,1} = \frac{0+1}{20+4} = 0.042$.
• Result: Can use PSSM to find novel motifs
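The effect of pseudo counts on scoring can be sketched as follows. The count matrix `COUNTS` is not given in the slides. It is reconstructed here by multiplying each profile frequency by n = 20, which is an assumption of this sketch, as are the function names.

```python
import math

# Counts n_{r,c}, reconstructed as profile frequency * 20 (assumption).
COUNTS = {
    "A": [12, 3, 0, 4, 0, 0],
    "C": [0, 5, 20, 8, 0, 11],
    "G": [6, 5, 0, 8, 10, 0],
    "T": [2, 7, 0, 0, 10, 9],
}
N_SEQS = 20

def smoothed_profile(counts, n, alpha=1.0):
    """F[r][c] = (n_{r,c} + alpha) / (n + alpha * alphabet size)."""
    total_alpha = alpha * len(counts)
    return {r: [(k + alpha) / (n + total_alpha) for k in row]
            for r, row in counts.items()}

def pssm(F, background=0.25, scale=10.0):
    """Log-odds matrix; every entry is finite because no frequency is zero."""
    return {r: [scale * math.log2(f / background) for f in row]
            for r, row in F.items()}

F = smoothed_profile(COUNTS, N_SEQS)
M = pssm(F)
variant_score = sum(M[r][c] for c, r in enumerate("ATCTTT"))

print(round(F["A"][0], 2))   # Example 1
print(round(F["C"][0], 3))   # Example 2
print(math.isfinite(variant_score))
```

With pseudo counts, an unseen residue costs a large but finite penalty, so a variant such as ATCTTT can still be ranked against other candidates instead of being rejected outright.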
Fast motif searches
• Motifs are small, therefore easy to search with. Fast.
• BLAST variants exist for motifs.
• E-value theory is the same thanks to the log-odds score!
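To see why searching with a small motif is cheap, here is a plain sliding-window scan. This is a naive sketch, not how BLAST or its motif variants work. The PSSM values are copied from the slides, while the target sequence, the threshold, and the function name `scan` are hypothetical.

```python
import math

NEG_INF = -math.inf

# PSSM values as printed on the slides (rows: residues, columns: positions 1-6).
PSSM = {
    "A": [12.6, -7.4, NEG_INF, -3.2, NEG_INF, NEG_INF],
    "C": [NEG_INF, 0.0, 20.0, 6.8, NEG_INF, 11.4],
    "G": [2.63, 0.0, NEG_INF, 6.8, 10.0, NEG_INF],
    "T": [-13.2, 4.9, NEG_INF, NEG_INF, 10.0, 8.5],
}

def scan(sequence, M, threshold):
    """Score every window of motif width; keep (start, score) above threshold."""
    width = len(next(iter(M.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        s = sum(M[r][c] for c, r in enumerate(window))
        if s >= threshold:
            hits.append((start, s))
    return hits

# Hypothetical target sequence containing AACATT at 0-based offset 2.
hits = scan("GGAACATTCC", PSSM, threshold=30.0)
print(hits)
```

The work is proportional to sequence length times motif width, and with a width of 5 to 20 positions that stays small even for long sequences.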
Motif databases
PROSITE: Important binding sites
  "What motifs does my protein have?"
  • Profiles
  • Pattern notation
  • Careful documentation
BLOCKS: Origin of BLOSUM.
  Presents multialignments!
  Assembled from the most conserved parts of domains.
PRINTS: "What motif combinations does my protein have?"