Protein Structure Prediction
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
April 23, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Topics in Bioinformatics
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCTGGTAAAGACGTCAACACCATCAACGTGTCACATCGATGAACTGCTGAACGAAGATATCCTGTTGCTCTGCCATGGGCGATGAAGTTCTCGAGG
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVDELLNEDILILGCSAMGDEVLEESEFEPFIEKVALFGSYGWGDGKWMRDFEERMNGYGPDEAEQDCIEFGKKIANI
Gene (DNA) Function (Protein)
Gene expression & regulation
Microarray data (matrix)
Genomics Proteomics
transcriptomics
Proteomics: Protein Sequence Analysis
• Determine protein sequences (primary structure)
– Indirect: Find genes and then translate them to proteins
– Direct: Mass spectrometry data
• Determine 3-D protein structures (secondary, tertiary, quaternary)
– Computational: Sequence matching, energy minimization, etc.
– Experimental: X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, electron microscopy/diffraction
• Determine protein functions
– Computational: Profile HMMs, protein classification, motif analysis
– Experimental: Wet lab experiments
• Determine protein-protein interactions
– Gene network finding (time series microarray data)
– Metabolic engineering
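The "indirect" route above (find a gene, then translate it) can be illustrated with a minimal sketch. It assumes the standard genetic code, an uppercase DNA string, and the simplest possible gene finder (start at the first ATG, stop at the first in-frame stop codon). The function name `translate_first_orf` is hypothetical, and only the beginning of the DNA string from the "Topics in Bioinformatics" slide is used.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_first_orf(dna):
    """Translate from the first ATG up to (not including) the first in-frame stop codon."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Beginning of the DNA sequence shown on the slide.
dna_prefix = "AATTCATGAAAATCGTATACTGGTCTGGTACCGGCTGAGAAAATGG"
print(translate_first_orf(dna_prefix))  # prints MKIVYWSGTG
```

The ten residues printed agree with the start of the protein sequence on the same slide. In the DNA string as transcribed here, a TGA codon follows immediately, so this naive reader stops there; real gene finding is considerably more involved.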
Basics of Protein Structures…
The Building Blocks (Amino Acids)
The 20 Amino Acids
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands): α-helix, β-sheet, loop/coil
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)
(Adapted from Jaap Heringa’s slide)
Domain and Folds
• A domain is a discrete portion of a protein that is assumed to fold independently of the rest of the protein and to possess its own function.
• Most proteins have multiple domains.
• The core 3D structure of a domain is called a fold. It is estimated that there are only a few thousand possible folds.
Examples of fold classes (CATH architectures)
Protein Structure & Function
sequence
structure
function
medicine
Most functions depend on structures
Structure Prediction Methods
(Adapted from a slide by P. Johansson, E. Jakobsson)
• Homology modeling
– High sequence similarity (> 30% identity)
– Exploit known whole structure
• Fold recognition
– Medium sequence similarity (generally < 30% identity)
– Exploit known partial structures (e.g., known folds, secondary structures)
• Ab initio
– Low sequence similarity
– Use “first principles” (e.g., energy minimization)
First, suppose we have high similarity…
Homology Modeling
• The simplest approach, and a reliable one
• Basis: proteins with similar sequences tend to fold into similar structures
• It has been observed that even proteins with only about 30% sequence identity fold into similar structures
• Does not work well for remote homologs (< 30% pairwise identity)
Homology Modeling (cont.)
• Given:
– A query sequence Q
– A database of known protein structures
• Find protein P such that P has high sequence similarity to Q
– Based on sequence alignment (tuned for protein structure matching, with lower gap penalties)
– HMMs, BLAST, etc.
• Return P’s structure as an approximation to Q’s structure
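The procedure above can be sketched as a template search. This is a toy illustration, not a production method: it assumes a plain Needleman-Wunsch global alignment with unit scores (`match`, `mismatch`, and `gap` are placeholder values, not tuned penalties), measures identity as identical columns divided by alignment length, and uses a hypothetical in-memory `database` of template sequences. Returning the template name stands in for returning P's structure.

```python
def sequence_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b and return the fraction of identical alignment columns."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting columns and identical pairs.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                identical += a[i - 1] == b[j - 1]
                i, j, columns = i - 1, j - 1, columns + 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns if columns else 0.0


def find_template(query, database, min_identity=0.30):
    """Return (name, identity) of the most similar template, or None if none exceeds min_identity."""
    best_name, best_identity = None, 0.0
    for name, sequence in database.items():
        identity = sequence_identity(query, sequence)
        if identity > best_identity:
            best_name, best_identity = name, identity
    if best_name is None or best_identity <= min_identity:
        return None
    return best_name, best_identity


# Hypothetical template sequences, for illustration only.
database = {"template_A": "MKIVYWSGSG", "template_B": "GGGGPPPP"}
print(find_template("MKIVYWSGTG", database))
```

The 30% cutoff mirrors the identity threshold quoted on the slides; tools such as BLAST or profile HMMs would replace the quadratic alignment in practice.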
Now, if we don’t have high similarity, but we have medium similarity…
Threading (Fold Recognition)
• Given:
– Sequence of protein P with unknown structure
– Database of known folds (overall structures)
• Find:
– Most plausible fold for P
– Evaluate quality of such arrangement
• Places the residues of unknown P along the backbone of a known structure and determines stability of side chains in that arrangement
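A heavily simplified sketch of the threading idea follows. It is an assumption-laden toy: each known fold is reduced to a string of per-position environments (`B` for buried, `E` for exposed), the "stability" score simply rewards hydrophobic residues in buried positions and polar residues in exposed ones, and the sequence is slid along the template without gaps. The hydrophobic set, the scoring values, and the fold library are all hypothetical; real threading programs use gapped alignment and statistical pair potentials.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residues for this toy score


def thread_score(sequence, environment):
    """+1 for each residue whose hydrophobicity suits its position, -1 otherwise."""
    score = 0
    for amino_acid, env in zip(sequence, environment):
        buried = env == "B"
        score += 1 if (amino_acid in HYDROPHOBIC) == buried else -1
    return score


def best_fold(sequence, fold_library):
    """Thread the sequence (gapless) onto every template and return (best name, all scores)."""
    scores = {}
    for name, environment in fold_library.items():
        offsets = range(len(environment) - len(sequence) + 1)
        candidates = [
            thread_score(sequence, environment[o:o + len(sequence)]) for o in offsets
        ]
        if candidates:  # templates shorter than the sequence are skipped
            scores[name] = max(candidates)
    if not scores:
        raise ValueError("no template is long enough for this sequence")
    return max(scores, key=scores.get), scores


# Hypothetical environment strings for two folds.
fold_library = {"fold_1": "BEEBBEEB", "fold_2": "BEBEBEBE"}
winner, scores = best_fold("LKELLKKL", fold_library)
print(winner, scores)
```

The score per template plays the role of "evaluate quality of such arrangement"; the fold with the highest score is reported as the most plausible one.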
What if we have really low similarity?
Secondary Structure Prediction
• Given an amino acid sequence
• Predict a secondary structure state (α, β, coil) for each residue in the sequence
• Secondary structures can help
– Determine 3D structures (e.g., help threading)
– Provide insights about functions
• Evaluation: Q3 = percentage of correct assignments
• Accuracy
– 64%–75% based on primary sequence only (recent methods perform better)
– Higher accuracy for α-helices than for β-strands
– Accuracy is dependent on protein family
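As a small illustration of the Q3 measure, the sketch below compares a predicted state string with an observed one, residue by residue. The three-letter alphabet (H for helix, E for strand, C for coil) and the example strings are assumptions made for the illustration.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, E, or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up example: 12 of 14 residues are assigned correctly.
observed = "CHHHHHHCCEEEEC"
predicted = "CCHHHHHCCEEECC"
print(round(q3(predicted, observed), 1))  # prints 85.7
```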
Typical Secondary Structure Prediction Results
Secondary Structure Prediction Methods
• Early approaches (Chou and Fasman 1978)
– Make a prediction for a given residue by considering a window of n (13–21) neighboring residues
– Learn a model that maps a window of residues to a secondary structure state
• Later methods utilize evolutionary information (e.g., the PHD system (Rost & Sander, 1993)) and consider related sequences when making a prediction
• Most recent approaches: neural networks (e.g., PSIPRED, 77%) built on PSI-BLAST profiles (Altschul et al., 1997)
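The window idea shared by these methods can be sketched as follows. The function name `residue_windows` and the `-` padding symbol for positions beyond the chain ends are assumptions of this sketch; each returned window would be the input from which a model predicts the state of the central residue.

```python
def residue_windows(sequence, n=13, pad="-"):
    """Return one window of n residues per position, centered on that position."""
    if n % 2 == 0:
        raise ValueError("window size must be odd so that it has a central residue")
    half = n // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + n] for i in range(len(sequence))]


windows = residue_windows("TSPTAELMRSTG", n=5)
print(windows[0])  # prints --TSP
print(windows[5])  # prints TAELM
```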
Chou-Fasman Method
• Developed by Chou & Fasman in 1974 and 1978
• Based on frequencies of residues in α-helices, β-sheets and turns
• Assumptions:
– The entire information for forming secondary structure is contained in the primary sequence
– Side groups of residues will determine structure
– Examining windows of 13–17 residues is sufficient to predict structure
– Basis for window size selection: α-helices are 5–40 residues long; β-strands are 5–10 residues long
• Accuracy ~50–60% Q3
Chou-Fasman Pij-values
Name            P(H)  P(E)  P(turn)
Alanine          142    83     66
Arginine          98    93     95
Aspartic Acid    101    54    146
Asparagine        67    89    156
Cysteine          70   119    119
Glutamic Acid    151    37     74
Glutamine        111   110     98
Glycine           57    75    156
Histidine        100    87     95
Isoleucine       108   160     47
Leucine          121   130     59
Lysine           114    74    101
Methionine       145   105     60
Phenylalanine    113   138     60
Proline           57    55    152
Serine            77    75    143
Threonine         83   119     96
Tryptophan       108   137     96
Tyrosine          69   147    114
Valine           106   170     50
Values indicate how likely an amino acid is to occur in one secondary structure as opposed to the others (values above 100 indicate a preference for that structure)
Improved Chou-Fasman
1. Assign all of the residues the appropriate set of parameters
2. Identify α-helix and β-sheet regions. Extend the regions in both directions.
3. If structures overlap compare average values for P(H) and P(E) and assign secondary structure based on best scores.
4. Turns are modeled as tetrapeptides using 2 different probability values.
Assign Pij values
1. Assign all of the residues the appropriate set of parameters
            T    S    P    T    A    E    L    M    R    S    T    G
P(H)       69   77   57   69  142  151  121  145   98   77   69   57
P(E)      147   75   55  147   83   37  130  105   93   75  147   75
P(turn)   114  143  152  114   66   74   59   60   95  143  114  156
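The assignment step amounts to a table lookup. The sketch below encodes the Pij table from the earlier slide under standard one-letter amino acid codes (the one-letter keys and the name `assign_parameters` are additions of this sketch). Note that, with that table, threonine (T) maps to 83/119/96, whereas the worked example above lists 69/147/114 under T, which are the values the table gives for tyrosine.

```python
# (P(H), P(E), P(turn)) per residue, keyed by one-letter code.
CHOU_FASMAN = {
    "A": (142, 83, 66),   "R": (98, 93, 95),    "D": (101, 54, 146),
    "N": (67, 89, 156),   "C": (70, 119, 119),  "E": (151, 37, 74),
    "Q": (111, 110, 98),  "G": (57, 75, 156),   "H": (100, 87, 95),
    "I": (108, 160, 47),  "L": (121, 130, 59),  "K": (114, 74, 101),
    "M": (145, 105, 60),  "F": (113, 138, 60),  "P": (57, 55, 152),
    "S": (77, 75, 143),   "T": (83, 119, 96),   "W": (108, 137, 96),
    "Y": (69, 147, 114),  "V": (106, 170, 50),
}


def assign_parameters(peptide):
    """Return the (P(H), P(E), P(turn)) triple for every residue of the peptide."""
    return [CHOU_FASMAN[amino_acid] for amino_acid in peptide]


for amino_acid, (p_h, p_e, p_turn) in zip("AELM", assign_parameters("AELM")):
    print(amino_acid, p_h, p_e, p_turn)
```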
Scan peptide for helix regions
2. Identify regions where 4 of 6 contiguous residues have P(H) > 100 (an “α-helix nucleus”)
            T    S    P    T    A    E    L    M    R    S    T    G
P(H)       69   77   57   69  142  151  121  145   98   77   69   57
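The nucleus scan is a sliding-window count. The sketch below uses the P(H) row exactly as printed in the example; `find_nuclei` is a hypothetical helper name, and window start positions are 0-based.

```python
def find_nuclei(values, window, min_hits, threshold=100):
    """Return start indices of windows in which at least min_hits values exceed the threshold."""
    starts = []
    for start in range(len(values) - window + 1):
        hits = sum(v > threshold for v in values[start:start + window])
        if hits >= min_hits:
            starts.append(start)
    return starts


# P(H) row of the worked example (peptide T S P T A E L M R S T G).
p_h = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]
helix_nuclei = find_nuclei(p_h, window=6, min_hits=4)
print(helix_nuclei)  # prints [2, 3, 4]
```

The three overlapping windows all contain the four residues A, E, L, and M, whose P(H) values are above 100.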
Extend α-helix nucleus
3. Extend helix in both directions until a set of four residues have an average P(H) < 100.
            T    S    P    T    A    E    L    M    R    S    T    G
P(H)       69   77   57   69  142  151  121  145   98   77   69   57
Repeat steps 1 – 3 for entire peptide
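One possible reading of the extension step is sketched below. The slide does not say exactly which four residues are averaged, so this sketch assumes the probe consists of the candidate residue plus the three residues next to it inside the current region; other implementations probe four residues lying entirely outside the region and will give different boundaries. The function name and the half-open `(start, end)` convention are also assumptions.

```python
def extend_region(values, start, end, probe=4, threshold=100):
    """Grow the half-open region [start, end) while the probe average stays >= threshold.

    The region is assumed to be at least `probe` residues long, as a nucleus is.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    # Extend to the right: probe = candidate residue `end` plus the 3 residues before it.
    while end < len(values) and mean(values[end - probe + 1:end + 1]) >= threshold:
        end += 1
    # Extend to the left: probe = candidate residue `start - 1` plus the 3 residues after it.
    while start > 0 and mean(values[start - 1:start - 1 + probe]) >= threshold:
        start -= 1
    return start, end


p_h = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]
peptide = "TSPTAELMRSTG"
start, end = extend_region(p_h, 2, 8)  # nucleus covering residues 2..7
print(start, end, peptide[start:end])  # prints 2 10 PTAELMRS
```

Under this reading the helix grows by two residues to the right (probe averages 128.75 and 110.25) and stops when the next probe averages 97.25; it does not grow to the left, where the probe averages 86.25.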
Scan peptide for β-sheet regions
4. Identify regions where 3 of 5 contiguous residues have P(E) > 100 (a “β-sheet nucleus”)
5. Extend the β-sheet until four contiguous residues have an average P(E) < 100
6. If the region’s average P(E) > 105 and the average P(E) > the average P(H), then assign “β-sheet”
            T    S    P    T    A    E    L    M    R    S    T    G
P(H)       69   77   57   69  142  151  121  145   98   77   69   57
P(E)      147   75   55  147   83   37  130  105   93   75  147   75
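Steps 4 and 6 can be sketched with the rows exactly as printed above. For brevity the sketch skips the extension in step 5 and applies the rule of step 6 directly to each 5-residue nucleus; `find_nuclei` and `is_sheet` are hypothetical helper names, and indices are 0-based.

```python
def find_nuclei(values, window, min_hits, threshold=100):
    """Return start indices of windows in which at least min_hits values exceed the threshold."""
    return [
        start
        for start in range(len(values) - window + 1)
        if sum(v > threshold for v in values[start:start + window]) >= min_hits
    ]


def mean(xs):
    return sum(xs) / len(xs)


def is_sheet(p_e_region, p_h_region):
    """Step 6: average P(E) must exceed 105 and also exceed the average P(H)."""
    return mean(p_e_region) > 105 and mean(p_e_region) > mean(p_h_region)


p_h = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]
p_e = [147, 75, 55, 147, 83, 37, 130, 105, 93, 75, 147, 75]

sheet_nuclei = find_nuclei(p_e, window=5, min_hits=3)
print(sheet_nuclei)  # prints [3, 6]
for start in sheet_nuclei:
    region = slice(start, start + 5)
    print(start, mean(p_e[region]), mean(p_h[region]), is_sheet(p_e[region], p_h[region]))
```

With these numbers the nucleus starting at index 3 averages 100.4 for P(E) against 125.6 for P(H), so it fails step 6, while the nucleus starting at index 6 averages 110 against 102 and passes.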
Visit
http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
Neural Network Predictors
• All current state of the art methods for secondary structure prediction (except consensus methods) employ neural network classifiers.
• (Large) data sets are used to train the neural net
• A sequence window centered on the amino acid to predict is presented to the classifier
• Homologous sequences (e.g., a PSI-BLAST profile) are used to augment prediction capability
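To make the classifier input concrete, the sketch below one-hot encodes a sequence window and passes it through a single linear layer with a softmax over three states. Everything here is illustrative: the 21-symbol alphabet (20 amino acids plus `-` for padding), the state labels, and the random, untrained weights are assumptions, so the resulting probabilities carry no meaning. A profile-based predictor would replace each one-hot column with position-specific scores derived from homologous sequences.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a padding symbol
STATES = ("H", "E", "C")            # helix, strand, coil


def one_hot(window):
    """Concatenate one 21-dimensional indicator vector per window position."""
    vector = []
    for symbol in window:
        column = [0.0] * len(ALPHABET)
        column[ALPHABET.index(symbol)] = 1.0
        vector.extend(column)
    return vector


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_state(window, weights, biases):
    """Single linear layer plus softmax: returns a probability per state."""
    x = one_hot(window)
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return dict(zip(STATES, softmax(logits)))


rng = random.Random(0)
window = "--TSPTAELMRST"  # 13-residue window centered on the fifth residue of TSPTAELMRSTG
# Untrained placeholder weights; a real predictor learns these from large data sets.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(len(window) * len(ALPHABET))] for _ in STATES]
biases = [0.0, 0.0, 0.0]
probabilities = predict_state(window, weights, biases)
print(probabilities)
```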
What about exploiting physical principles?
Ab Initio Prediction
Solve a complex optimization problem:
- Measure “goodness” based on energy, etc.
- Randomly start with some conformation
- Heuristically propose a next conformation
- Search for the best conformation
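The four steps above can be sketched on a deliberately tiny model. The sketch assumes a 2D lattice "HP" model (each residue is hydrophobic H or polar P, and the energy is minus the number of H-H lattice contacts between residues that are not chain neighbors) and a Metropolis-style random search. Neither the model nor the parameters (`steps`, `temperature`, `seed`) come from the slides; methods such as Rosetta use far richer energy functions and move sets.

```python
import math
import random

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def coords(moves):
    """Lattice coordinates of each residue, or None if the chain intersects itself."""
    positions = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        nxt = (positions[-1][0] + dx, positions[-1][1] + dy)
        if nxt in positions:
            return None
        positions.append(nxt)
    return positions


def energy(sequence, positions):
    """Goodness measure: -1 per pair of H residues adjacent on the lattice but not in the chain."""
    total = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == sequence[j] == "H":
                if abs(positions[i][0] - positions[j][0]) + abs(positions[i][1] - positions[j][1]) == 1:
                    total -= 1
    return total


def fold(sequence, steps=5000, temperature=1.0, seed=0):
    rng = random.Random(seed)
    # Randomly start with some (self-avoiding) conformation.
    while True:
        moves = [rng.choice("UDLR") for _ in range(len(sequence) - 1)]
        positions = coords(moves)
        if positions is not None:
            break
    current = energy(sequence, positions)
    best_moves, best = moves[:], current
    for _ in range(steps):
        # Heuristically propose a next conformation: change one move at random.
        candidate = moves[:]
        candidate[rng.randrange(len(candidate))] = rng.choice("UDLR")
        candidate_positions = coords(candidate)
        if candidate_positions is None:
            continue
        candidate_energy = energy(sequence, candidate_positions)
        # Metropolis rule: always accept improvements, sometimes accept worse conformations.
        if candidate_energy <= current or rng.random() < math.exp((current - candidate_energy) / temperature):
            moves, current = candidate, candidate_energy
            if current < best:
                best_moves, best = moves[:], current
    return best_moves, best


sequence = "HPHPPHHPHH"
best_moves, best_energy = fold(sequence)
print("".join(best_moves), best_energy)
```

The search keeps the lowest-energy conformation seen so far; occasionally accepting a worse conformation is what lets it escape local minima.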
Best so far…
http://depts.washington.edu/bakerpg/
Using Rosetta for Ab Initio Structure Prediction in the Fourth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP4)
Group of David Baker, Univ. of Washington
Visit their website and read the paper if you are interested…