protein structure analysis - ii

Protein Structure Analysis - II

Liangjiang (LJ) Wang

[email protected]

April 10, 2005

PLPTH 890 Introduction to Genomic Bioinformatics

Lecture 23

Outline• Protein structure alignment (DALI and

VAST).

• Protein secondary structure prediction (PHDsec, PSIPRED, etc).

• Prediction of 3-D protein structures:

– Homology modeling.

– Threading.

– Ab initio prediction.

• Protein structural genomics.

Protein Structure Comparison• Why is structure comparison important?

– To understand structure-function relationship.– To study the evolution of many key proteins

(structure is more conserved than sequence).

• Comparing 3-D structures is much more difficult than sequence comparison.

• Protein structure classification:– SCOP: Structure Classification Of Proteins.– CATH: Class, Architecture, Topology and

Homology.

• Protein structure alignment: DALI and VAST.

Protein Structure Alignment• Positions of atoms in two or more 3-D protein

structures are compared.

• Must first determine which atoms to align. At least two sets of three common reference points should be identified.

• Atoms in structures are matched to minimize the average deviation.

• Computers are NOT good at comparing 3-D objects (an NP-hard problem).

(Baxevanis and Ouellette, 2005)

How to Compare Structures?

Structure 1 Structure 2

Description 1 Description 2

Scores

Similarity, classification

Feature extraction

Comparison

Statistical analysis

DALI• DALI is for Distance matrix ALIgnment.

• Each structure is represented as a two-dimensional array (matrix) of distances between all pairs of C atoms.– Remember what a C atom is?

• Assume that similar 3-D structures have similar inter-residue distances.

• DALI uses distance matrices to align protein structures.

• DALI is available at http://www.ebi.ac.uk/dali/.

VAST• VAST is for Vector Alignment Search Tool.

• Each structure is represented as a set of secondary structure elements (SSEs).– SSEs: helices or strands.

• VAST scores pairs of SSEs based on their type, orientation and connectivity.

• The SSE matches of statistical significance are then extended (similar to BLAST).

• Structures in MMDB have been pre-computed, and organized as structure neighbors in Entrez.

• VAST can be accessed at http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml.

Secondary Structure Prediction• Given the sequence of a polypeptide,

secondary structures are predicted.

• Assume that secondary structures are fully determined by local interactions among neighboring residues.

• Early analysis were based on the frequencies of amino acid found in different types of secondary structures.– For example, proline occurs at turns, but not in helices.

• Modern approaches use machine learning techniques and multiple sequence alignments.

Machine Learning Approach

QEALDAAGDKLVVVDF

HHHHHHLLLLEEEEEE

H – HelixE – SheetL – Loop

Training Dataset Test Dataset

Classifier (Model)

Performance?

Training

Testing

YesPrediction

No

PHDsec• For a given protein sequence:

– Search for homologous sequences.– Produce a multiple sequence alignment.– Generate a profile (evolutionary information).

• PHDsec uses a feed-forward artificial neural network to predict the secondary structures.

RAP SS K Y

EH L

Input layer

Hidden layer

Output layer

(PHDsec can be accessed at http://www.predictprotein.org/)

http://www.predictprotein.org/



PSIPRED

• For a given protein sequence:

– Perform a PSI-BLAST search.

– Create a profile that conveys the evolutionary information at each position.

– Feed the profile into a system of neural networks (or support vector machines).

• PSIPRED can be accessed at http://bioinf.cs.ucl.ac.uk/psipred/.

http://bioinf.cs.ucl.ac.uk/psipred/



How to Evaluate the Performance?• EVA: an independent server for evaluation

of protein structure prediction methods.

• The best tool for three-stateper-residuesecondary structure predictionnow reachesthe accuracyof about 78%.

(http://cubic.bioc.columbia.edu/eva/)

http://cubic.bioc.columbia.edu/eva/



Prediction of 3-D Protein Structures• There are about 30,000 structures in PDB, but

more than 1.8 million non-redundant protein sequences in UniProt (Swiss-Prot + TrEMBL).

• Computational structure prediction may provide valuable information for most of the protein sequences derived from genome sequencing projects.

• Three predictive methods:

– Homology (or comparative) modeling.

– Threading (or fold recognition).

– Ab initio structure prediction.

Sequence - Structure Relationship• In cells, protein folding is determined by the

amino acid sequence. But, protein structures can also be affected by post-translational modifications and the cellular environment.

• Proteins with ≥ 30% sequence identity tend to have similar structures. However, exceptions do exist …

(Viral capsid protein, 1PIV:1) (Glycosyltransferase, 1HMP:A)

80-residue stretch (yellow) with 40% sequence identity

(Bourne, 2004)

Homology Modeling • Probably the most accurate method for

protein structure prediction.

• Five different steps:– Find a known structure related to the query

sequence by sequence comparison.– Align the query sequence with the known

structure (template).– Build a model by modifying the backbone and

side chains of the template.– Refine the model using energy minimization.– Validate the model using visual inspection or

software tools.

Homology Modeling (Cont’d) • Accuracy of structure prediction depends on

the percent amino acid sequence identity shared between the query and template.

• For >50% sequence identity, RMSD (Root Mean Square Deviation) is only 1 Å for main-chain atoms, which is comparable to the accuracy of a medium-resolution NMR structure or a low-resolution X-ray structure.

• Homology modeling may not be used for predicting protein structures if the sequence identity is less than 30%.

Homology Modeling Servers• SWISS-MODEL (http://swissmodel.expasy.org/): A

popular site for structure homology modeling.

• SDSC1 (http://cl.sdsc.edu/hm.html): the #1 ranked server for homology modeling on the EVA site.

SDSC1


http://swissmodel.expasy.org/

http://cl.sdsc.edu/hm.html

Threading (Baxevanis and Ouellette, 2005)

Threading (Cont’d)• Threading takes a query sequence and passes

(threads) it through the 3-D structure of each protein in a fold database (known structures).

• As a sequence is threaded, the fit of the sequence in the fold is evaluated using some functions of energy or packing efficiency.

• Threading may find a common fold for proteins with essentially no sequence homology.

• Structures predicted from threading techniques often are not of high quality (RMSD > 3 Å).

• Based on EVA results, 3D-PSSM is the best threading server (http://www.sbg.bio.ic.ac.uk/~3dpssm/).

Ab Initio Structure Prediction• Ab initio prediction can be used when a protein

sequence has no detectable homologues in PDB.

• Protein folding is modeled based on global free-energy minimization.

• Since the protein folding problem has not yet been solved, the ab initio prediction methods are still experimental and can be quite unreliable.

• One of the top ab initio prediction methods is called Rosetta, which was found to be able to successfully predict 61% of structures (80 of 131) within 6.0 Å RMSD (Bonneau et al., 2002).

• The HMMSTR/Rosetta Server can be accessed at http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php.

Comparing Structure Prediction Methods

A – C: homology modeling with 60% (A), 40% (B) and 30% (C) sequence identity.

D and E: ab initio protein structure prediction.

Predicted structures are in red, and actual structures are in blue.

(Baker and Sali, 2000)

Example: Cysteine-Rich Peptides

C C C C

C

Signal helix and cleavage site

C C C

NCR: Nodule-specific Cysteine Rich genes in legumes.Avr9: fungal avirulence protein from Cladosporium fulvum.Defensin: antimicrobial peptides.Proteinase inhibitor: Serine proteinase inhibitors.SCR6: S-locus of Brassica, SI, interact with SRK6.

Ab Initio Prediction of Cys Rich Peptides

LSG-TC51151 PsENOD3

Defensin (AAG40321, M. sativa) Avr9 (Cladosporium fulvum)

Protein Structural Genomics • A worldwide initiative aimed at determining a

large number of protein structures in a high throughput mode.

• In the US, nine structural genomics centers have been funded by the National Institutes of Health (NIH).

• More information may be found at http://www.rcsb.org/pdb/strucgen.html.

• TargetDB (http://targetdb.pdb.org/): a centralized registration database for target sequences from the worldwide structural genomics projects.

A Target Selection Pipeline from JCSG

Methods TMHMM

Protein size(7 - 80 kDa) Low complexity Redundancy

BLAST against PDB sequences

Summary• Fast and accurate structure alignment is still

a very hard problem to be solved.

• Machine learning techniques are widely used in protein secondary structure prediction.

• Homology modeling is probably the most reliable method for structure prediction.

• The protein folding problem has not yet been solved.

Prediction of Solvent Accessibility• Solvent accessibility: the relative area of a

residue’s surface that is exposed to the surrounding solvent.

• The solvent-accessible residues may be part of an active site or a binding site, while the buried residues may play an important role in stabilizing the protein structure.

• PHDacc (http://www.predictprotein.org/): a neural network-based method (similar to PHDsec).

• Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/): a neural network system that predicts both secondary structure and solvent accessibility.




http://www.compbio.dundee.ac.uk/~www-jpred/

http://www.compbio.dundee.ac.uk/~www-jpred/

Predicting Transmembrane Segments• Transmembrane segments share common

biophysical features (e.g., hydrophobicity).

• PHDhtm (http://www.predictprotein.org/):– Part of the PredictProtein services.– Transmembrane helices are predicted using a

neural network system.

• TMHMM (http://www.cbs.dtu.dk/services/TMHMM/):– A set of known transmembrane segments are

represented as HMMs.– A query sequence is matched to a known

transmembrane pattern.




http://www.cbs.dtu.dk/services/TMHMM/



Signal Peptide Prediction• Extracellular proteins or proteins targeted to

subcellular compartments contain short signal peptides (often at the N-terminal).

• PSORT (http://psort.ims.u-tokyo.ac.jp/): A rule-based expert system for predicting subcellular localization of proteins from their amino acid sequences. The algorithm of k-nearest neighbors is used for reasoning.

• SignalP (http://www.cbs.dtu.dk/services/SignalP/): predicts the presence and location of signal peptide cleavage sites using a combination of neural networks and HMMs.

http://psort.ims.u-tokyo.ac.jp/

http://www.cbs.dtu.dk/services/SignalP/

protein structure analysis - ii

Documents

d protein structures

structure neighbors

structure predictiongiven

d structures

given protein sequence

structure comparison

key proteins structure

protein structural genomics