bioinformatics of proteins: sequence, structure and the ‘symbiosis’ between them

Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them Maya Schushan The Ben-Tal lab



TRANSCRIPT

Page 1: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Maya Schushan
The Ben-Tal lab

Page 2: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Page 3: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

OUTLINE

• Sequence: Databases, domains, motifs & annotations

• Structure: Secondary structure, structure databases, visualization and identification of functional sites

Page 4: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

UniProt

• UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).

• In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.

Sequences, domains, motifs & annotations

Page 5: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

UniProt

• The world's most comprehensive catalog of information on proteins

• Sequence, function & more…

• Composed mainly of two databases:

– SwissProt: 412525 protein entries now (366226 last year). High-quality annotation, non-redundant and cross-referenced to many other databases.

– TrEMBL: 7341751 protein entries now (5708298 last year). Computer translation of the genetic information in the EMBL Nucleotide Sequence Database. Many proteins are poorly annotated, since only automatic annotation is generated.

Page 6: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

UniProt

• The annotation includes:
– Function(s) of the protein;
– Posttranslational modification(s), such as carbohydrates, phosphorylation, acetylation and GPI-anchor;
– Domains and sites, for example calcium-binding regions, ATP-binding sites, zinc fingers and homeoboxes;
– Secondary structure, e.g. alpha helix, beta sheet;
– Quaternary structure, e.g. homodimer, heterotrimer, etc.;
– Similarities to other proteins;
– Disease(s) associated with any number of deficiencies in the protein;
– Sequence conflicts, variants, etc.

Sequences, domains, motifs & annotations

Page 7: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

UniProt

• Connected to many other databases (e.g. Pfam, PROSITE, EC, GO, PdbSum, PDB (to be discussed…))

• Each sequence has a unique 6-character accession made of letters and digits (e.g. P05102)

• Entries in SwissProt also have IDs, which are usually meaningful (e.g. CADH1_HUMAN for a human cadherin)

• Sequences can be downloaded in FASTA format
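
As a rough illustration of where the accession and the ID end up in a downloaded file, the sketch below parses a UniProt-style FASTA header of the form `>db|accession|ID description`. The function name `parse_uniprot_fasta` is hypothetical, and the sequence in the usage example is a made-up placeholder, not the real sequence of P05102.

```python
def parse_uniprot_fasta(text):
    """Parse UniProt-style FASTA text into a list of record dicts.

    Assumes each header looks like '>db|accession|ENTRY_NAME description',
    where db is 'sp' (SwissProt) or 'tr' (TrEMBL).
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            ident, _, description = line[1:].partition(" ")
            db, accession, entry_name = ident.split("|")
            # SwissProt IDs read as PROTEIN_SPECIES, e.g. CADH1_HUMAN
            protein, _, species = entry_name.partition("_")
            current = {
                "db": db,
                "accession": accession,
                "entry_name": entry_name,
                "protein": protein,
                "species": species,
                "description": description,
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence lines may be wrapped; join them back together
            current["sequence"] += line
    return records


# Placeholder residues only; this is NOT the real P05102 sequence.
example = """>sp|P05102|MTH1_HAEPH placeholder description
MKTAYIAKQR
QISFVKSHFS
"""
record = parse_uniprot_fasta(example)[0]
print(record["accession"], record["protein"], record["species"], len(record["sequence"]))
```

Splitting the ID at the underscore is what makes names such as CADH1_HUMAN readable at a glance: the left part names the protein and the right part names the species.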

Sequences, domains, motifs & annotations

Page 8: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

UniProt: http://www.uniprot.org/

Type the accession (P05102) or the ID (MTH1_HAEPH)

Sequences, domains, motifs & annotations

Page 9: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Page 10: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

General data: name, origin, EC (enzymatic reaction)…

Sequences, domains, motifs & annotations

Page 11: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Scroll down to find the sequence & download the FASTA

Functional data, including the GO annotations

Sequences, domains, motifs & annotations

Page 12: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Known sites, predicted/known secondary structures, natural variation or mutagenesis

Sequences, domains, motifs & annotations

Page 13: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

The protein’s sequence in FASTA format

Download

Send to BLAST

Sequences, domains, motifs & annotations

Page 14: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

References for all the information on the page; it is important to take a look at them…

Sequences, domains, motifs & annotations

Page 15: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Connections to other databases

Other sequence databases, e.g. GenBank

Related structures in the PDB (if available)

Model structure in the ModBase database (automatically derived!)

All sorts of domain/motif databases: the family related to the entry

Sequences, domains, motifs & annotations

Page 16: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

• Proteins are generally composed of one or more functional regions, commonly termed domains.

• Different combinations of domains give rise to the diverse range of proteins found in nature.

• The identification of domains that occur within proteins can therefore provide insights into their function.

Page 17: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

• The Pfam database is a large collection of protein domain families.

• Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

• Pfam entries are classified in one of four ways:
– Family: A collection of related proteins
– Domain: A structural unit which can be found in multiple protein contexts
– Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
– Motif: A short unit found outside globular domains

Page 18: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

There are two components to Pfam:

• Pfam-A entries are high-quality, manually curated families. They cover a large proportion of the sequences in the sequence database.

• Pfam-B entries are generated automatically. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

• Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

Page 19: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

Pfam (http://pfam.sanger.ac.uk/) allows you to:

• Analyze your protein sequence for Pfam matches

• View Pfam family annotation and alignments

• See groups of related families

• Look at the domain organization of a protein sequence

• Find the domains on a PDB structure

• Query Pfam by keyword

Page 20: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

Searching for a certain protein accession

Page 21: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

Searching for a certain protein accession

Page 22: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Pfam- domain database

Page 23: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Page 24: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Classifying protein function

• Each protein performs one (or more…) specific functions, e.g. catalysis of a specific enzymatic reaction, transport of an ion, interaction with a DNA molecule, etc.

• To make these functions easy to refer to, attempts have been made to enumerate and classify the various functions performed by proteins.

Page 25: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Classifying protein function

Example: some of the diverse functions exhibited by membrane proteins.

Page 26: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Enzyme Commission number (EC number)

• A numerical classification scheme for enzymes, based on the chemical reactions they catalyze

• EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same EC number.

• By contrast, the UniProt database identifiers uniquely specify a protein by its amino acid sequence.

Page 27: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Enzyme Commission number (EC number)

• Every enzyme code consists of the letters "EC" followed by four numbers separated by periods. Those numbers represent a progressively finer classification of the enzyme.

• For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4":
– EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
– EC 3.4 are hydrolases that act on peptide bonds
– EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
– EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide

Page 28: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Enzyme Commission number (EC number)

• For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", as shown for an enzyme from Lactobacillus helveticus in BRENDA, the Comprehensive Enzyme Information System:

Page 29: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Enzyme Commission number (EC number)

• EC 1 - Oxidoreductases
• EC 2 - Transferases
• EC 3 - Hydrolases
• EC 4 - Lyases
• EC 5 - Isomerases
• EC 6 - Ligases
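
The progressively finer structure of an EC number can be made concrete with a short sketch. It splits a code such as "EC 3.4.11.4" into its four nested levels and looks up the top-level class among the six classes listed on this slide. The names `EC_CLASSES`, `ec_levels` and `ec_top_class` are illustrative only and do not belong to any database API.

```python
# Top-level classes exactly as listed on the slide
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def _ec_fields(ec_number):
    """Strip the 'EC' prefix and return the four dot-separated fields."""
    fields = ec_number.replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields, got {ec_number!r}")
    return fields


def ec_levels(ec_number):
    """Return the nested prefixes, from the broadest class to the full code."""
    fields = _ec_fields(ec_number)
    return ["EC " + ".".join(fields[:depth]) for depth in range(1, 5)]


def ec_top_class(ec_number):
    """Name of the top-level class the code belongs to."""
    return EC_CLASSES[int(_ec_fields(ec_number)[0])]


print(ec_levels("EC 3.4.11.4"))
print(ec_top_class("EC 3.4.11.4"))
```

Because the code describes a reaction rather than a protein, two unrelated enzymes that catalyze the same reaction would return identical values from both functions.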

Page 30: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology

• A collaborative effort to address the need for consistent descriptions of gene products in different databases

• The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

• The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels.

Page 31: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology: Cellular component

A cellular component is just that, a component of a cell, with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

Page 32: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations


Page 33: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology: Biological process

A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions.

Examples of biological process terms are signal transduction or pyrimidine metabolism.

It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

Page 34: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology: Molecular function

Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level.

Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products.

Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

Page 35: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology: Topology

The ontologies are in the form of directed acyclic graphs (DAGs), with the graph nodes being GO terms.

The ontologies are hierarchically structured, and a more specialized term (child) can be related to more than one less specialized term (parent).

E.g. the biological process term hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process, because a biosynthetic process is a type of metabolic process and a hexose is a type of monosaccharide. When any gene is annotated to hexose biosynthetic process, it is automatically annotated to both hexose metabolic process and monosaccharide biosynthetic process.
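
The automatic annotation of parent terms can be sketched as a walk up the graph. The dictionary below holds only the three terms from the hexose example; a real ontology has many more terms and edges, and the names `PARENTS` and `implied_terms` are illustrative assumptions, not part of any GO tool.

```python
# child term -> list of parent terms (only the edges from the example above)
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": [],
    "monosaccharide biosynthetic process": [],
}


def implied_terms(term, parents=PARENTS):
    """Return the term together with every ancestor reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            # A term reached through two different parents is visited once
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


for name in sorted(implied_terms("hexose biosynthetic process")):
    print(name)
```

The `seen` set matters precisely because the structure is a DAG rather than a tree: with multiple parents, the same ancestor can be reached along more than one path.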

Page 36: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology Example

Page 37: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Gene Ontology Interface

Search by gene or protein accession

http://www.geneontology.org/

Page 38: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Sequences, domains, motifs & annotations

Summary of the first part- protein sequence databases and tools

• UniProt: the most comprehensive protein sequence database, connected to many other databases and resources.

• Pfam: a domain database. There are many others, e.g. InterPro, PROSITE and BLOCKS.

• EC and GO classifications of protein function

Page 39: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

OUTLINE

• Sequence: Databases, domains, motifs & annotations

• Structure: Secondary structure, structure databases, visualization and identification of functional sites

Page 40: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

• All information about the native structure of a protein is encoded in the amino acid sequence together with its native solution environment.

• Many conformations are possible, yet each protein exhibits only one or a few native folds (Levinthal’s paradox)

• Protein folding is driven by various forces:
– Ionic forces
– Hydrogen bonds
– The hydrophobic effect
– . . .

From Sequence to Structure

Investigating & visualizing protein structures

Page 41: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

Secondary Structure Prediction

Why predict secondary structures of proteins?

1) When the structure of the protein is still unknown. This can serve as the first step of structure prediction: first predict the secondary structures, then how they are arranged together.

2) For calculating better multiple sequence alignments or pairwise alignments.

Page 42: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Predicting 2° Structure

Each amino acid has a different propensity for being in each 2° structure.

For example, proline causes a kink which destroys the helix structure. Thus, proline is usually found only at the helix end.

The different structures also have typical lengths.

Investigating & visualizing protein structures

Page 43: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

http://www.predictprotein.org/

Predicting 2° Structure

Investigating & visualizing protein structures

Page 44: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Predicting 2° Structure

All these and more…

Investigating & visualizing protein structures

Page 45: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Input: Sequence

Output: Secondary structure prediction, globular regions, coiled-coil regions, transmembrane helices, PROSITE motifs, bound cysteines…

The Meta Predict Protein server now allows many other options…

http://www.predictprotein.org/meta.php

Predicting 2° Structure

Investigating & visualizing protein structures

Page 46: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

A common measure is Q3: the percentage of amino acids whose secondary structure was predicted correctly.

Today, Q3 is about 75-78% (as determined objectively by CASP)

The theoretical limit is thought to be about 90%

Authors          Year   % accuracy   Method
Chou-Fasman      1974   50%          propensities of aa's in 2nd structures
Garnier          1978   62%          interactions between aa's
Levin            1993   69%          multiple seq. alignments (MSA)
Rost & Sander    1994   72%          neural networks + MSA
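
To make the Q3 measure concrete, here is a minimal sketch that compares a predicted string of three-state labels with the observed one. The one-letter labels (H for helix, E for strand, C for coil) are a common convention assumed here, and both example strings are invented for illustration rather than taken from a real protein.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state label matches the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented labels: H = helix, E = strand, C = coil
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEECCC"
print(f"Q3 = {q3(predicted, observed)}%")
```

Q3 counts every residue equally, so a predictor can score well while still misplacing the ends of helices and strands, as in the example strings.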

Predicting 2° Structure

Investigating & visualizing protein structures

Page 47: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Predicting 2° Structure

Investigating & visualizing protein structures

E.g. PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

• A simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST.

• Evaluated with a very stringent cross-validation method, a recent version of PSIPRED achieves an average Q3 score of 80.7%.

Page 48: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Protein 3D Structures

A protein’s structure has a critical effect on its function:

1. Binding pockets

PDB ID 1nw7

Investigating & visualizing protein structures

Page 49: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

A protein’s structure has a critical effect on its function:

2. Areas of specific chemical/electrical properties

Protein 3D Structures

Investigating & visualizing protein structures

Page 50: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

A protein’s structure has a critical effect on its function:

3. Importance of the global fold for function

Protein 3D Structures

Investigating & visualizing protein structures

Page 51: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Tertiary structure = protein fold

Complete 3-dimensional structure

Why is it interesting? Isn’t the sequence enough?

A key to understanding protein function

Structure-based drug design

Detection of distant evolutionary relationships

The structure is more conserved than the sequence

Investigating & visualizing protein structures

Page 52: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

RCSB- the Protein Data Bank

• The main, comprehensive database of biological macromolecular structures

• Each structure receives a PDB ID: a unique 4-character identifier (e.g. 1nw7)

• Search by author, PDB ID or any keyword.

• Download structures

Page 53: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

RCSB- Protein Data Bank: http://www.rcsb.org/pdb/home/home.do

PDB ID: 3mht

Investigating & visualizing protein structures

Page 54: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

RCSB- The Protein Data Bank

Download structure

Display structure

Data concerning the structure: resolution, R-value…

The paper describing the structure

Investigating & visualizing protein structures

Page 55: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

RCSB- The Protein Data Bank

PDB files have a specific format:

• TITLE
• REMARK
• COMPND
• JRNL: reference
• SEQRES: the original sequence
• HELIX, SHEET: secondary structure
• ATOM: the actual protein/DNA/RNA chain
• HETATM: additional atoms such as ligands, water etc.
• …

Investigating & visualizing protein structures

Page 56: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

RCSB – The Protein Data Bank

PDB files have a specific format:

ATOM 7 SD MET A 1 -29.059 28.614 71.539 1.00 26.90 S
ATOM 8 CE MET A 1 -27.535 29.074 70.866 1.00 16.57 C
ATOM 9 N ILE A 2 -29.656 32.903 69.094 1.00 25.93 N
ATOM 10 CA ILE A 2 -30.077 33.171 67.730 1.00 25.49 C
HETATM 3139 C6 SAH 328 -11.642 26.514 89.489 1.00 17.97 C
HETATM 3140 N6 SAH 328 -10.474 26.661 90.103 1.00 14.50 N
HETATM 3141 N1 SAH 328 -11.895 25.334 88.899 1.00 23.10 N
HETATM 3142 C2 SAH 328 -13.079 25.090 88.350 1.00 16.93 C
HETATM 3143 N3 SAH 328 -14.120 25.887 88.278 1.00 16.05 N
HETATM 3144 C4 SAH 328 -13.832 27.092 88.861 1.00 14.31 C
HETATM 3145 O HOH 329 -29.525 42.890 90.934 1.00 24.84 O
HETATM 3146 O HOH 330 -28.213 42.867 93.588 1.00 8.11 O
HETATM 3147 O HOH 331 -24.619 35.287 96.173 1.00 17.96 O

Fields marked on the slide: atom, residue or molecule; chain (if it exists); numbering; coordinates (X, Y, Z)

http://www.wwpdb.org/documentation/format3.1-20080211.pdf
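
The sketch below reads coordinate records like the ones shown on this slide. It splits on whitespace because the column layout has been lost in this transcript; real PDB files are fixed-width, and a robust parser should slice by column position as described in the format document, since adjacent fields can touch. The function name `parse_coordinate_line` is hypothetical.

```python
def parse_coordinate_line(line):
    """Parse one whitespace-separated ATOM/HETATM record into a dict.

    Works on records laid out like those on the slide, where the chain
    identifier may be missing (as in the HETATM lines).
    """
    tokens = line.split()
    if tokens[0] not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {line!r}")
    # Counting from the end: x, y, z, occupancy, B-factor, element
    x, y, z, occupancy, b_factor = (float(t) for t in tokens[-6:-1])
    # Between the atom name and the residue number: residue name, optional chain
    middle = tokens[3:-7]
    return {
        "record": tokens[0],
        "serial": int(tokens[1]),
        "atom_name": tokens[2],
        "res_name": middle[0],
        "chain": middle[1] if len(middle) > 1 else "",
        "res_seq": int(tokens[-7]),
        "xyz": (x, y, z),
        "occupancy": occupancy,
        "b_factor": b_factor,
        "element": tokens[-1],
    }


atom = parse_coordinate_line("ATOM 7 SD MET A 1 -29.059 28.614 71.539 1.00 26.90 S")
water = parse_coordinate_line("HETATM 3145 O HOH 329 -29.525 42.890 90.934 1.00 24.84 O")
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["xyz"])
print(water["res_name"], repr(water["chain"]), water["res_seq"], water["xyz"])
```

Reading the numeric fields from the end of the line is what lets the same function handle records with and without a chain identifier.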

Investigating & visualizing protein structures

Page 57: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

More Sequences Than Structures

Discrepancy between the number of known sequences and solved structures:

5,047,807 UniRef90 entries vs. 19,988 structures that are non-redundant at 90% sequence identity

Computational methods are needed to obtain more structures

RCSB – The Protein Data Bank

Investigating & visualizing protein structures

Page 58: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Fold classification

Classification: clustering proteins into structural families

Motivation?

Profound analysis of evolutionary mechanisms

Constraints on secondary structure packing?

Classification at domain level

Investigating & visualizing protein structures

Page 59: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

Fold classification: SCOP

http://scop.berkeley.edu

• The SCOP database aims to provide a description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the PDB.

• The SCOP classification of proteins has been constructed manually, but with the assistance of tools to make the task manageable and help provide generality.

Page 60: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

Fold classification

1. Family: Clear evolutionary relationship
Generally, this means that pairwise residue identities between the proteins are 30% and greater.

2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.

Page 61: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

Fold classification

3. Fold: Major structural similarity
Same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure.

Proteins of the same fold category may not have a common evolutionary origin: the structural similarities could arise from convergent evolution.

Page 62: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

[Figure: growth of unique folds as defined by SCOP; axes: year, number]

Investigating & visualizing protein structures

Page 63: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

CATH: hierarchical classification of protein domain structures in the PDB.

Domains are clustered at five major levels:

• Class
• Architecture
• Topology
• Homologous superfamily
• Sequence family

Fold classification

Investigating & visualizing protein structures

Page 64: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Fold classification

Investigating & visualizing protein structures

• Class [C]: derived from secondary structure content (automatic): alpha, beta, alpha and beta, or few secondary structures.

• Architecture [A]: derived from the orientation of secondary structures (manual)

• Topology [T]: derived from the topological connections of secondary structures (by automated structural alignment)

• Homologous superfamily [H] / sequence family: clusters of similar structures & functions.

Page 65: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

SCOP Vs. CATH

Same SCOP family, different CATH topologies: d1rh6b (a.6.1.7) / 1rh6B00 (1.10.1660.20) vs. d1g4da (a.6.1.7) / 1g4dA00 (1.10.10.10)

Different SCOP classes, same CATH homologous superfamilies: d1bbxd (b.34.13.1) / 1bbxD00 (2.40.50.40) vs. d1rhpa (d.9.1.1) / 1rhpA00 (2.40.50.40)

Csaba et al., 2009
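
The two disagreements can be read directly off the dotted codes. The sketch below counts how many leading levels two codes share, assuming the usual reading of a SCOP code as class.fold.superfamily.family and of a CATH code as class.architecture.topology.homologous superfamily. The function name `shared_levels` is made up for this example.

```python
def shared_levels(code_a, code_b):
    """Count how many leading dot-separated levels two classification codes share."""
    count = 0
    for a, b in zip(code_a.split("."), code_b.split(".")):
        if a != b:
            break
        count += 1
    return count


# (SCOP code, CATH code) for each domain, copied from the slide
pairs = {
    "d1rh6b vs d1g4da": (("a.6.1.7", "1.10.1660.20"), ("a.6.1.7", "1.10.10.10")),
    "d1bbxd vs d1rhpa": (("b.34.13.1", "2.40.50.40"), ("d.9.1.1", "2.40.50.40")),
}

for label, ((scop_a, cath_a), (scop_b, cath_b)) in pairs.items():
    print(label,
          "SCOP levels shared:", shared_levels(scop_a, scop_b),
          "CATH levels shared:", shared_levels(cath_a, cath_b))
```

For the first pair, all four SCOP levels agree while CATH agrees only on class and architecture. For the second pair, SCOP already differs at the class level while all four CATH levels agree.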

Page 66: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

SCOP Vs. CATH

SCOP           CATH
class          class
               architecture
fold           topology
superfamily    homologous superfamily
family         sequence family

CATH is more directed toward structural classification; SCOP pays more attention to evolutionary relationships

Page 67: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PdbSum

• A database providing an overview of all biological macromolecular structures

• Connected to UniProt: find the sequence accession for a known PDB ID

• Detailed description of many structure properties, e.g.:
– EC number
– Chains & ligands and their interactions
– Clefts
– Secondary structure
– FASTA sequence of the structure
– …

Investigating & visualizing protein structures

Page 68: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PdbSum

PDB ID

Free text

Search by sequence

http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/

Investigating & visualizing protein structures

Page 69: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PdbSum

Useful tabs

UniProt accession

Chains & ligands

Investigating & visualizing protein structures

Page 70: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PdbSum

Highlights from the related paper

EC and reaction

GO annotation

Investigating & visualizing protein structures

Page 71: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PdbSum

Protein tab

Secondary structure-from the PDB

Investigating & visualizing protein structures

Page 72: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PdbSum

Ligand tab

LigPlot: shows the residues that bind the ligand

The ligand’s structure

Investigating & visualizing protein structures

Page 73: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Before the invention of computer graphics, trained artists were employed to hand-draw understandable pictures of a protein

Irving Geis (1908 – 1997)

Investigating & visualizing protein structures

Page 74: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PyMOL Viewer

Features:

• Viewing 3D Structures

• Rendering Figures

• Giving Presentations

• Animating Molecules

• Sharing Visualizations

• Exporting Geometry

Investigating & visualizing protein structures

Page 75: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PyMOL Viewer: potassium channel (KcsA) from Streptomyces lividans, PDB ID 1bl8

Doyle et al., 1998

Investigating & visualizing protein structures

Page 76: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

• Identify the different parts of the screen:
– the external GUI window
– the internal GUI window

• The internal window contains the viewer, which displays the molecule, and the command line.

View Manipulation

Investigating & visualizing protein structures

Page 77: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

View Manipulation

To manipulate an object, we use the letter icons next to its name:
– A: Action
– S: Show
– H: Hide
– L: Label
– C: Color

Investigating & visualizing protein structures

Page 78: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

View Manipulation

Change the representation of the object to “Cartoon” using: S (show) → As → Cartoon

Investigating & visualizing protein structures

Page 79: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

View Manipulation

Other protein representations under “S” → “As”:

• Lines

• Ribbons

• Sticks

• Dots

• Spheres

• Surface

Investigating & visualizing protein structures

Page 80: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Part 1: View Manipulation

Color by chain: C (color) → by chain

Investigating & visualizing protein structures

Page 81: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

View Manipulation

Other coloring options:

• Color by spectrum: b-factor, rainbow

• Color by secondary structure (“SS”)

• Color by element:

• Many colors are available; others can be defined in the external GUI under “settings” → “colors…” → “new”

Investigating & visualizing protein structures

Page 82: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

• Select specific amino acids by clicking on them.

• Select a range in the sequence by clicking the first residue, and then “shift+click” on the last residue.

• The selection will be indicated on the structure (in pink dots).

Selecting and manipulating specific parts of the molecule

Investigating & visualizing protein structures

Page 83: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Selecting and manipulating specific parts of the molecule

• In the object list, a new object “(sele)” was added. This object represents the current selection.

• You can manipulate it with the buttons next to the object. For example, change its representation to sticks (“S” → “As” → “Sticks”).

Investigating & visualizing protein structures

Page 84: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Selecting and manipulating specific parts of the molecule

• Give a different name to the selection, so you can easily manipulate it later.

• Select the first chain again (using the sequence) and change its name to “chain1” by pressing “Action” → “Rename Selection” and typing “chain1”.

Investigating & visualizing protein structures

Page 85: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Making high-quality photos

1. Change the background color to white with “Display” “Background” “White” on the external GUI menu.

Investigating & visualizing protein structures

Page 86: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Making high-quality photos

2. Type in the command line: “ray [x], [y]” and wait…

3. Save the image with “Save” “Image”.

Take care not to click on the image before saving!
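The three steps for a high-quality image can also be written as a short script. A minimal sketch, assuming the standard PyMOL commands `bg_color`, `ray` and `png`; the helper name `render_script`, the pixel size and the file name are hypothetical.

```python
def render_script(width, height, filename="figure.png"):
    """Hypothetical helper: PyMOL command lines for a ray-traced image."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return [
        "bg_color white",           # step 1: white background
        f"ray {width}, {height}",   # step 2: ray-trace at the requested pixel size
        f"png {filename}",          # step 3: write the rendered image to disk
    ]


print("\n".join(render_script(1200, 900)))
```

Writing the file straight after `ray` avoids the accidental click that would discard the rendered image.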

Investigating & visualizing protein structures

Page 87: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Making high-quality photos

Investigating & visualizing protein structures

Page 88: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Making high-quality photos

Investigating & visualizing protein structures

Page 89: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Approach: Functionally important amino acid sites are often evolutionarily conserved

ConSurf

The goal: identification of functionally important amino acids that mediate the interaction of a query protein with ligands, DNA/RNA, other proteins, etc.

Investigating & visualizing protein structures

Page 90: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Beta Class N6-Adenine DNA Methyltransferase

Investigating & visualizing protein structures Consurf

Page 91: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

The 3D structure of Beta Class N6-Adenine DNA Methyltransferase has already been solved:

PDB ID: 1nw7

Investigating & visualizing protein structures ConSurf

Page 92: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

• The ConSurf webserver calculates the evolutionary rate for each position in the protein

• The results, mapped on the structure, reveal residues crucial for function and structural stability

• In this case, the ligand is bound in a highly conserved cluster of residues

http://consurf.tau.ac.il/

Investigating & visualizing protein structures Consurf

Page 93: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

The consensus sequence approach:

..W..

..W..

..W..

..W..

..E..

..G..

Investigating & visualizing protein structures Consurf

Page 94: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

However, some sequences might be close homologues of each other

primates

..W..

..W..

..W..

..W..

..E..

..G..

Investigating & visualizing protein structures Consurf

Conclusion: Assessing conservation without considering the phylogenetic relations between the sequences may give misleading results, because sequence space is sampled unevenly
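The effect of uneven sampling can be seen on the example column (W, W, W, W, E, G). A minimal sketch: the group labels other than "primates" are hypothetical, and giving each group one shared vote is only a crude stand-in for a real phylogeny-aware method.

```python
from collections import Counter, defaultdict


def naive_conservation(column):
    """Fraction of sequences carrying the most common residue."""
    counts = Counter(column)
    return max(counts.values()) / len(column)


def group_weighted_conservation(column, groups):
    """Same measure, but each group of close homologues shares one vote.

    `groups` is a hypothetical labelling (one label per sequence); real
    methods derive the weighting from a phylogenetic tree instead.
    """
    group_sizes = Counter(groups)
    weights = defaultdict(float)
    for residue, group in zip(column, groups):
        weights[residue] += 1.0 / group_sizes[group]
    return max(weights.values()) / len(group_sizes)


column = ["W", "W", "W", "W", "E", "G"]
groups = ["primates"] * 4 + ["other1", "other2"]   # "other1/2" are made-up labels
print(naive_conservation(column))                   # W in 4 of 6 sequences
print(group_weighted_conservation(column, groups))  # W in 1 of 3 groups
```

With the four primate sequences counted separately the position looks fairly conserved; counted as one lineage, W is no more frequent than E or G.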

Page 95: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Phylogenetic reconstruction may be used to distinguish between two possible cases:

1. Structural/functional constraints that truly result in sequence conservation due to evolutionary pressure.

2. Short evolutionary time that may be mistaken for sequence conservation, while no evolutionary pressure affects the examined position.

Investigating & visualizing protein structures Consurf

Page 96: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Rate4Site: an algorithm for calculating the evolutionary rate at each amino acid site

Conserved sites evolve slowly; variable sites evolve rapidly

Definition: Evolutionary rate = number of AA replacements/(site*year)

Pupko et al., 2002; Mayrose et al., 2005

Investigating & visualizing protein structures Consurf

Page 97: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Landau et al., 2005

Web-Server: http://consurf.tau.ac.il/

Investigating & visualizing protein structures Consurf

Page 98: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

The Rate4Site conservation scores are continuous values rather than discrete integers.

Such scores are difficult to display directly on a structure.

Hence, the ConSurf webserver divides them into 9 bins: 1 for highly variable, 9 for the most conserved
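The idea of turning continuous rates into nine grades can be sketched as follows. This is an illustration under an assumption of equal-width bins over the observed range; it is not a description of how the ConSurf server actually sets its bin boundaries, and the function name `to_grades` is hypothetical.

```python
def to_grades(rates, n_bins=9):
    """Map continuous evolutionary rates to integer grades 1..n_bins.

    Assumption: equal-width bins over the observed range. A low rate
    (slowly evolving site) gets a high grade (conserved).
    """
    lo, hi = min(rates), max(rates)
    if hi == lo:
        # Flat profile: no information, so place every site in the middle bin.
        return [n_bins // 2 + 1 for _ in rates]
    width = (hi - lo) / n_bins
    grades = []
    for rate in rates:
        index = min(int((rate - lo) / width), n_bins - 1)  # 0 = slowest bin
        grades.append(n_bins - index)                      # slowest site -> grade 9
    return grades


print(to_grades([0.0, 4.5, 9.0]))
```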

Investigating & visualizing protein structures

Consurf coloring bar

Page 99: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

The ConSurf webserver

Essential input (MSA and tree constructed by ConSurf through “advanced options”):

1. PDB ID / PDB file / model structure, and chain

Essential and optional input:

1. PDB ID / PDB file / model structure, and chain

2. Constructed MSA, with the query sequence included

3. Phylogenetic tree

Investigating & visualizing protein structures Consurf

Page 100: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

http://consurf.tau.ac.il/index.html

Page 101: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Bayesian / Max Likelihood

1NW7

Check in the PDBsum…

http://consurf.tau.ac.il/index.html

Essential and Optional input:

MSA

Sequence name in the MSA

Tree

Email

Page 102: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Essential input:

http://consurf.tau.ac.il/index.html

1NW7

Check in the PDBsum…

Page 103: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Essential input:

http://consurf.tau.ac.il/index.html

SWISS-PROT / UniProt

Email

Alignment method

Additional BLAST options

Page 104: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Calculation Finished:

Easy web-based viewer

View scores

Produced or input MSA

View phylogenetic tree

Script for coloring in RasTop*

Instructions for PyMOL*

Viewer for producing medium-quality images*

Page 105: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Jmol- Easy web-based viewer

Investigating & visualizing protein structures Consurf

Page 106: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Summary - MSA Quality

• ConSurf is dependent on the quality of the MSA.

• When an MSA is not given by the user, sequences are automatically gathered by PSI-BLAST and aligned by CLUSTALW with default parameters.

• Even though these alignments are usually good, it is highly recommended to inspect the alignment manually and with other tools in order to improve the quality of the evolutionary data.
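One simple check that can accompany manual inspection is to flag sequences that are mostly gaps. A minimal sketch on a made-up toy alignment; the function names and the 0.5 cutoff are arbitrary assumptions, and this is not part of ConSurf.

```python
def gap_fraction_per_sequence(msa):
    """Return {name: fraction of gap characters} for an aligned MSA.

    `msa` maps sequence names to aligned strings of equal length.
    """
    lengths = {len(seq) for seq in msa.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    length = lengths.pop()
    return {name: seq.count("-") / length for name, seq in msa.items()}


def flag_gappy(msa, max_gap_fraction=0.5):
    """Names of sequences whose gap fraction exceeds the (arbitrary) cutoff."""
    fractions = gap_fraction_per_sequence(msa)
    return [name for name, fraction in fractions.items() if fraction > max_gap_fraction]


# Toy alignment, invented for illustration only.
toy_msa = {
    "query": "MKTAYIAK",
    "hom_1": "MKTA-IAK",
    "hom_2": "M-----AK",
}
print(flag_gappy(toy_msa))
```

Flagged sequences are candidates for removal or re-alignment; the decision still needs a look at the alignment itself.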

Investigating & visualizing protein structures Consurf

Page 107: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

A caveat: In some cases the functionally important region may not be conserved at all

The peptide-binding groove of the MHC class I heavy chain.

PDB ID: 2vaa

Investigating & visualizing protein structures Consurf

Page 108: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Patch: a spatially continuous cluster of surface residues.

Problems:

– Subjectivity of boundaries.

– Difficult to apply to large datasets

PatchFinder: identification of functional sites

Investigating & visualizing protein structures

Page 109: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PatchFinder

Input: 1. Protein structure 2. Multiple sequence alignment (MSA)

(1) Assignment of conservation scores (Rate4Site, ref. 3)

(2) Identification of exposed residues

(3) Extraction of the surface patch of conserved residues with the highest statistical significance (ML-patch).

(4) Identification of non-overlapping secondary patches

References: 1. Nimrod et al., 2005; 2. Nimrod et al., 2008; 3. Mayrose et al., 2004

Investigating & visualizing protein structures

Page 110: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

PatchFinder- http://patchfinder.tau.ac.il/

Investigating & visualizing protein structures

Page 111: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Investigating & visualizing protein structures

Summary of structure-related databases & tools

• Secondary structure prediction: PredictProtein, Meta PredictProtein and PSIPRED.

• PDB, SCOP and CATH: collection and classification of experimentally determined structures.

• Structure visualization: PyMOL

• Conservation analysis: ConSurf and PatchFinder

Page 112: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Protein structure prediction

Page 113: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Structure Prediction Approaches

1. Homology (Comparative) Modeling

Based on sequence similarity with a protein for which a structure has been solved.

2. Threading (Fold Recognition)

Requires a structure similar to a known structure

3. Ab-initio fold prediction

Not based on similarity to a known sequence/structure

Page 114: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Ab-initio

Structure prediction from “first principles”:

Given only the sequence, try to predict the structure based on physico-chemical properties (energy, hydrophobicity etc.)

• Used when all else fails; works for novel folds

• Shows that we understand the process

Page 115: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

The Force Field (energy function)

A group of mathematical expressions describing the potential energy of a molecular system

• Each expression describes a different type of physico-chemical interaction between atoms in the system:

• Van der Waals forces

• Covalent bonds

• Hydrogen bonds

• Charges

• Hydrophobic effects

Non-bonded terms
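Two of the non-bonded terms can be written out explicitly. A minimal sketch, assuming the common 12-6 Lennard-Jones form for van der Waals interactions and a plain Coulomb term for charges; the parameter values in the example are arbitrary, and real force fields add bonded terms, cutoffs and solvent treatment that are omitted here.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy, a common model for van der Waals terms."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


def coulomb(r, q1, q2, k=332.06):
    """Coulomb energy between two point charges.

    k = 332.06 gives kcal/mol when r is in angstroms and the charges are in
    units of the elementary charge (vacuum, no dielectric screening).
    """
    return k * q1 * q2 / r


def nonbonded_energy(r, epsilon, sigma, q1, q2):
    """Sum of the two non-bonded terms for one atom pair."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2)


# Arbitrary illustrative parameters: the well minimum sits at 2^(1/6) * sigma.
r_min = 2 ** (1 / 6) * 3.4
print(lennard_jones(r_min, epsilon=0.2, sigma=3.4))
```

At the minimum the Lennard-Jones energy equals minus epsilon, which is why epsilon is often called the well depth.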

Page 116: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Approaches to Ab-initio Prediction

1. Molecular Dynamics

• Simulates the forces that govern the protein within water.

• Since proteins usually fold naturally, this would lead to the native protein structure.

Problems:

• Thousands of atoms

• Huge number of time steps to reach the folded protein

Feasible only for very small proteins
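The time-stepping at the core of molecular dynamics can be shown on a toy system. A minimal sketch using the velocity Verlet integrator on a single particle attached to a spring (arbitrary units); a protein simulation applies the same kind of update to thousands of atoms with a full force field, which is where the cost comes from.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations for one particle in one dimension."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v


k = 1.0  # toy spring constant (arbitrary units)


def spring_force(x):
    return -k * x


def energy(x, v):
    """Kinetic plus potential energy for unit mass."""
    return 0.5 * v * v + 0.5 * k * x * x


x, v = velocity_verlet(1.0, 0.0, spring_force, mass=1.0, dt=0.01, n_steps=10_000)
print(energy(1.0, 0.0), energy(x, v))  # total energy should stay nearly constant
```

Even this one-particle toy takes ten thousand steps to cover a short stretch of simulated time, which hints at why folding simulations are limited to very small proteins.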

Page 117: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Approaches to Ab-initio Prediction

2. Minimal Energy

Assumption: the folded form is the minimal energy conformation of a protein

Main principles:

• Define an energy function.

• Search for the 3D conformation that minimizes the energy.

Page 118: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Ab-initio

2. Minimal Energy

• Use of simplified energy function

• Search methods for minimal energy conformation:

– Greedy search

– Simulated annealing

– …
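Simulated annealing can be sketched on a one-dimensional toy landscape. The energy function, cooling schedule and step size below are arbitrary assumptions chosen for illustration; a structure-prediction program would search over conformations rather than a single number.

```python
import math
import random


def simulated_annealing(energy, x0, n_steps=5000, t_start=2.0, t_end=0.01,
                        step_size=0.5, seed=0):
    """Minimize `energy` over one real variable; returns (best_x, best_energy)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        delta = e_candidate - e
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x):
    """Rugged 1D landscape with several local minima (illustrative only)."""
    return 0.1 * x * x + math.sin(3.0 * x)


best_x, best_e = simulated_annealing(toy_energy, x0=4.0)
print(best_x, best_e)
```

A greedy search would accept only downhill moves and stop in the nearest local minimum; accepting occasional uphill moves at high temperature is what lets annealing escape it.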

Page 119: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

• Current methods (e.g. Rosetta) primarily utilize the fact that although we are far from observing all protein folds, we probably have seen nearly all sub-structures:

Ab-initio

Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)

Local sequence-structure relationships:

• A library of known sub-structures (fragments of fewer than 10 residues) is created.

• A range of possible conformations for each fragment in the query protein is selected.

Page 120: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Ab-initio

Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)

Non-local sequence-structure relationships:

• The primary nonlocal interactions considered are hydrophobic burial, electrostatics, main-chain hydrogen bonding etc.

Structures that are consistent with both the local and non-local interactions are generated by minimizing the non-local interaction energy in the space defined by the local structure distributions.

Page 121: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Ab-initio - Example

Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)

Page 122: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Given a sequence and a library of folds, thread the sequence through each fold. Take the one with the highest score.

• The method will fail if the new protein does not belong to any fold in the library.

• The score of the threading is computed based on known physical chemistry properties and statistics of amino acids.

Fold Recognition (Threading)

Page 123: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Threading: example

• structural template

• neighbor definition

• energy function

Pairwise energy matrix $E_{ab}$:

     A    C    D    E   …
A   -3   -1    0    0   …
C   -1   -4    1    2   …
D    0    1    5    6   …
E    0    2    6    7   …
…

Sequence ACCECADAAC threaded onto template positions 1–10 (figure not recoverable):

$$E = \sum_{\text{positions } i,j} E_{a_i b_j} = -3-1-4-4-1-4-3-3 = -23$$

where $a_i$ and $b_j$ are the residues placed at positions $i$ and $j$.
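The threading example (pairwise energy matrix, sequence ACCECADAAC, total of -23) can be reproduced in a few lines. A minimal sketch: the matrix values are taken from the slide, but the contact list is hypothetical, since the template figure is not recoverable; the pairs were chosen only so that they give the slide's eight terms.

```python
# Pairwise energies from the slide's matrix (symmetric, upper triangle stored).
PAIR_ENERGY = {
    ("A", "A"): -3, ("A", "C"): -1, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "C"): -4, ("C", "D"): 1, ("C", "E"): 2,
    ("D", "D"): 5, ("D", "E"): 6,
    ("E", "E"): 7,
}


def pair_energy(a, b):
    """Look up the symmetric pairwise energy of residues a and b."""
    return PAIR_ENERGY[(a, b)] if (a, b) in PAIR_ENERGY else PAIR_ENERGY[(b, a)]


def threading_energy(sequence, contacts):
    """Sum pairwise energies over template contacts (1-based position pairs)."""
    return sum(pair_energy(sequence[i - 1], sequence[j - 1]) for i, j in contacts)


# Hypothetical contact list: NOT taken from the slide's template, only chosen
# to reproduce the eight terms -3-1-4-4-1-4-3-3.
contacts = [(1, 6), (1, 5), (2, 5), (3, 10), (2, 9), (5, 10), (6, 9), (1, 8)]
print(threading_energy("ACCECADAAC", contacts))
```

Threading the same sequence onto another template only changes the contact list, so templates can be ranked by the resulting totals.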

Page 124: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

MAHFPGFGQSLLFGYPVYVFGD...

Potential fold

...

1) ... 56) ... n)

...

-10 ... -123 ... 20.5

Find best fold for a protein sequence: Fold recognition (threading)

Page 125: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

GenTHREADER

• Align the query sequence with each template (requires some sequence homology!)

• Assess the alignment by:

– Sequence alignment score

– Pairwise potentials

– Solvation function

• Record lengths of: alignment, query, template

• The overall score is computed using a neural network.

Jones DT et al. J. Mol. Biol. 287: 797-815 (1999)

Page 126: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

GenTHREADER

Jones DT et al. J. Mol. Biol. 287: 797-815 (1999)

Page 127: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

I-TASSER- Hybrid Approach

• In a recent community-wide blind experiment, CASP7, I-TASSER generated the best 3D structure predictions among all automated servers.

• Based on secondary-structure threading and the iterative implementation of the Threading ASSEmbly Refinement (TASSER) program.

• For predicting the biological function of the protein, the I-TASSER server matches the predicted 3D models to the proteins in 3 independent libraries, which consist of proteins of known enzyme classification (EC) number, gene ontology (GO) vocabulary, and ligand-binding sites.

Page 128: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

I-TASSER

Page 129: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Test Case: Rosetta vs. TASSER

Grey: Crystal structure of Beta-nnnn:

Purple: Rosetta prediction, starting from homology modeling

Green: TASSER prediction

Page 130: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Homology Modeling – Basic Idea

Triosephosphate isomerases: 44.7% sequence identity, 0.95 RMSD

1. A protein structure is defined by its amino acid sequence.

2. Closely related sequences adopt highly similar structures; distantly related sequences may still fold into similar structures.

3. The three-dimensional structure of proteins from the same family is more conserved than their primary sequences.
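The two numbers quoted for the example pair, sequence identity and RMSD, can be computed as follows. A minimal sketch on made-up inputs: identity here is counted over aligned non-gap columns (other conventions use a different denominator), and the RMSD assumes the two structures have already been superimposed.

```python
import math


def percent_identity(aligned_a, aligned_b):
    """Percent of aligned (non-gap) columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Toy inputs, invented for illustration only.
print(percent_identity("MKT-AYIA", "MRTQAY-A"))
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```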

Page 131: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

General Scheme

1. Searching for structures related to the query sequence

2. Selecting templates

3. Aligning query sequence with template structures

4. Building a model for the query using information from the template structures

5. Evaluating the model

Fiser A et al. Methods in Enzymology 374: 461-491(2004)

Page 132: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

General Scheme

Page 133: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Homology modeling requires handling structures & sequences

• Query: only the protein sequence is available, usually found in the UniProt database

• Template: after identification, both structural and sequence-related data should be found in UniProt (or NCBI databases), RCSB and PDBsum

Page 134: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Homology modeling- query-template alignment

Different levels of similarity between the template & query call for different computational approaches:

Page 135: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Evolutionary Conservation

Homology modeling- model evaluation

http://consurf.tau.ac.il

Page 136: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

http://consurf.tau.ac.il

Homology modeling- model evaluation

Evolutionary Conservation

Page 137: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

http://consurf.tau.ac.il

Homology modeling- model evaluation

Evolutionary Conservation

Page 138: Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them

Homology Modeling

• The accuracy of the model depends on its sequence identity with the template: