dalla sequenza alla struttura

Dalla sequenza Dalla sequenza alla strutturaalla struttura

Mauro Fasano

Dipartimento di Biologia Strutturale e FunzionaleCentro di Neuroscienze

Università dell’Insubria – Busto [email protected]

http://fisio.dipbsf.uninsubria.it/cns/fasano

mailto:[email protected]

http://fisio.dipbsf.uninsubria.it/cns/fasano

Dalla sequenza alla struttura

O2

VLSEGEWQLVLV . . .Sequenza Struttura Funzione

Che informazioni offre la struttura?

• Conformazione dei siti attivi e di legame• Orientazione dei residui conservati• Interpretazione di meccanismi• Visualizzazione di cavità• Calcolo di potenziale elettrostatico• …

Esempio

• FtsZ – divisione cellulare in procarioti, mitocondri e cloroplasti.

• Tubulina – componente strutturale dei microtubuli – comunicazione intracellulare e divisione cellulare.

• FtsZ e Tubulina hanno bassa similarità di sequenza e non sembrerebbero omologhe.

Burns, R., Nature 391:121-123Picture from E. Nogales

FtsZ e tubulina sono omologhe?

• Proteine che hanno conservato la struttura tridimensionale possono derivare da un progenitore comune anche se la divergenza della sequenza non permette più di riconoscere l’omologia.

Un altro esempio

• α-lattalbumina e lisozima possiedono:– Stesso fold– Moderata similarità– Diversa funzione

Metodi sperimentali:

• Diffrazione dei raggi x

• Risonanza magnetica nucleare

Cristallografia a raggi X

• Ottenere cristalli della proteina– 0.3-1.0 mm– Le singole molecole sono ordinate in modo

periodico, ripetitivo. • La struttura è determinata dai dati di

diffrazione.

Image from http://www-structure.llnl.gov/Xray/101index.html

http://www-structure.llnl.gov/Xray/101index.html

Schmid, M. Trends in Microbiology, 10:s27-s31.

• Le proteine devono cristallizzare– Grande quantità– Solubili

• Accesso a radiazione adatta• Tempo di calcolo per risolvere la struttura

Cristallografia a raggi X

Risonanza Magnetica Nucleare (NMR)

• Proteine in soluzione• Limite di dimensione ~ 40 kDa• Proteine stabili a lungo• Marcatura con 15N, 13C, 2H.• Strumentazione molto costosa• Tempo per assegnare le risonanze

Il Protein Data Bank

Crescita del PDB

Motivi strutturali depositati ogni anno

Percentuale di nuovi motivi strutturali

HEADER BINDING PROTEIN 01-JUN-95 1HXN 1HXN 2COMPND MOL_ID: 1; 1HXN 3COMPND 2 MOLECULE: HEMOPEXIN; 1HXN 4COMPND 3 CHAIN: NULL; 1HXN 5COMPND 4 DOMAIN: C-TERMINAL DOMAIN; 1HXN 6COMPND 5 SYNONYM: HPX; 1HXN 7COMPND 6 HETEROGEN: PO4 1HXN 8SOURCE MOL_ID: 1; 1HXN 9SOURCE 2 ORGANISM_SCIENTIFIC: ORYCTOLAGUS CUNICULUS; 1HXN 10SOURCE 3 ORGANISM_COMMON: RABBIT; 1HXN 11SOURCE 4 TISSUE: SERUM 1HXN 12KEYWDS HEME 1HXN 13EXPDTA X-RAY DIFFRACTION 1HXN 14AUTHOR H.R.FABER,E.N.BAKER 1HXN 15REVDAT 1 15-OCT-95 1HXN 0 1HXN 16JRNL AUTH H.R.FABER,C.R.GROOM,H.BAKER,W.MORGAN,A.SMITH, 1HXN 17JRNL AUTH 2 E.N.BAKER 1HXN 18JRNL TITL 1.8 ANGSTROMS CRYSTAL STRUCTURE OF THE C-TERMINAL 1HXN 19JRNL TITL 2 DOMAIN OF RABBIT SERUM HEMOPEXIN 1HXN 20JRNL REF TO BE PUBLISHED 1HXN 21JRNL REFN 0353 1HXN 22REMARK 1 1HXN 23

ATOM 1 CA GLU 225 -0.900 -1.002 39.233 1.00 70.00 1HXN 170ATOM 2 C GLU 225 -0.185 0.146 39.970 1.00 70.00 1HXN 171ATOM 3 O GLU 225 -0.514 1.329 39.758 1.00 70.00 1HXN 172ATOM 4 N SER 226 0.788 -0.203 40.823 1.00 70.00 1HXN 173ATOM 5 CA SER 226 1.534 0.805 41.594 1.00 70.00 1HXN 174ATOM 6 C SER 226 2.231 1.806 40.681 1.00 68.89 1HXN 175ATOM 7 O SER 226 1.883 1.952 39.514 1.00 70.00 1HXN 176ATOM 8 CB SER 226 2.572 0.130 42.515 1.00 70.00 1HXN 177ATOM 9 OG SER 226 3.237 -0.941 41.848 1.00 70.00 1HXN 178ATOM 10 N THR 227 3.242 2.478 41.223 1.00 65.51 1HXN 179ATOM 11 CA THR 227 3.989 3.417 40.410 1.00 70.00 1HXN 180ATOM 12 C THR 227 4.274 2.705 39.080 1.00 56.25 1HXN 181ATOM 13 O THR 227 4.179 3.296 38.022 1.00 44.63 1HXN 182ATOM 14 CB THR 227 5.354 3.797 41.074 1.00 70.00 1HXN 183ATOM 15 OG1 THR 227 5.114 4.682 42.172 1.00 70.00 1HXN 184ATOM 16 CG2 THR 227 6.256 4.492 40.065 1.00 70.00 1HXN 185

http://www.expasy.ch/spdbv

Esegui

Classificazione delle proteine:

• SCOP (Structural Classification of Proteins, scop.mrc-lmb.cam.ac.uk/scop/, Murzin et. al.):

548 folds (major structural similarity in terms of secondary structures e.g. globin-like, Rossman fold); 1296 families (clear evolutionary relationship or homology e.g. globins, Ras)

• CATH (Class, Architecture, Topology, Homologous Superfamily, www.biochem.ucl.ac.uk/bsm/cath/, Orengo et. al):

35 architectures (gross arrangment of secondary structures e.g. non-bundle, sandwich); 580 topologies (connectivity of secondary structures e.g. globin-like, Rossman fold); 1846 families (clear homology, same function)

Structural Classification Of Proteins

Predizione della struttura secondaria e terziaria

Metodi predittivi

»Comparative modeling> 30% similitudine

»Threading/Fold recognition0 – 30% similitudine

»Ab initionessun omologo

Qualità del modello comparativo

Identità di sequenza:

60-100% Confrontabile con NMR media risoluzioneSpecificità di substrato

30-60% Molecular replacement in cristallografiaPartenza per site-directed mutagenesis

<30% Gravi errori

Building by homology (Homology modelling)

---

G

---

Y

---

M

AAAA

KSTA

AGGG

YFFY

LEDA

VVVV

LVI

L

SEDS

Allineamento con proteine a struttura nota

Modello strutturale

Fold recognition (Threading)

Sequenza:

+Motivi strutturali noti

SLVAYGAAM

Modello strutturale

Ab initio

Sequenza

SLVAYGAAM

Modello strutturale

General Flowchart

Un numero grandissimo di polipeptidi si struttura in un numero finito (e relativamente piccolo) di folds

Almeno una proteina su due di quelle presenti nel database ha un omologo (identità > 30%) che quasi sempre ha lo stesso fold.

Building by homology

Costruire il modello comparativo

1) Cercare il massimo numero di omologhi che possiedano una entry nel PDB. Strumenti che utilizzano PSSM sono più sensibili. In questo caso vengono utilizzate sequenze senza struttura per costruire la PSSM.

2) Costruire un accurato allineamento multiplo tra la sequenza da modellare e tutte le entries che verranno utilizzate come templato.

Trovare strutture di proteine la cui sequenza è simile

allineamento

Modello strutturale

Verifica

OK!

Costruire il modello stesso

Determinare la struttura secondaria in base all’allineamento

Costruire le regioni conservate. Per ciascuna regione possiamo prendere le coordinate del frammento con la maggior similarità di sequenza.

Costruire le regioni variabili, solitamente loops.

Usando raccolte di loops osservati in strutture note, in base alla loro lunghezza ed alla loro sequenza

Costruendo la conformazione del loop ab initio. Vengono generate numerose conformazioni casuali e si calcola l’energia in un opportuno campo di forze.

Costruzione dei loops:

Alcuni siti web di homology modeling

COMPOSER – felix.bioccam.ac.uk/soft-base.html

MODELLER – guitar.rockefeller.edu/modeller/modeller.html

WHAT IF – www.sander.embl-heidelberg.de/whatif/

SWISS-MODEL – www.expasy.ch/SWISS-MODEL.html

Swiss-Modelhttp://www.expasy.ch/swissmod/SWISS-MODEL.html

http://guitar.rockefeller.edu/modeller/about_modeller.shtml

Modeller

Advanced program for homology modeling

Based on distance constraints

Implemented in several popular modelling packages such as InsightIIThe source is available for unix platforms at the above URL

Threading (fold recognition)

La sequenza di input viene confrontata con una libreria di folds noti

Si calcola un punteggio che esprima la compatibilità tra la sequenza e ciascun fold considerato

Punteggi statisticamente significativi indicano che la sequenza ha una certa probabilità di assumere la stessa struttura 3D del fold considerato

Input:

Sequenza

Donatore HAccettore HGlyIdrofobico

Collezione di folds di proteine note

S=20S=5S=-2Z=5Z=1.5Z= -1

Donatore HAccettore HGlyIdrofobico

Chain/Domain Library

Scoring functions for fold recognition

Ci sono due metodi per valutare la compatibilità sequenza-struttura (1D-3D)

In methods based on structural profile, for every fold a profile is built based on structural features of the fold and compatability of every amino acid to the features.

The structural features of each position are determined based on the combination of secondary structure, solvent accessibility and the property of the local environment (hydrophobic/hydrophilic)

The profile is a defined mathematical structure, adjusted for pair-wise comparisons and dynamic programming

10100N

::::::::

10100167-9987-242

10100-80101-50101

GextGopY…DCA

Amino acid typePo

sitio

n on

sequ

ence

Contact potentials

This method is based on predefined tables which include pseudo-energetic scores to each pairwise interaction of two amino acids.

This method makes use of distance matrix for representation of different folds

For each pair of amino acids which are close in space the interaction energy is summed. The total sum is the indication for the fitness of the sequence into that structure

Scoring Function

…YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…

Tendenza a stare in un certo ambiente: E_s

(singleton term)

Tendenza a stare vicini: E_p

(pairwise term)

Alignment gap penalty: E_g

Energia totale: E_m + E_p + E_s + E_g

Descrive quanto la sequenza assomiglia al templato

Qualità dell’allineamento in una certa posizione: E_m

(mutation term)

• • • • • • • • • • • • • •Amino acid index

Am

ino

acid

inde

x

••••

1

N1 N

•••• • •

••

Expected Performance

Predicted model

X-raystructure

PROSPECT prediction in CASP4:12 out 19 folds (no homology) recognized

Web sites for fold recognition

3D-PSSM - http://www.bmm.icnet.uk/~3dpssm

Profiles:

Libra I - http://www.ddbj.nig.ac.jp/htmls/E-mail/libra/LIBRA_I.html

UCLA DOE - http://www.doe-mbi.ucla.edu/people/frsvr/frsvr.html

Contact potentials

123D - http://www-Immb.ncifcrf.gov/~nicka/123D.html

Profit - http://lore.came.sbg.ac.at/home.html

Risultati

Ab initio methods for modelling

This field is of great theoretical interest but, so far, of very little practical applications. Here there is no use of sequence alignments and no direct use of known structures

The basic idea is to build empirical function that simulates real physical forces and potentials of chemical contacts

If we will have perfect function and we will be able to scan all the possible conformations, then we will be able to detect the correct fold

Algorithms for Ab initio prediction include:A. Searching procedure that scans many possible structures (conformations)B. Scoring function to evaluate and rank the structures

Due to the large search space, heuristic methods are usually applied

The parameters in the searching procedure are the dihedral angles which specify the exact fold of the polypeptide chain

A

C

B

E

D

A

C

B

ED

New Fold Methods

• Since almost all predictors use sequence and structural databases in some form, there is no longer an “ab initio” category

• Assessment is sometimes difficult to communicate due to the complexity of the protein structure and completeness of prediction

• Methods are still somewhat limited to smaller proteins

Rosetta-David Baker

• Based on the assumption that the distribution of conformations sampled by a local segment of the polypeptide chain is reasonably approximated by the distribution of structures adopted by that sequence and closely related sequences in known protein structures.

• Fragment libraries for all possible three and nine residue segments of the chain are extracted from PDB by profile methods

Rosetta-Simulation Procedure

• Information on fragments from secondary structure prediction methods compiled and scored based on equation for local secondary structure propensity

• Conformational space defined by these fragments is then searched by a Monte Carlo procedure with an energy function that favors compact structures with paired beta strands and buried hydrophobic residues, refinement of procedure late in simulation

• Thousands of structures generated• Filters to remove bad structures• Remaning structures clustered and cluster center taken

as the prediction.

Force fields- collection of terms that simulate the forces act between atoms

Terms based on probabilities to find pairs of amino acids or atoms within specific distances

Terms based on surface area and overlapping volume of spheres representing atoms

Methods to evaluate structures are based on

In homology modelling, construction of the side chains is done using the template structures when there is high similarity between the built protein and the templates

In spite of the huge size of the problem (because each side chain influences its neighbours) there are quite succesful algorithms to this problem.

Side chain construction

Without such similarity the construction can be done using rotamer libraries

A compromise between the probability of the rotamer and its fitness in specific position determines the score. Comparing the scores of all the rotamer for a given amino acid determines the preferred rotamer.

In this work we examined differences in structures of amino- acid side chains around point mutations.

Phe

AsnConformation - a given setof dihedral angle which defines a structure.

Rotamer - energetically favourable conformation.

SER 59.6 41.0 SER -62.5 26.4SER 179.6 32.6

Example to library of rotamers

TYR 63.6 90.5 21.0TYR 68.5 -89.6 16.4TYR 170.7 97.8 13.3TYR -175.0 -100.7 20.0TYR -60.1 96.6 10.0TYR -63.0 -101.6 19.3

Model evaluation

The main approaches for model evaluation are:A. Use of internal information (such as the one that used for the model construction)B. Use of external information derived from the databases

After the model is built we can check it by various methods.

If the model turns out to be bad, it is necessary to repeat several stages of the model building

Usually algorithms are checked by building models for proteins which have already solved structure and comparison between the model and the native structure

It is always possible that information from the native structure will be used in direct or indirect ways for model building

A more objective test is prediction of structures before they are publicly distributed (this is the idea of the CASP competitions)

dalla sequenza alla struttura

Documents