bioinformatics: introduction and methods

Bioinformatics: Introduction and Methods Le Zhang

Computer Science Department, Southwest University

Functional prediction of genetic variants

Le Zhang, Ph. D. Computer Science Department Southwest University

Unit 1: Overview of the problem


Do you think Angelina made the right decision to remove her breasts?

Angelina Jolie has a genetic mutation in BRCA1.

How can we predict the likelihood of her getting breast cancer given this mutation?

• P(breast cancer | her mutation)

• P(breast cancer free | her mutation)
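The two conditional probabilities above can be estimated from counts of mutation carriers, as in this minimal sketch. The function name and the cohort counts are hypothetical and chosen only for illustration; they are not real BRCA1 data.

```python
def conditional_probability(carriers_affected: int, carriers_unaffected: int) -> float:
    """Estimate P(disease | variant) as the affected fraction among carriers."""
    total = carriers_affected + carriers_unaffected
    if total == 0:
        raise ValueError("no carriers observed")
    return carriers_affected / total


# Hypothetical cohort counts, for illustration only (not real BRCA1 data).
p_cancer = conditional_probability(carriers_affected=60, carriers_unaffected=40)
p_free = 1.0 - p_cancer  # the two conditional probabilities sum to 1

print(f"P(breast cancer | mutation)      = {p_cancer:.2f}")
print(f"P(breast cancer free | mutation) = {p_free:.2f}")
```

The point of the sketch is only that both quantities are conditioned on carrying the variant, so they are complementary.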

The dawning of the age of personalized medicine

Next-generation sequencing can sequence one person's whole genome for about $3,000.

The personal genomes hold promises for a future of personalized medicine.

Where did your genetic variations come from?

somatic mutations de novo mutations inherited from parents

Annapurna Poduri et al. Somatic Mutation, Genomic Variation, and Neurological Disease. Science, 5 July 2013, 341.

Types of genetic variations in a human genome

• Chromosomal aneuploidy • Structural Variations (SVs) • Copy Number Variations (CNVs) • Short insertion/deletions (indels) • Single Nucleotide Variations (SNVs)

Nomenclature: Mutation vs. polymorphism vs. variation vs. variant

Structural Variation (SV) and Copy Number Variation (CNV): • Insertion • Deletion • Inversion • Translocation • CNV

Indel: short insertion/deletion • Within intergenic/intronic regions • Within coding regions: frameshifting or non-frameshifting

SNV: Single Nucleotide Variation

There are about 3 million SNVs in one person's genome, equivalent to a frequency of roughly 1 per 1,000 bases.
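The frequency quoted above follows from simple arithmetic. The sketch below assumes a haploid human genome size of about 3.1 billion bases, which is not stated on the slide.

```python
GENOME_SIZE_BP = 3.1e9   # assumption: approximate haploid human genome size
SNVS_PER_GENOME = 3.0e6  # figure quoted on the slide

snv_frequency = SNVS_PER_GENOME / GENOME_SIZE_BP
bases_per_snv = GENOME_SIZE_BP / SNVS_PER_GENOME

print(f"SNV frequency: {snv_frequency:.5f} per base")
print(f"about one SNV every {bases_per_snv:.0f} bases")
```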

SNVs within coding regions

Stop gain(nonsense)

Stop loss

Non‐synonymous(missense)

Synonymous(silent)

Affect splicing Missense mutation Nonsense mutation

Missense (nonsynonymous) SNVs

Missense SNVs change the amino acid.

Missense SNVs account for ~2% of the genome but >50% of all mutations known to be involved in human inherited diseases.

BRCA1 vs. breast cancer

In 1990, DNA linkage studies on large families identified BRCA1 as the first gene associated with breast cancer.

BRCA1: • Located on chromosome 17 • 80,818 bp in length • 23 exons • Encodes a protein of 1,863 amino acids • A tumor suppressor gene that repairs damaged DNA and regulates cell growth and cell death

Approximately 5-10% of breast cancers and 14% of ovarian cancers arise from a BRCA1 or BRCA2 genetic mutation.

However, not all missense SNVs cause a phenotype change. Some are pathogenic, but many are neutral. Of a total of 238 known missense variations in BRCA1:

• 163 are present only in patients

• 62 are present only in healthy persons

• 13 are present in both patients and healthy persons

On average, a healthy individual has

Class                        Number
SNP                          3,019,909
Indel                        361,669
Deletions                    15,893
Duplications                 407
Mobile element insertions    4,775

Within protein-coding regions,

Class                                  Number
Synonymous SNPs                        60,157
Non-synonymous SNPs                    68,300
Small in-frame indels                  714
Small frameshift indels                954
Stop losses                            77
Stop-introducing SNPs                  1,057
Genes disrupted by large deletions     147
Total genes containing LOF variants    2,304
HGMD 'damaging mutation' SNPs          671

This is still an unsolved problem with a lot of active, ongoing research!

• What features differentiate disease-causing variants from neutral ones?

• How can we predict whether a variation is disease-causing?

Unit 2: Databases of genetic variations


dbSNP

http://www.ncbi.nlm.nih.gov/SNP/

Created in September 1998 by the NCBI (National Center for Biotechnology Information) in collaboration with the NHGRI (National Human Genome Research Institute).

Its goal is to act as a single database that contains all identified genetic variation.

dbSNP

New information obtained by dbSNP becomes available to the public periodically in a series of "builds".

Contains a range of molecular variation: • SNPs • Indels • Multinucleotide polymorphisms • Microsatellite markers • Short tandem repeats • Heterozygous sequences

As of dbSNP build 138, it contains variants from 131 organisms. For Homo sapiens:

Statistic                         Count
Number of submissions (ss)        232,952,851
Number of RefSNP clusters (rs)    62,676,337
Validated rs                      44,278,189
Number of rs in gene              27,608,151
Number of ss with genotype        73,909,251
Number of ss with frequency       35,997,830

dbSNP: Data increase

[Figure: growth of dbSNP for Homo sapiens from build 125 in 2005 to build 138 in 2013. x-axis: year (2005 to 2012); y-axis: count (0 to 250,000,000); series: number of submissions (ss), number of RefSNP clusters (rs), number of rs in gene.]

dbSNP- Record

1000 Genomes

http://www.1000genomes.org/

The 1000 Genomes Project, launched in January 2008, is an international research effort to establish by far the most detailed catalogue of human genetic variation.

• Pilot: in 2010, the project finished its pilot phase

• Phase I: in October 2012, the sequencing of 1,092 genomes was announced in a Nature publication

1000 Genomes

Sequencing technology used: Illumina, SOLiD, 454

Phase I          Whole genome                                Whole exome
Strategy         Low-coverage whole-genome sequencing        Deep sequencing of the whole exome
Coverage         2-6X                                        50-100X
Sample number    1,092                                       1,039

OMIM: Online Mendelian Inheritance in Man

A database that catalogues all the known diseases with a genetic component and links them to the relevant genes in the human genome. It contains information on all known Mendelian disorders and over 12,000 genes.

http://www.omim.org/

OMIM was initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders, published as a book entitled Mendelian Inheritance in Man (MIM). Twelve book editions of MIM were published between 1966 and 1998.

The online version, OMIM, was created in 1985 and made generally available on the internet starting in 1987.

OMIM Entry Statistics

Human Gene Mutation Database (HGMD)

A comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human inherited disease.

By 2013, the database contained over 141,000 different variants detected in over 5,700 different genes.

Two versions:

• Professional: requires a yearly subscription

• Public: freely available but permanently 3 years out of date, and does not contain any of the additional annotations or extra features present in HGMD Professional

Created by biologist David N. Cooper and mathematician Michael Krawczak in 1996.

Originally established for the scientific study of mutational mechanisms in human genes causing inherited disease, but has since acquired a much broader utility as a central unified repository for germ-line disease-related functional variation.

All HGMD mutation data are manually curated from the scientific literature.

HGMD 2013.2

HGMD http://www.hgmd.cf.ac.uk/ac/index.php

Locus specific databases (LSDBs)

• Collect all known variants of each disease-related gene in a specific database

• Annotate them with complete and accurate information on genetic mutations

• Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information

http://www.lovd.nl/3.0/home

Unit 3: Conservation-based and Rule-based methods: SIFT & PolyPhen


Questions:

• What features differentiate disease‐causing variants from neutral ones?

• How can we predict whether a variation is disease‐causing?

Phenotypical/functional “effects” of human genetic variations

• Disease vs. normal • Deleterious vs. neutral

• Personal trait differences (e.g., height)

Observations, not “truth”

Statistical and stochastic, not deterministic

• Animal model phenotypic changes • Cellular phenotypic changes

• Protein function changes

• Protein structure changes

• Protein sequence changes

• Nonsense mutations are usually considered deleterious, even though this is not always the case…

• Known deleterious mutations are enriched in nonsynonymous mutations. • ~50% of known mutations in Mendelian disorders are nonsynonymous mutations • Ascertainment bias?

• Synonymous mutations, intronic mutations, and intergenic mutations are under-studied. • According to GWAS studies, 88% of trait-associated variants of weak effect are non-coding.

• Most research so far has focused on nonsynonymous mutations.

1999: Earliest attempt based on BLOSUM substitution matrix

• Assumption: if the substitution score between a variant residue and the wild type residue is positive, then the variant is neutral. If the substitution score is negative, then the variant is deleterious.
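The sign rule above can be sketched in a few lines. The dictionary below is only a small excerpt of BLOSUM62 typed in by hand; a real tool would load the full matrix. The function names and the "undetermined" label for a score of zero are assumptions made for this example, since the rule on the slide does not cover that case.

```python
# Small hand-typed excerpt of BLOSUM62 (the matrix is symmetric, so each
# pair is stored once with its letters in alphabetical order).
BLOSUM62_EXCERPT = {
    ("I", "L"): 2, ("D", "E"): 2, ("K", "R"): 2, ("A", "S"): 1,
    ("L", "P"): -3, ("R", "W"): -3, ("G", "W"): -2,
}


def substitution_score(wild_type: str, variant: str) -> int:
    """Look up the substitution score for an unordered residue pair."""
    key = tuple(sorted((wild_type, variant)))
    return BLOSUM62_EXCERPT[key]


def predict_by_blosum(wild_type: str, variant: str) -> str:
    """Positive score: neutral. Negative score: deleterious."""
    score = substitution_score(wild_type, variant)
    if score > 0:
        return "neutral"
    if score < 0:
        return "deleterious"
    return "undetermined"  # assumption: the rule says nothing about zero


print(predict_by_blosum("L", "I"))  # conservative hydrophobic swap
print(predict_by_blosum("W", "R"))  # aromatic to charged residue
```

Because the score depends only on the two residues, this rule ignores where in the protein the substitution occurs, which is one reason the later methods look at the position as well.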

More successful methods

• Conservation‐based (e.g., SIFT)

• Rule‐based (e.g., PolyPhen)

• Classifier‐based (e.g., PolyPhen2, SAPRED)

Sort Intolerant From Tolerant substitutions (SIFT)

Published in 2001 by Pauline C. Ng and Steven Henikoff. The first tool for predicting deleterious amino acid substitutions. Website: http://sift.jcvi.org/

SIFT bets on evolution Important positions (such as active sites) tend to be conserved in the protein family across species. • Mutations at well‐conserved positions tend to be deleterious.

Some positions have a high degree of diversity across species. • Mutations at these positions tend to be neutral.

SIFT is a multistep procedure

Given a protein sequence:

Step 1. Search for similar sequences

Sequence search database: SWISS-PROT

PSI-BLAST is run for four iterations to collect a pool of sequences similar to the query

Step 2. Choose closely related sequences that are likely to share similar function

The PSI-BLAST results are grouped together if they are >90% identical in the aligned regions

Step 3. Obtain the multiple alignment of these chosen sequences

Step 4. Calculate normalized probabilities for all possible substitutions at each position in the alignment

If the SIFT score is less than 0.05, the SNV is considered to be deleterious. Otherwise, it is considered neutral.
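Step 4 and the 0.05 cutoff can be illustrated with a simplified sketch. It is not the published SIFT algorithm: here each residue's observed frequency in one alignment column is simply divided by the frequency of the most common residue, and the real tool is more elaborate (for example in how it treats residues that were never observed). The two alignment columns are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column: str) -> dict:
    """Frequency of each amino acid in one alignment column, divided by
    the frequency of the most common amino acid in that column."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)  # skip gaps
    if not counts:
        raise ValueError("column has no residues")
    top = max(counts.values())
    return {aa: counts.get(aa, 0) / top for aa in AMINO_ACIDS}


def classify(column: str, variant: str, cutoff: float = 0.05) -> str:
    """Apply the slide's rule: score below the cutoff means deleterious."""
    score = normalized_probabilities(column)[variant]
    return "deleterious" if score < cutoff else "neutral"


conserved = "H" * 30                 # hypothetical fully conserved position
variable = "AVILAVLIMAVTSAGLIVAA"    # hypothetical diverse position

print(classify(conserved, "R"))  # R never seen at a conserved position
print(classify(variable, "V"))   # V is common at a variable position
```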

Prediction results

Score cutoff: 0.05

Accuracy of SIFT: • False negative rate: 31% • False positive rate: 20% • Coverage: 60%

                          Truth ("gold standard")
                          Positive                  Negative
Test outcome  Positive    True Positive (hit)       False Positive (false alarm)
              Negative    False Negative (miss)     True Negative (correct rejection)

Positive predictive value (PPV) = Precision = TP/(TP+FP)

Negative predictive value (NPV) = TN/(TN+FN)

Sensitivity = Recall = TP/(TP+FN)

Specificity = TN/(TN+FP)

Accuracy = (TP+TN)/total

False negative rate (β) = Type II error = 1 - sensitivity = FN/(TP+FN)

False positive rate (α) = Type I error = 1 - specificity = FP/(TN+FP)

False discovery rate (FDR) = 1 - precision = FP/(TP+FP)
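The definitions above translate directly into code. In this sketch the function name is an assumption, and the four counts are hypothetical values chosen only so that the false negative and false positive rates match the 31% and 20% quoted for SIFT.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the standard rates from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),           # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),             # PPV
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        "false_negative_rate": fn / (tp + fn),   # 1 - sensitivity
        "false_positive_rate": fp / (tn + fp),   # 1 - specificity
        "false_discovery_rate": fp / (tp + fp),  # 1 - precision
    }


# Hypothetical counts, chosen to give FNR = 31% and FPR = 20%.
m = confusion_metrics(tp=69, fp=20, fn=31, tn=80)
for name, value in m.items():
    print(f"{name:22s} {value:.3f}")
```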

Polymorphism Phenotyping (PolyPhen): a rule-based method

Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.

Changes in protein structure may affect protein function, which may lead to phenotype change.

PolyPhen predicts the impact of amino acid allelic variants based on multi-sequence alignment AND protein 3D structure features.

PolyPhen

1. Multi-sequence alignment of homologous sequences

2. Structure-based characterization of the substitution site: • DISULFIDE, THIOLEST or THIOETH bond, BINDING site, ACTIVE site, etc. • Whether the variant is located in transmembrane regions • Whether the variant is located in coiled coil regions • Whether the variant is located in signal peptide regions

3. Get the protein 3D structure, or use homology modeling to predict its structure

4. Calculate the 3D structure features of the substitution site: • Secondary structure • Solvent accessible surface area • Φ/Ψ dihedral angles • Normalized B-factor for the residue • Loss of hydrogen bond • Contacts with critical sites, ligands or other polypeptide chains

PolyPhen uses empirically derived rules to predict whether an nsSNP is damaging or benign.

PolyPhen Pros

Improved prediction accuracy when protein 3D structure is available

Cons

If 3D structure is not available, it can only depend on MSA.

The rules are empirical.

PolyPhen2

An improved version of PolyPhen, released in 2010: http://genetics.bwh.harvard.edu/pph2/

Use more predictive features Based on Naïve Bayes machine learning

Improved performance compared with PolyPhen

Unit 4: Classifier-based methods: SAPRED


Formulate as a supervised classification problem

1. Collect training data: disease-associated (+) and neutral (-) SAPs

2. Compute structural attributes and sequence attributes (60 attributes in 10 groups)

3. Attribute evaluation and subset selection

4. Build the SVM classifier on the training data

5. Apply the classifier to newly identified SAPs

Single Amino acid Polymorphisms disease‐association Predictor (SAPRED)

Currently SAPRED supports two types of predictions:

• One is based on both the structural and sequence information

• The other relies on the sequence information only

The former aims at higher prediction accuracy and more attributes with putative biological insights, while the latter can work with more queries whose structural models are not available.

PDB – get protein 3D structure http://www.rcsb.org/pdb/home/home.do

Homology Modeling

http://swissmodel.expasy.org/


Biologically-Intuitive Attributes

Residue frequencies, conservation score,

Solvent accessibilities and Cβ density, secondary structure...

New attributes:

Structural neighbor profile

Nearby functional sites

Disordered regions

Hydrogen bonds change

β-aggregation

HLA family

Residue frequencies in MSA

LacI 5-38

Definition:

A 20-D vector: take the Cα of the SAP residue as the center and draw a sphere with a specific radius. The residues inside are counted to get the number for each of the 20 kinds of residues. Each number is a component of the vector.

$$N_{R,a_i} = \sum_{j=1}^{L} X_j$$

where $X_j = 1$ if residue $j$ is of type $a_i$ and $r_{j,c} < R$; otherwise, $X_j = 0$.

R: radius

L: protein length

a_i: a specific residue type

r_{j,c}: distance between residue j and the center residue c


The center is H128, radius is 10 Angstroms. Neighbors are:

42-47: LLICTY
50-52: AGT
55: I
59: V
106-110: LKTHL
112: T
125-127: KFL
129-131: VAR
176-177: HV
180-181: WW
184: K
188-194: QILFLFY
197: I
208: V
211: F

a.a.  A  C  D  E  F  G  H  I  K  L
N     2  1  0  0  4  1  2  4  3  7

a.a.  M  N  P  Q  R  S  T  V  W  Y
N     0  0  0  1  1  0  4  4  2  2

Structural neighbor profile: vector
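The profile can be computed as in the sketch below. The function `neighbor_profile` and its arguments are hypothetical names; it assumes one Cα coordinate per residue and leaves the center residue out of the count, as the H128 example does. The second part recounts the neighbor segments listed for H128.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def neighbor_profile(sequence, ca_coords, center, radius):
    """20-D vector: number of residues of each type whose C-alpha lies
    within `radius` of the center residue (the center itself is skipped)."""
    counts = Counter()
    for j, (aa, xyz) in enumerate(zip(sequence, ca_coords)):
        if j != center and math.dist(xyz, ca_coords[center]) < radius:
            counts[aa] += 1
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


# Toy structure (hypothetical coordinates): only A is within 10 A of H.
toy = neighbor_profile("AHG", [(0, 0, 0), (1, 0, 0), (20, 0, 0)], center=1, radius=10.0)

# Neighbor segments listed on the slide for center H128, radius 10 A.
segments = ["LLICTY", "AGT", "I", "V", "LKTHL", "T", "KFL", "VAR",
            "HV", "WW", "K", "QILFLFY", "I", "V", "F"]
slide_counts = Counter("".join(segments))
slide_profile = [slide_counts.get(aa, 0) for aa in AMINO_ACIDS]

for aa, n in zip(AMINO_ACIDS, slide_profile):
    print(aa, n)
```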

Predictive power of different structural neighbor profiles

[Figure: overall accuracy (y-axis, about 0.66 to 0.78) versus radius in Å (x-axis, 0 to 20) for three profiles: wildtype profile, variant profile, profile difference.]

Different radii gave different predictive power.

We selected 13 Angstroms as the optimal value of the radius.


Functional sites such as ACT_SITE and METAL, annotated in Swiss-Prot, have intuitive biological meaning.

SAPs located exactly on these sites would disturb protein function heavily, but they have only low coverage in the dataset.

We proposed that SAPs in the vicinity of functional sites are also more likely than others to affect protein function. This enlarged the coverage of these attributes in the dataset.

Disordered Region

Of 122 SAPs in disordered regions, 114 (93%) are disease-associated.

From: http://ist.temple.edu/disprot/index.php

Hydrogen bond change

Changed hydrogen bonds    Disease    Polymorphism    Ratio
-6                        1          0               1/0
-5                        12         1               12
-4                        44         2               22
-3                        114        16              7.25
-2                        230        55              4.18
-1                        403        213             1.89
0                         1142       716             1.59
1                         224        142             1.58
2                         68         36              1.89
3                         11         4               2.75
4                         0          2               0
5                         0          2               0

Other attributes

Of 52 SAPs in transmembrane regions, 49 (94%) are disease-associated.

Of 194 SAPs that altered β-aggregation properties, 169 (87%) are disease-associated.

Of 435 SAPs from HLA families, all except one are "polymorphism".

SVM classifier

• SVM: support vector machine

• Separates transformed data with a hyperplane in a high-dimensional space

• Kernel function: Radial Basis Function (RBF)

• Grid search to select proper parameter values

Support Vector Machine (SVM) Classifier: grid search for parameters

log2C = 1; log2g = -7

Five-fold cross-validation

Part     Total proteins    Total SAP    Deleterious SAP    Neutral SAP
1        105               686          449                237
2        104               688          450                238
3        105               688          450                238
4        105               688          450                238
5        103               688          450                238
Total    522               3438         2249               1189
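A split with the same class balance per part can be produced with a stratified partition, sketched below. This is only an illustration: the function name is an assumption, and the partition on the slide may have been built differently (for example, by grouping SAPs by protein).

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices, spreading each class evenly over the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices out round-robin
    return folds


# Class sizes taken from the table: 2249 deleterious (1), 1189 neutral (0).
labels = [1] * 2249 + [0] * 1189
folds = stratified_folds(labels)

for part, fold in enumerate(folds, start=1):
    deleterious = sum(labels[i] for i in fold)
    print(part, len(fold), deleterious, len(fold) - deleterious)
```

In each round of cross-validation, one fold is held out for testing and the other four are used for training.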

Accuracy: ACC and MCC

SAP status                 Predicted as disease-association (+)    Predicted as polymorphism (-)
Disease-association (+)    TP                                      FN
Polymorphism (-)           FP                                      TN

Overall accuracy:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

Matthews correlation coefficient:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}$$
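Both measures are straightforward to compute from the four counts. In the sketch below, the function names and the example counts are hypothetical, and returning 0.0 when the MCC denominator is zero is a common convention rather than something stated on the slide.

```python
import math


def acc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall accuracy: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denom == 0:
        return 0.0  # assumption: convention for a degenerate table
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration.
print(acc(tp=40, tn=40, fp=10, fn=10))
print(mcc(tp=40, tn=40, fp=10, fn=10))
```

Unlike ACC, MCC stays informative when the two classes have very different sizes, which is why the two are reported together.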

Predictive power

SAPRED web server

http://sapred.cbi.pku.edu.cn/

Run SAPRED

Results

Explanation of Results: Structural attributes

Explanation of Results: sequence attributes

Results using SAPRED_Seq

ACC=81.5% MCC=0.577

Unit 5: Support Vector Machine (SVM)

Le Zhang, Ph. D.

Computer Science Department Southwest University

Machine learning model

[Figure: training data (Var1, Var2, Var3, ..., VarN) are used to build a model, which then makes predictions for new data.]

Methods: SVM, HMM, Bayesian, decision tree, neural network, random forest, ensemble learning, ……

Peking University

Classification

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in.

Introduction

An SVM is a supervised learning model that analyzes data and recognizes patterns, and it is used for classification and regression analysis. It selects a small number of critical boundary instances, called support vectors, from each class and builds a linear discriminant function that separates them as widely as possible. SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Consider a two-class, linearly separable classification problem. Many decision boundaries are possible! Are all decision boundaries equally good?

What is a good Decision Boundary?

Decision Boundary

Intuitively, the best hyperplane is the one that represents the largest separation, or margin, between the two classes, since the larger the margin is, the lower the generalization error of the classifier will be.

Support Vector

The instances that are closest to the maximum-margin hyperplane (the ones with the minimum distance to it) are called support vectors.

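SVM - mathematics

The data point is denoted by $x$, which is an n-dimensional vector, and $y$ is 1 or -1 to represent the two different classes. The hyperplane is

$$w^T x + b = 0$$

So the classification function is

$$f(x) = w^T x + b$$

and

$$y = \begin{cases} 1, & f(x) > 0 \\ -1, & f(x) < 0 \end{cases}$$

For a correctly classified point, $y f(x) > 0$, and in fact $y f(x) = |f(x)|$. So the functional margin of a point $(x_i, y_i)$ is

$$\hat{\gamma}_i = y_i (w^T x_i + b) = y_i f(x_i)$$

The functional margin of a hyperplane is measured by

$$\hat{\gamma} = \min_{i} \hat{\gamma}_i, \quad i = 1, 2, \dots, n$$

SVM - mathematics

The confidence of a classification can be measured by the functional margin, which is $|f(x)|$, and whether the classification is right can be determined by the consistency of the signs of $f(x)$ and $y$.

However, the functional margin can be scaled even if the hyperplane remains the same, for example when $w$ and $b$ are changed into $2w$ and $2b$.

A more intuitive measurement can be obtained using the distance from the point to the hyperplane, which is called the geometrical margin:

$$\gamma = \frac{y f(x)}{\|w\|} = \frac{\hat{\gamma}}{\|w\|}$$

In this maximum margin classifier, we want to maximize $\gamma$. Because the functional margin is scalable, we can assume $\hat{\gamma} = 1$ without influencing the optimal result.

SVM - mathematics

$$\max \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n$$

which is equivalent to

$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n$$

This is an optimization model with constraints, and it can be solved by quadratic programming.

SVM - mathematics

We can also solve this with Lagrange multipliers $\alpha_i \ge 0$. So the objective function is

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$$

Setting the partial derivatives to zero gives

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0$$

SVM - mathematics

Finally the classification function can be rewritten as

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b$$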
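The rewritten classification function uses the training points only through inner products. The sketch below evaluates it on a hypothetical two-point problem whose maximum-margin solution can be worked out by hand: for the points (1, 1) with label +1 and (-1, -1) with label -1, the constraints hold with equality when both multipliers are 0.25 and b is 0, giving w = (0.5, 0.5). The function names are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, support_vectors, labels, alphas, b):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b"""
    return sum(a * y * dot(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


def classify(x, support_vectors, labels, alphas, b):
    """Predict +1 or -1 from the sign of the decision function."""
    return 1 if decision_function(x, support_vectors, labels, alphas, b) >= 0 else -1


# Hypothetical toy problem solved by hand (see the lead-in).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

for point in [(1.0, 1.0), (-1.0, -1.0), (3.0, 0.5), (-2.0, 0.0)]:
    f = decision_function(point, support_vectors, labels, alphas, b)
    print(point, f, classify(point, support_vectors, labels, alphas, b))
```

Both support vectors have a functional margin of exactly 1, and the multipliers satisfy the constraint that the sum of alpha_i * y_i is zero.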

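SVM - kernel

The linear learning machine has very limited ability in practice because of the complexity of the real world, which needs a more flexible hypothesis space. We can use a function ϕ to map x to a higher-dimensional space, in which all the points can be linearly separable.

So the classification function can be extended as

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$$

Here we get the kernel function:

$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$

kernel

Take the points in the picture for example: the two classes can be separated by a circle,

$$a_1 X_1 + a_2 X_1^2 + a_3 X_2 + a_4 X_2^2 + a_5 X_1 X_2 + a_6 = 0$$

Then we can construct a 5-dimensional space, where

$$Z_1 = X_1, \quad Z_2 = X_1^2, \quad Z_3 = X_2, \quad Z_4 = X_2^2, \quad Z_5 = X_1 X_2$$

So the hyperplane in the new feature space is

$$\sum_{i=1}^{5} a_i Z_i + a_6 = 0$$

kernel

Linear kernel: $K(x_1, x_2) = \langle x_1, x_2 \rangle$

Polynomial kernel: $K(x_1, x_2) = (\langle x_1, x_2 \rangle + R)^d$

Gauss kernel: $K(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$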
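The three kernels and the five-dimensional feature map can be written out as follows. This is a minimal sketch: the parameter names `r`, `d` and `sigma`, the default values, and the choice of the unit circle as the separating curve are assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, r=1.0, d=2):
    return (dot(x, z) + r) ** d


def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def phi(x):
    """Explicit map of a 2-D point to (X1, X1^2, X2, X2^2, X1*X2)."""
    x1, x2 = x
    return (x1, x1 * x1, x2, x2 * x2, x1 * x2)


# The unit circle X1^2 + X2^2 - 1 = 0 is a hyperplane in the mapped space:
# coefficients (a1..a5) and offset a6.
a = (0.0, 1.0, 0.0, 1.0, 0.0)
a6 = -1.0


def side_of_circle(x):
    """Negative inside the circle, positive outside."""
    return dot(a, phi(x)) + a6


print(side_of_circle((0.1, 0.2)))   # inside
print(side_of_circle((2.0, 0.0)))   # outside
print(gaussian_kernel((1.0, 2.0), (1.0, 2.0)))
```

A boundary that is curved in the original two coordinates becomes linear after the mapping, which is what lets a linear classifier separate the two classes there.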

Kernel function

Gauss kernel

SVM - example

Linear kernel

Applications

SVM has been used successfully in many real-world problems: • bioinformatics (mutation classification, cancer classification) • text (and hypertext) categorization • image classification (different types of sub-problems) • hand-written character recognition

Pros and Cons

Pros: With support vectors, the maximum-margin hyperplane is relatively stable. SVMs often produce very accurate classifiers because subtle and complex decision boundaries can be obtained.

Cons: Compared with other methods, even the fastest training algorithms for support vector machines are slow when applied in the nonlinear setting.

Unit 6: Comparative Protein Structure Modeling of Genes and Genomes

Le Zhang, Ph. D.

Computer Science Department Southwest University

Catalogue

• What is comparative protein structure modeling?

• Why could we do comparative modeling?

• Why is comparative modeling important?

• How to do comparative modeling? Fold assignment and template selection; target-template alignment; model building; model evaluation

• The application of comparative modeling

• Comparative modeling in structural genomics

1. What Is Comparative Protein Structure Modeling?

• Comparative protein structure modeling predicts the three‐ dimensional structure for a given protein sequence of unknown structure (target) on the basis of sequence similarity to proteins of known structure (the templates).

2. Why Could We Do Comparative Modeling?

• Small changes in the protein sequence usually result in small changes in its 3D structure. If similarity between two proteins is detectable at the sequence level, structural similarity can usually be assumed.

• The number of unique structural folds that proteins adopt is limited, and the number of experimentally determined new structures is increasing exponentially.

3. Why Is Comparative Modeling Important?

• It is an efficient way to obtain useful information about the proteins of interest.

• Designing mutants to test hypotheses about a protein's function

• Identifying active and binding sites

• Identifying, designing and improving ligands for a given binding site

• Modeling substrate specificity

• Predicting antigenic epitopes

• Simulating protein-protein docking

• Inferring function from a calculated electrostatic potential around the protein

• Facilitating molecular replacement in X-ray structure determination

• Refining models based on NMR constraints

• Testing and improving a sequence-structure alignment

• Confirming a remote structural relationship

• Rationalizing known experimental observations

4. How To Do Comparative Modeling?

• Fold assignment and template selection

• Target – template alignment

• Model Building

• Model evaluation

4.1 Fold Assignment And Template Selection

• Three main classes of protein comparison methods:

1. Comparing the target sequence with each of the database sequences independently. Programs: BLAST, FASTA, etc.

2. Using multiple sequence comparisons to improve the sensitivity of the search. Program: PSI-BLAST, etc. *Especially useful when the sequence identity is below 25%.

3. Threading or 3D template matching methods. *Especially useful when there are no sequences clearly related to the modeling target.

• Template selection considers: a higher sequence similarity; the family of proteins; the quality of the template structure; solvent, pH, ligands…

• Potential problems:

Distantly related proteins used as templates (i.e., less than 25% sequence identity) may produce an unreliable model.

• The databases and programs you may use in this step:

[Table of databases and programs]

Table notes: a: S, server; P, program. b: Some of the sites are mirrored on additional computers. c: (a) MolSoft Inc., San Diego. (b) Molecular Simulations Inc., San Diego. (c) Tripos Inc., St Louis. (d) ProCeryon Biosciences Inc., New York.

4.2 Target-Template Alignment

• Once templates have been selected, a specialized method should be used to align the target sequence with the template structures. Program: CLUSTAL, etc.

• The alignment becomes difficult in the "twilight zone" of less than 30% sequence identity. (Only 20% of the residues are likely to be correctly aligned when two proteins share 30% sequence identity.)

BLOSUM62 is built from sequence blocks clustered at 62% identity; BLOSUM45 and BLOSUM80 are the corresponding matrices at about 45% and 80%.


• In difficult cases, it is frequently beneficial to rely on multiple structure and sequence information. The information from structures helps to avoid gaps in secondary structure elements, in buried regions, or between two residues that are far in space.

• Potential problems: Although you can use the methods aforementioned, misalignment may occur especially when the target‐template sequence identity decreases below 30%.
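Whether a target-template pair falls below the 30% threshold can be checked from a pairwise alignment, as in the sketch below. Conventions for percent identity differ; here it is computed over columns where neither sequence has a gap, which is an assumption, as are the function names and the example alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(identity: float, threshold: float = 30.0) -> bool:
    """True when the identity is below the 30% threshold from the slide."""
    return identity < threshold


# Hypothetical aligned pair: four matches in five gap-free columns.
identity = percent_identity("ACDE-F", "ACDQGF")
print(identity, in_twilight_zone(identity))
```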


• Programs and World Wide Web servers you may use in this step:

[Table of programs and servers]

Table notes: c: (a) MolSoft Inc., San Diego. (b) Molecular Simulations Inc., San Diego. (c) Tripos Inc., St Louis. (d) ProCeryon Biosciences Inc., New York.

4.3 Model Building

• Three classes of methods can be used to construct a 3D model:

1. Modeling by Assembly of Rigid Bodies

Assemble a model from a small number of rigid bodies obtained from aligned protein structures.

2. Modeling by Segment Matching or Coordinate Reconstruction

Use a subset of atomic positions from template structures as "guiding" positions, then identify and assemble short, all-atom segments that fit these guiding positions.

3. Modeling by Satisfaction of Spatial Restraints

Generate many constraints or restraints on the structure of the target sequence, using its alignment to related protein structures as a guide.




4.3.1 Loop Modeling

• Loops often determine the functional specificity of a given protein framework. They contribute to active and binding sites.

• Loop modeling can be seen as a mini protein folding problem, but loops are generally too short to provide sufficient information about their local fold.

• Three methods:

1) Ab initio methods

2) Database search techniques

3) Both

4.3.2 Sidechain Modeling

• Side chain conformations are predicted from similar structures and from steric or energetic considerations.

• They are modeled using structural information from proteins in general and from equivalent disulfide bridges in related structures.

• Two effects on sidechain conformation:

1) The coupling between the main chain and side chains

2) The continuous nature of the distributions of side-chain dihedral angles

• Three different side-chain prediction methods:

1) The packing of backbone-dependent rotamers

2) The self-consistent mean-field approach to positioning rotamers based on their van der Waals interactions

3) The segment-matching method of Levitt

4.3.3 Potential Problems

• According to a recent survey that analyzed the accuracy of 3 modeling methods, they can correctly predict only approximately 50% of χ1 angles and 35% of both χ1 and χ2 angles.

• Segments of the target sequence that have no equivalent region in the template structure (i.e., insertions or loops) are the most difficult regions to model, especially when the insertion is more than 9 residues long.

• In some correctly aligned segments of a model, the template is locally different (<3 Å) from the target, resulting in errors in that region.

• As the sequences diverge, the packing of side chains in the protein core may change.

4.4 Model Evaluation

• Typical errors in comparative models:

1. Errors in side-chain packing

2. Distortions and shifts in correctly aligned regions

3. Errors in regions without a template

4. Errors due to misalignments

5. Incorrect template

The criteria of evaluation

• Having the correct fold or not

• The target-template sequence similarity

• Distributions of many spatial features

• The environment

• Having good stereochemistry or not

1) Having the correct fold or not

A model will have the correct fold if the correct template is picked and if that template is aligned at least approximately correctly with the target sequence.

The fold of a model can be assessed by a high sequence similarity with the closest template, an energy-based Z-score, or by conservation of the key functional or structural residues in the target sequence.

2) The target-template sequence similarity

Sequence identity above 30% is a relatively good predictor of the expected accuracy.

[Figure: average model accuracy as a function of the template-target sequence similarity.]

EDN: human eosinophil neurotoxin, a ribonuclease with 3 α-helices and 2 three-stranded antiparallel β-sheets arranged in a single domain.

CRABPI: mouse cellular retinoic acid binding protein I, a single domain protein composed of interacting α-helices packed at the edge of two orthogonal, 4- and 6-stranded antiparallel β-sheets. For the CRABPI model, 90% of Cα atoms superpose within 3.5 Å of their counterparts in the X-ray structure; the rms error is 1.31 Å.

NM23H2: human nucleoside diphosphate kinase, a single domain protein consisting of a central 4-stranded antiparallel β-sheet surrounded by 8 α-helices. For the NM23H2 model, all but one Cα atom superpose within 3.5 Å of the X-ray structure; the rms difference is 0.41 Å.

Solid line: sample models. Dotted line: corresponding actual structures.

Percentage structure overlap is defined as the fraction of equivalent residues. Two residues are equivalent when their Cα atoms are within 3.5 Å of each other upon rigid-body, least-squares superposition of the two structures.
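The overlap measure can be sketched as below. The code assumes the two structures have already been superposed and that the Cα coordinates are listed in the same residue order; the superposition step itself is not shown. The function names and the coordinates are hypothetical.

```python
import math


def structure_overlap(model_ca, native_ca, cutoff=3.5):
    """Percentage of residue pairs whose C-alpha atoms lie within `cutoff`
    angstroms of each other (structures assumed already superposed)."""
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    dists = [math.dist(p, q) for p, q in zip(model_ca, native_ca)]
    return 100.0 * sum(d <= cutoff for d in dists) / len(dists)


def rms_error(model_ca, native_ca):
    """Root-mean-square distance between paired C-alpha atoms."""
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    sq = [math.dist(p, q) ** 2 for p, q in zip(model_ca, native_ca)]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical coordinates: pairwise distances are 0, 3 and 5 angstroms.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0), (10.0, 0.0, 5.0)]

print(f"overlap: {structure_overlap(model, native):.1f}%")
print(f"rms error: {rms_error(model, native):.2f} A")
```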

3) The environment

Example: some calcium‐binding proteins undergo large conformational changes when bound to calcium. If a calcium‐free template is used to model the calcium‐bound state of the target, it is likely that the model will be incorrect.

4) Having good stereochemistry or not

Including bond lengths, bond angles, peptide bond and side‐chain ring planarities, chirality, main‐chain and side‐chain torsion angles, and clashes between nonbonded pairs of atoms.

5) Distributions of many spatial features

Such features include packing, formation of a hydrophobic core, residue and atomic solvent accessibilities, spatial distribution of charged groups, distribution of atom-atom distances, atomic volumes, and main-chain hydrogen bonding.



• There are also methods for testing 3D models that implicitly take into account many of the criteria listed above. These methods are based on 3D profiles and statistical potentials of mean force.

• A physics‐based approach to deriving energy functions has been tested for use in protein structure evaluation (1999).

5. The Application of Comparative Modeling

• Three levels of model accuracy and some of the corresponding applications

Three levels

Level              Sequence identity    Accuracy
Low accuracy       <30%                 Less than 50% of their Cα atoms within 3.5 Å of their correct positions
Middle accuracy    30-50%               85% of their Cα atoms within 3.5 Å of their correct positions
High accuracy      >50%                 Approaches that of low resolution X-ray structures or medium resolution NMR structures

rw (van der Waals radius) of C atom = 1.70 Å

• Applications 1: low accuracy models

• <30% sequence identity, having the correct fold

• Less than 50% of their Cα atoms within 3.5 Å of their correct positions

• Use: To confirm or reject a match between remotely related proteins

• Applications 2: middle accuracy models

• 30-50% sequence identity

• 85% of their Cα atoms within 3.5 Å of their correct positions

• Use: Refinement of the functional prediction based on sequence; construction of site-directed mutants with altered or destroyed binding capacity; other problems...

• Applications 3: high accuracy models

• >50% sequence identity

• The average accuracy of these models approaches that of low resolution X-ray structures (3 Å resolution) or medium resolution NMR structures (10 distance restraints per residue)

• Use: For docking of small ligands or whole proteins onto a given protein.
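The three accuracy levels can be expressed as a simple lookup on sequence identity. The function name and the handling of the exact boundary values (30% and 50% are assigned to the middle level here) are assumptions, since the slides do not say which level the boundaries belong to.

```python
TYPICAL_USE = {
    "low": "confirm or reject a match between remotely related proteins",
    "middle": "refine functional predictions; design site-directed mutants",
    "high": "dock small ligands or whole proteins onto the protein",
}


def accuracy_level(sequence_identity: float) -> str:
    """Map target-template sequence identity (percent) to a model level."""
    if not 0.0 <= sequence_identity <= 100.0:
        raise ValueError("sequence identity must be a percentage")
    if sequence_identity < 30.0:
        return "low"
    if sequence_identity <= 50.0:
        return "middle"
    return "high"


for identity in (25.0, 40.0, 70.0):
    level = accuracy_level(identity)
    print(f"{identity:.0f}% identity: {level} accuracy; use: {TYPICAL_USE[level]}")
```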

6. Comparative modeling in structural genomics

• The aim of structural genomics is to determine or accurately predict the 3D structure of all the proteins encoded in the genomes.

• This aim will be achieved by a focused, large‐scale determination of protein structures by X‐ray crystallography and NMR spectroscopy, combined efficiently with accurate protein structure modeling techniques.


• For comparative modeling to contribute to structural genomics, automation of all the steps in the modeling process is essential.

• The automation of large‐scale comparative modeling involves assembling a software pipeline that consists of modules for fold assignment, template selection, target–template alignment, model generation, and model evaluation.

• Two examples of large-scale comparative modeling for complete genomes:

SWISS-MODEL web server: the sequences encoded in the E. coli genome have been used to build models for 10-15% of the proteins.

MODPIPE: MODPIPE produced models for five prokaryotic and eukaryotic genomes. This calculation resulted in models for substantial segments of 17.2%, 18.1%, 19.2%, 20.4%, and 15.7% of all proteins in the genomes of Saccharomyces cerevisiae (6218 proteins in the genome), Escherichia coli (4290 proteins), Mycoplasma genitalium (468 proteins), Caenorhabditis elegans (7299 proteins, incomplete), and Methanococcus jannaschii (1735 proteins).

• Large-scale comparative modeling will extend opportunities to tackle a myriad of problems by providing many protein models for many genomes:

• Protein evolution

• Drug design

• A facile comparison of ligand binding requirements and substitutions in and around important residues

• ......

A specific example: the selection of a target protein for drug development.


7. Conclusion

• Over the past few years, there has been a gradual increase in both the accuracy of comparative models and the fraction of protein sequences that can be modeled with useful accuracy.

• Further advances are necessary in recognizing weak sequence-structure similarities, aligning sequences with structures, modeling of rigid body shifts, distortions, loops and side chains, as well as detecting errors in a model.

• It is currently possible to model with useful accuracy significant parts of approximately one third of all known protein sequences.

• A major new challenge for comparative modeling is its integration with the torrents of data from genome sequencing projects as well as from functional and structural genomics.

Reference

• Martí-Renom M A, Stuart A C, Fiser A, et al. Comparative protein structure modeling of genes and genomes[J]. Annual review of biophysics and biomolecular structure, 2000, 29(1): 291-325.

• Šali A, Potterton L, Yuan F, et al. Evaluation of comparative protein modeling by MODELLER[J]. Proteins: Structure, Function, and Bioinformatics, 1995, 23(3): 318-326.

• Fiser A, Do R K G, Šali A. Modeling of loops in protein structures[J]. Protein science, 2000, 9(9): 1753-1773.

• Sánchez R, Šali A. Comparative protein structure modeling in genomics[J]. Journal of Computational Physics, 1999, 151(1): 388-401.


Thank you
