bioinformatics: introduction and methods

Bioinformatics: Introduction and Methods Le Zhang Computer Science Department, Southwest University


Post on 15-Oct-2021


Page 1: Bioinformatics: Introduction and Methods

Bioinformatics: Introduction and Methods Le Zhang

Computer Science Department, Southwest University

Page 2: Bioinformatics: Introduction and Methods

Functional prediction of genetic variants

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 3: Bioinformatics: Introduction and Methods

Unit 1: Overview of the problem

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 4: Bioinformatics: Introduction and Methods

Do you think Angelina made the right decision to remove her breasts?

Page 5: Bioinformatics: Introduction and Methods

Angelina Jolie has a genetic mutation in BRCA1.

How can we predict the likelihood of her getting breast cancer given this mutation?
• P(breast cancer | her mutation)
• P(breast cancer free | her mutation)

Page 6: Bioinformatics: Introduction and Methods

The dawning of the age of personalized medicine

Next-generation sequencing can sequence one person's whole genome for ~$3,000.

Personal genomes hold promise for a future of personalized medicine.

Page 7: Bioinformatics: Introduction and Methods

Where did your genetic variations come from?

• Somatic mutations
• De novo mutations
• Inherited from parents

Annapurna Poduri et al. Somatic Mutation, Genomic Variation, and Neurological Disease. Science, 5 July 2013, 341.

Page 8: Bioinformatics: Introduction and Methods

Types of genetic variations in a human genome

• Chromosomal aneuploidy
• Structural Variations (SVs)
• Copy Number Variations (CNVs)
• Short insertion/deletions (indels)
• Single Nucleotide Variations (SNVs)

Nomenclature: Mutation vs. polymorphism vs. variation vs. variant

Page 9: Bioinformatics: Introduction and Methods

Structural Variation (SV) and Copy Number Variation (CNV)

• Insertion
• Deletion
• Inversion
• Translocation
• CNV

Page 10: Bioinformatics: Introduction and Methods

Indel – short Insertion/Deletion

• Within intergenic/intronic regions
• Within coding regions: frameshifting or non-frameshifting

Page 11: Bioinformatics: Introduction and Methods

SNV – Single Nucleotide Variation

There are about 3 million SNVs in one person's genome, equivalent to a frequency of ~1/1000.
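As a quick check on the ~1/1000 figure, the sketch below divides the SNV count by the genome length. The genome length of about 3 billion base pairs is an assumption (it is not stated on the slide), so the result is only an order-of-magnitude estimate.

```python
# Rough SNV density: number of SNVs divided by genome length.
SNVS_PER_GENOME = 3_000_000        # from the slide
GENOME_LENGTH_BP = 3_000_000_000   # assumption: ~3 Gb haploid human genome


def snv_density(n_snvs: int, genome_length_bp: int) -> float:
    """Return the average number of SNVs per base pair."""
    return n_snvs / genome_length_bp


density = snv_density(SNVS_PER_GENOME, GENOME_LENGTH_BP)
print(f"about 1 SNV every {round(1 / density)} bp")
```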

Page 12: Bioinformatics: Introduction and Methods

SNVs within coding regions

• Stop gain (nonsense)
• Stop loss
• Non-synonymous (missense)
• Synonymous (silent)
• Affect splicing

Page 13: Bioinformatics: Introduction and Methods

Missense (nonsynonymous) SNVs

Missense SNVs change the amino acid.

Missense SNVs account for ~2% of the genome but >50% of all mutations known to be involved in human inherited diseases.

Page 14: Bioinformatics: Introduction and Methods

BRCA1 vs. breast cancer

In 1990, DNA linkage studies on large families identified BRCA1 as the first gene associated with breast cancer.

BRCA1:
• Located on chromosome 17
• 80,818 bp in length
• 23 exons
• Encodes a protein of 1,863 amino acids
• A tumor suppressor gene that repairs damaged DNA and regulates cell growth and cell death

Approximately 5-10% of breast cancers and 14% of ovarian cancers arise from a BRCA1 or BRCA2 genetic mutation.

Page 15: Bioinformatics: Introduction and Methods

However, not all missense SNVs cause phenotype change. Some are pathogenic, but many are neutral.

A total of 238 known missense variations in BRCA1:
• 163 are present only in patients
• 62 are present only in healthy persons
• 13 are present in both patients and healthy persons

Page 16: Bioinformatics: Introduction and Methods

On average, a healthy individual has

Class                        Number
SNP                          3,019,909
Indel                        361,669
Deletions                    15,893
Duplications                 407
Mobile element insertions    4,775

Within protein-coding regions,

Class                                  Number
Synonymous SNPs                        60,157
Non-synonymous SNPs                    68,300
Small in-frame indels                  714
Small frameshift indels                954
Stop losses                            77
Stop-introducing SNPs                  1,057
Genes disrupted by large deletions     147
Total genes containing LOF variants    2,304
HGMD 'damaging mutation' SNPs          671

Page 17: Bioinformatics: Introduction and Methods

Still an unsolved problem with lots of active on‐going research!

• What features differentiate disease-causing variants from neutral ones?

• How can we predict whether a variation is disease-causing?

Page 18: Bioinformatics: Introduction and Methods

Unit 2: Databases of genetic variations

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 19: Bioinformatics: Introduction and Methods
Page 20: Bioinformatics: Introduction and Methods

dbSNP

http://www.ncbi.nlm.nih.gov/SNP/

Created in September 1998 by the NCBI (National Center for Biotechnology Information) in collaboration with the NHGRI (National Human Genome Research Institute)

Its goal is to act as a single database that contains all identified genetic variation

Page 21: Bioinformatics: Introduction and Methods

dbSNP

New information obtained by dbSNP becomes available to the public periodically in a series of "builds".

Contains a range of molecular variation:
• SNPs
• Indels
• Multinucleotide polymorphisms
• Microsatellite markers
• Short tandem repeats
• Heterozygous sequences

As of dbSNP build 138, it consists of variants from 131 organisms. For Homo sapiens:

Number of submissions (ss)        232,952,851
Number of RefSNP clusters (rs)    62,676,337
Validated rs                      44,278,189
Number of rs in gene              27,608,151
Number of ss with genotype        73,909,251
Number of ss with frequency       35,997,830

Page 22: Bioinformatics: Introduction and Methods

dbSNP: data increase

[Figure: growth of dbSNP for Homo sapiens from build 125 in 2005 to build 138 in 2013. x-axis: year; y-axis: count, 0 to 250,000,000; series: number of submissions (ss), number of RefSNP clusters (rs), number of rs in gene.]

Page 23: Bioinformatics: Introduction and Methods

dbSNP- Record

Page 24: Bioinformatics: Introduction and Methods

dbSNP- Record

Page 25: Bioinformatics: Introduction and Methods

1000 Genomes

http://www.1000genomes.org/

The 1000 Genomes Project, launched in January 2008, is an international research effort to establish by far the most detailed catalogue of human genetic variation.

• Pilot: in 2010, the project finished its pilot phase.
• Phase I: in October 2012, the sequencing of 1,092 genomes was announced in a Nature publication.

Page 26: Bioinformatics: Introduction and Methods

1000 Genomes

Page 27: Bioinformatics: Introduction and Methods

1000 Genomes

Sequencing technology used: Illumina, SOLiD, 454

Phase I          Whole genome                            Whole exome
Strategy         Low-coverage whole genome sequencing    Deep sequencing of whole exome
Coverage         2-6X                                    50-100X
Sample number    1,092                                   1,039

Page 28: Bioinformatics: Introduction and Methods

OMIM: Online Mendelian Inheritance in Man

• A database that catalogues all the known diseases with a genetic component and links them to the relevant genes in the human genome
• Contains information on all known Mendelian disorders and over 12,000 genes

http://www.omim.org/

Page 29: Bioinformatics: Introduction and Methods

OMIM

Initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders, published as a book entitled Mendelian Inheritance in Man (MIM).

12 book editions of MIM were published between 1966 and 1998.

The online version, OMIM, was created in 1985 and made generally available on the internet starting in 1987.

Page 30: Bioinformatics: Introduction and Methods

OMIM Entry Statistics

Page 31: Bioinformatics: Introduction and Methods

OMIM

Page 32: Bioinformatics: Introduction and Methods

Human Gene Mutation Database (HGMD)

A comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human inherited disease.

By 2013, the database contained over 141,000 different variants detected in over 5,700 different genes.

Two versions:
• Professional: requires an annual subscription
• Public: freely available but permanently 3 years out of date, and does not contain any of the additional annotations or extra features present in HGMD Professional

Page 33: Bioinformatics: Introduction and Methods

Human Gene Mutation Database (HGMD)

Created by biologist David N. Cooper and mathematician Michael Krawczak in 1996.

Originally established for the scientific study of mutational mechanisms in human genes causing inherited disease, but has since acquired a much broader utility as a central unified repository for germ-line disease-related functional variation.

All HGMD mutation data are manually curated from the scientific literature.

Page 34: Bioinformatics: Introduction and Methods

HGMD

Page 35: Bioinformatics: Introduction and Methods

HGMD 2013.2

Page 36: Bioinformatics: Introduction and Methods

HGMD http://www.hgmd.cf.ac.uk/ac/index.php

Page 37: Bioinformatics: Introduction and Methods

Locus specific databases (LSDBs)

Collect all known variants of each disease-related gene in a specific database

Annotate with complete and accurate information on genetic mutations

Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information

http://www.lovd.nl/3.0/home

Page 38: Bioinformatics: Introduction and Methods

LSDBs

Page 39: Bioinformatics: Introduction and Methods

Unit 3: Conservation-based and Rule-based methods: SIFT & PolyPhen

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 40: Bioinformatics: Introduction and Methods

Questions:

• What features differentiate disease‐causing variants from neutral ones?

• How can we predict whether a variation is disease‐causing?

Page 41: Bioinformatics: Introduction and Methods

Phenotypical/functional “effects” of human genetic variations

• Disease vs. normal • Deleterious vs. neutral

• Personal trait differences (e.g., height)

Observations, not “truth”

Statistical and stochastic, not deterministic

• Animal model phenotypic changes • Cellular phenotypic changes

• Protein function changes

• Protein structure changes

• Protein sequence changes

Page 42: Bioinformatics: Introduction and Methods

• Nonsense mutations are usually considered deleterious.
  • even though it is not always the case…

• Known deleterious mutations are enriched in nonsynonymous mutations.
  • ~50% of known mutations of Mendelian disorders are nonsynonymous mutations
  • ascertainment bias?

• Synonymous mutations, intronic mutations, and intergenic mutations are under-studied.
  • According to GWAS, 88% of trait-associated variants of weak effect are non-coding.

• Most research so far has focused on nonsynonymous mutations.

Page 43: Bioinformatics: Introduction and Methods

1999: Earliest attempt based on BLOSUM substitution matrix

• Assumption: if the substitution score between a variant residue and the wild type residue is positive, then the variant is neutral. If the substitution score is negative, then the variant is deleterious.
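The rule above can be sketched in a few lines. The dictionary below holds only a handful of BLOSUM62 scores as an excerpt, not the full matrix, and the function names are illustrative. The rule as stated says nothing about a score of exactly 0, so this sketch returns "undetermined" in that case, which is an assumption.

```python
# Excerpt of BLOSUM62 scores for a few residue pairs (the matrix is symmetric,
# so each pair is stored once with its letters in alphabetical order).
BLOSUM62_EXCERPT = {
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("A", "G"): 0,
    ("G", "W"): -2,
    ("D", "L"): -4,
}


def blosum_score(wild_type: str, variant: str) -> int:
    """Look up the substitution score for an unordered residue pair."""
    key = tuple(sorted((wild_type, variant)))
    return BLOSUM62_EXCERPT[key]


def predict_by_substitution_score(wild_type: str, variant: str) -> str:
    """Apply the 1999-style rule: positive score -> neutral, negative -> deleterious."""
    score = blosum_score(wild_type, variant)
    if score > 0:
        return "neutral"
    if score < 0:
        return "deleterious"
    return "undetermined"  # assumption: the rule does not cover a score of 0


print(predict_by_substitution_score("V", "I"))
print(predict_by_substitution_score("L", "D"))
```

Because the score depends only on the two residue types, this rule gives the same answer for a substitution at every position of every protein, which is one motivation for the position-aware methods that follow.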

Page 44: Bioinformatics: Introduction and Methods

More successful methods

• Conservation‐based (e.g., SIFT)

• Rule‐based (e.g., PolyPhen)

• Classifier‐based (e.g., PolyPhen2, SAPRED)

Page 45: Bioinformatics: Introduction and Methods

Sort Intolerant From Tolerant substitutions (SIFT)

Published in 2001 by Pauline C. Ng and Steven Henikoff

The first tool for predicting deleterious amino acid substitutions

Website: http://sift.jcvi.org/

Page 46: Bioinformatics: Introduction and Methods

SIFT bets on evolution

Important positions (such as active sites) tend to be conserved in the protein family across species.
• Mutations at well-conserved positions tend to be deleterious.

Some positions have a high degree of diversity across species.
• Mutations at these positions tend to be neutral.

Page 47: Bioinformatics: Introduction and Methods

SIFT is a multistep procedure

Given a protein sequence:

Step 1. Search for similar sequences

Sequence search database: SWISS‐PROT

PSI-BLAST is run for four iterations to collect a pool of sequences similar to the query

Step 2. Choose closely related sequences that are likely to share similar function

The PSI-BLAST results are grouped together if they are >90% identical in the aligned regions

Page 48: Bioinformatics: Introduction and Methods

Step 3. Obtain the multiple alignment of these chosen sequences

Page 49: Bioinformatics: Introduction and Methods

Step 4. Calculate normalized probabilities for all possible substitutions at each position in the alignment

If the SIFT score is less than 0.05, the SNV is considered to be deleterious. Otherwise, it is considered neutral.
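A simplified sketch of Step 4 and the cutoff follows. It is not SIFT's actual scoring: the real tool also uses pseudocounts so that residues unseen in the alignment do not automatically score zero. Here the score is simply each residue's frequency in one alignment column divided by the frequency of the most common residue; the column itself is hypothetical.

```python
from collections import Counter

SIFT_CUTOFF = 0.05  # cutoff from the slide


def normalized_probabilities(column: str) -> dict:
    """Frequency of each residue in an alignment column, divided by the
    frequency of the most common residue (so the consensus scores 1.0)."""
    residues = [aa for aa in column if aa != "-"]  # ignore gap characters
    counts = Counter(residues)
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}


def classify(column: str, variant: str) -> str:
    """Call a substitution deleterious if its normalized score is below the cutoff."""
    score = normalized_probabilities(column).get(variant, 0.0)
    return "deleterious" if score < SIFT_CUTOFF else "neutral"


# Hypothetical column from 50 aligned sequences: mostly Leu, some Ile.
column = "L" * 45 + "I" * 5
print(classify(column, "I"))  # tolerated: seen in 5 of 50 sequences
print(classify(column, "D"))  # never observed at this position
```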

Page 50: Bioinformatics: Introduction and Methods
Page 51: Bioinformatics: Introduction and Methods

Prediction results

Score cutoff: 0.05

Page 52: Bioinformatics: Introduction and Methods

Accuracy of SIFT

• False negative rate: 31%
• False positive rate: 20%
• Coverage: 60%

Page 53: Bioinformatics: Introduction and Methods

Truth ("gold standard") versus test outcome

                 Truth positive           Truth negative
Test positive    True positive (hit)      False positive (false alarm)
Test negative    False negative (miss)    True negative (correct rejection)

Positive predictive value (PPV) = Precision = TP/(TP+FP)
Negative predictive value (NPV) = TN/(TN+FN)
Sensitivity = Recall = TP/(TP+FN)
Specificity = TN/(TN+FP)
Accuracy = (TP+TN)/total
False negative rate (β) = Type II error = 1 - sensitivity = FN/(TP+FN)
False positive rate (α) = Type I error = 1 - specificity = FP/(TN+FP)
False discovery rate (FDR) = 1 - precision = FP/(TP+FP)
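The definitions above translate directly into code. In this sketch the four counts are hypothetical, chosen only so that the false negative and false positive rates come out at the 31% and 20% quoted for SIFT; they are not the counts from the SIFT evaluation.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the standard rates from the four cells of a confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),          # recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),                  # precision
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        "false_negative_rate": fn / (tp + fn),  # 1 - sensitivity
        "false_positive_rate": fp / (tn + fp),  # 1 - specificity
        "false_discovery_rate": fp / (tp + fp), # 1 - precision
    }


# Hypothetical counts: 100 truly deleterious and 100 truly neutral variants.
metrics = confusion_metrics(tp=69, fp=20, fn=31, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```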

Page 54: Bioinformatics: Introduction and Methods

Polymorphism Phenotyping (PolyPhen): a rule-based method

Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.

Changes in protein structure may affect protein function, which may lead to phenotype change.

PolyPhen predicts impact of amino acid allelic variants based on multi‐sequence alignment AND protein 3D structure features

Page 55: Bioinformatics: Introduction and Methods

PolyPhen

Page 56: Bioinformatics: Introduction and Methods

PolyPhen

1. Multi‐sequence alignment of homologous sequences

2. Structure-based characterization of the substitution site
• DISULFID, THIOLEST or THIOETH bond, BINDING site, ACTIVE site, etc.
• Whether the variant is located in transmembrane regions
• Whether the variant is located in coiled coil regions
• Whether the variant is located in signal peptide regions

Page 57: Bioinformatics: Introduction and Methods

PolyPhen

3. Get the protein 3D structure, or use homology modeling to predict its structure

4. Calculate the 3D structure features of the substitution site
• Secondary structure
• Solvent accessible surface area
• Φ, Ψ dihedral angles
• Normalized B-factor for the residue
• Loss of hydrogen bond
• Contacts with critical sites, ligands or other polypeptide chains

Page 58: Bioinformatics: Introduction and Methods

PolyPhen uses empirically derived rules to predict whether an nsSNP is damaging or benign

Page 59: Bioinformatics: Introduction and Methods

PolyPhen

Pros

Improved prediction accuracy when protein 3D structure is available

Cons

If a 3D structure is not available, it can only depend on the MSA.

The rules are empirical.

Page 60: Bioinformatics: Introduction and Methods

PolyPhen2

An improved version of PolyPhen, released in 2010

http://genetics.bwh.harvard.edu/pph2/

• Uses more predictive features
• Based on Naïve Bayes machine learning

Page 61: Bioinformatics: Introduction and Methods

Improved performance compared with PolyPhen

Page 62: Bioinformatics: Introduction and Methods

Unit 4: Classifier-based methods: SAPRED

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 63: Bioinformatics: Introduction and Methods

Formulate as a supervised classification problem

1. Training data: positive (+) and negative (-) examples
2. Compute structural attributes and sequence attributes
3. Attribute evaluation and subset selection (60 attributes in 10 groups)
4. Build an SVM classifier on the training data
5. Apply the classifier to newly identified SAPs

Page 64: Bioinformatics: Introduction and Methods

Single Amino acid Polymorphisms disease‐association Predictor (SAPRED)

Currently SAPRED supports two types of predictions:
• One is based on both the structural and sequence information.
• The other relies on the sequence information only.

The former aims at higher prediction accuracy and more attributes with putative biological insights, while the latter can work with more queries whose structural models are not available.

Page 65: Bioinformatics: Introduction and Methods

PDB – get protein 3D structure http://www.rcsb.org/pdb/home/home.do

Page 66: Bioinformatics: Introduction and Methods
Page 67: Bioinformatics: Introduction and Methods

Homology Modeling

http://swissmodel.expasy.org/

Page 68: Bioinformatics: Introduction and Methods

Homology Modeling

Page 69: Bioinformatics: Introduction and Methods

Biologically-Intuitive Attributes

Residue frequencies, conservation score,

Solvent accessibilities and Cβ density, secondary structure...

New attributes:

Structural neighbor profile

Nearby functional sites

Disordered regions

Hydrogen bonds change

β-aggregation

HLA family

Page 70: Bioinformatics: Introduction and Methods

Residue frequencies in MSA

LacI 5-38

Page 71: Bioinformatics: Introduction and Methods

Structural neighbor profile

Definition:

A 20-D vector: take the Cα of the SAP residue as the center and draw a sphere with a specific radius. The residues inside the sphere are counted to get the number for each of the 20 kinds of residues. Each number is a component of the vector.

$$N_{R,a_i} = \sum_{j=1}^{L} X_j$$

where $X_j = 1$ if residue $j$ is of type $a_i$ and $r_{j,c} < R$; otherwise, $X_j = 0$.

R: radius
L: protein length
a_i: a specific residue type
r_{j,c}: distance between residue j and the center residue
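A minimal sketch of the definition above, working from Cα coordinates. The input format (a list of residue letter and coordinate pairs), the function name, and the toy coordinates are all hypothetical. The sketch also assumes that the center residue itself is not counted.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def neighbor_profile(residues, center_index, radius):
    """Count residues of each type whose C-alpha lies within `radius` of the center.

    residues: list of (one_letter_code, (x, y, z)) tuples, one per residue.
    Returns 20 counts in the order of AMINO_ACIDS.
    """
    center = residues[center_index][1]
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for j, (aa, coord) in enumerate(residues):
        if j == center_index:
            continue  # assumption: the SAP residue itself is excluded
        if math.dist(coord, center) < radius:
            counts[aa] += 1
    return [counts[aa] for aa in AMINO_ACIDS]


# Toy structure with made-up coordinates (in Angstroms).
toy = [
    ("H", (0.0, 0.0, 0.0)),   # center residue
    ("L", (3.0, 0.0, 0.0)),   # 3 A away: inside
    ("L", (0.0, 4.0, 0.0)),   # 4 A away: inside
    ("K", (20.0, 0.0, 0.0)),  # 20 A away: outside
]
profile = neighbor_profile(toy, center_index=0, radius=10.0)
print(dict(zip(AMINO_ACIDS, profile)))
```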

Page 72: Bioinformatics: Introduction and Methods

Structural neighbor profile

The center is H128, radius is 10 Angstroms. Neighbors are: 42-47: LLICTY

50-52: AGT 55: I 59: V

106-110: LKTHL 112: T

125-127: KFL

129-131: VAR 176-177: HV 180-181: WW 184: K

188-194: QILFLFY 197: I 208: V 211: F

Page 73: Bioinformatics: Introduction and Methods

Structural neighbor profile: vector

a.a.  A  C  D  E  F  G  H  I  K  L
N     2  1  0  0  4  1  2  4  3  7

a.a.  M  N  P  Q  R  S  T  V  W  Y
N     0  0  0  1  1  0  4  4  2  2
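The vector can be reproduced by counting the residue letters in the neighbor segments listed for H128 on the previous slide. The short sketch below does only that counting; the variable names are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Neighbor segments of H128 within 10 Angstroms, as listed on the slide.
NEIGHBOR_SEGMENTS = [
    "LLICTY", "AGT", "I", "V", "LKTHL", "T", "KFL",
    "VAR", "HV", "WW", "K", "QILFLFY", "I", "V", "F",
]

counts = Counter("".join(NEIGHBOR_SEGMENTS))
profile = [counts[aa] for aa in AMINO_ACIDS]  # Counter returns 0 for absent residues

for aa, n in zip(AMINO_ACIDS, profile):
    print(aa, n)
```

The 38 neighboring residues give the same 20 counts as the table, for example 7 leucines and no aspartate.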

Page 74: Bioinformatics: Introduction and Methods

Structural neighbor profile

[Figure: Predictive power of different structural neighbor profiles. x-axis: radius (Å), 0 to 20; y-axis: overall accuracy, 0.66 to 0.78; series: wildtype profile, variant profile, profile difference.]

Different radii had different predictive power.

We selected 13 Angstroms as the optimal value of the radius.

Page 75: Bioinformatics: Introduction and Methods

Nearby functional sites

Functional sites such as ACT_SITE and METAL annotated in Swiss-Prot have intuitive biological insights.

SAPs located exactly on these sites would heavily disturb protein function, but such SAPs have only low coverage in the dataset.

We proposed that SAPs in the vicinity of functional sites are also more likely than others to affect protein function, which enlarged the coverage of these attributes in the dataset.

Page 76: Bioinformatics: Introduction and Methods

Nearby functional sites

Page 77: Bioinformatics: Introduction and Methods

Disordered Region

Of 122 SAPs in disordered regions, 114 (93%) are disease-associated.

From: http://ist.temple.edu/disprot/index.php

Page 78: Bioinformatics: Introduction and Methods

Hydrogen bond change

Changed hydrogen bonds   Disease   Polymorphism   Ratio
-6                       1         0              1/0
-5                       12        1              12
-4                       44        2              22
-3                       114       16             7.25
-2                       230       55             4.18
-1                       403       213            1.89
0                        1142      716            1.59
1                        224       142            1.58
2                        68        36             1.89
3                        11        4              2.75
4                        0         2              0
5                        0         2              0

Page 79: Bioinformatics: Introduction and Methods

Other attributes

Of 52 SAPs in transmembrane regions, 49 (94%) are disease-associated.

Of 194 SAPs that altered β-aggregation properties, 169 (87%) are disease-associated.

Of 435 SAPs from HLA families, all except one are “polymorphism”.

Page 80: Bioinformatics: Introduction and Methods

SVM classifier

• SVM: support vector machine
• Separates transformed data with a hyperplane in a high-dimensional space
• Kernel function: Radial Basis Function (RBF)
• Grid search to select proper values of the parameters

Page 81: Bioinformatics: Introduction and Methods

Support Vector Machine (SVM) Classifier -- Grid-search for parameters

log2C = 1; log2g = -7
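A grid search of this kind can be sketched as follows. The exponent ranges are an assumption (they follow a commonly used convention of stepping log2C and log2g by 2), and `toy_score` is a hypothetical stand-in for cross-validated accuracy, constructed to peak at the values reported on the slide. In practice the score would come from training and validating an RBF-kernel SVM at each grid point.

```python
import math
from itertools import product

# Assumed search grid over exponents of 2 (not taken from the slides).
LOG2C_GRID = range(-5, 16, 2)   # -5, -3, ..., 15
LOG2G_GRID = range(3, -16, -2)  # 3, 1, ..., -15


def grid_search(score, log2c_values, log2g_values):
    """Evaluate score(C=..., gamma=...) at every grid point and keep the best."""
    best = None
    for log2c, log2g in product(log2c_values, log2g_values):
        value = score(C=2.0 ** log2c, gamma=2.0 ** log2g)
        if best is None or value > best[0]:
            best = (value, log2c, log2g)
    return best


def toy_score(C, gamma):
    """Hypothetical surrogate for cross-validation accuracy (illustration only)."""
    return -((math.log2(C) - 1) ** 2 + (math.log2(gamma) + 7) ** 2)


_, best_log2c, best_log2g = grid_search(toy_score, LOG2C_GRID, LOG2G_GRID)
print(f"C = {2.0 ** best_log2c}, gamma = {2.0 ** best_log2g}")
```

With log2C = 1 and log2g = -7, the parameters are C = 2 and gamma = 0.0078125.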

Page 82: Bioinformatics: Introduction and Methods

Five-fold cross-validation

Part    Total proteins   Total SAP   Deleterious SAP   Neutral SAP
1       105              686         449               237
2       104              688         450               238
3       105              688         450               238
4       105              688         450               238
5       103              688         450               238
Total   522              3438        2249              1189
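The table lists a protein count per part, and those counts sum to the total, which suggests the split was made at the protein level. Assuming so, the sketch below shows one simple way to assign whole proteins to folds so that SAPs from the same protein never appear in both training and test data. The SAP identifiers are hypothetical, and this round-robin assignment does not balance the number of SAPs per fold the way the table does.

```python
def protein_level_folds(sap_to_protein, n_folds=5):
    """Assign whole proteins to folds; every SAP follows its protein."""
    proteins = sorted(set(sap_to_protein.values()))
    fold_of_protein = {protein: i % n_folds for i, protein in enumerate(proteins)}
    folds = [[] for _ in range(n_folds)]
    for sap, protein in sap_to_protein.items():
        folds[fold_of_protein[protein]].append(sap)
    return folds


# Hypothetical data: 10 proteins with 3 SAPs each.
toy_saps = {f"P{p}:variant{v}": f"P{p}" for p in range(10) for v in range(3)}
folds = protein_level_folds(toy_saps)

for i, fold in enumerate(folds, start=1):
    test_set = set(fold)
    train_set = set(toy_saps) - test_set
    print(f"part {i}: {len(test_set)} test SAPs, {len(train_set)} training SAPs")
```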

Page 83: Bioinformatics: Introduction and Methods

Accuracy: ACC and MCC

SAP status                 Predicted as disease-association (+)   Predicted as polymorphism (-)
Disease-association (+)    TP                                     FN
Polymorphism (-)           FP                                     TN

Overall accuracy:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

Matthews correlation coefficient:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}$$
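Both measures are straightforward to compute from the four counts. In this sketch the counts are made up for illustration, and returning 0.0 when the MCC denominator is zero is a common convention rather than something stated on the slide.

```python
import math


def acc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall accuracy: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denominator = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denominator == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


# Made-up counts for illustration.
print(acc(tp=40, tn=40, fp=10, fn=10))
print(mcc(tp=40, tn=40, fp=10, fn=10))
```

Unlike ACC, MCC stays informative when the two classes are of very different size, which matters here because deleterious SAPs outnumber neutral ones in the dataset.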

Page 84: Bioinformatics: Introduction and Methods

Predictive power

Page 85: Bioinformatics: Introduction and Methods

SAPRED web server

http://sapred.cbi.pku.edu.cn/

Page 86: Bioinformatics: Introduction and Methods

Run SAPRED

Page 87: Bioinformatics: Introduction and Methods

Results

Page 88: Bioinformatics: Introduction and Methods

Explanation of Results: Structural attributes

Page 89: Bioinformatics: Introduction and Methods

Explanation of Results: sequence attributes

Page 90: Bioinformatics: Introduction and Methods

Results using SAPRED_Seq

ACC=81.5% MCC=0.577

Page 91: Bioinformatics: Introduction and Methods

Unit 5: Support Vector Machine (SVM)

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 92: Bioinformatics: Introduction and Methods

Machine learning model

[Figure: training data (Var1, Var2, Var3, …, VarN) are used to build a model; the model makes predictions on new data.]

Methods: SVM, HMM, Bayesian, decision tree, neural network, random forest, ensemble learning, ……

Peking University

Page 93: Bioinformatics: Introduction and Methods


Classification Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in.

Page 94: Bioinformatics: Introduction and Methods

Introduction

An SVM is a supervised learning model that analyzes data and recognizes patterns, used for classification and regression analysis. It selects a small number of critical boundary instances called support vectors from each class and builds a linear discriminant function that separates them as widely as possible. SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Page 95: Bioinformatics: Introduction and Methods

What is a good Decision Boundary?

Consider a two-class, linearly separable classification problem.
• Many decision boundaries!
• Are all decision boundaries equally good?

Page 96: Bioinformatics: Introduction and Methods


Decision Boundary

Intuitively, the best hyperplane is the one that represents the largest separation, or margin, between the two classes, since the larger the margin is, the lower the generalization error of the classifier will be.

Page 97: Bioinformatics: Introduction and Methods


Support Vector The instances that are closest to the maximum‐margin hyperplane—the ones with the minimum distance to it—are called support vectors.

Page 98: Bioinformatics: Introduction and Methods

SVM - mathematics

A data point is denoted by $x$, which is an n-dimensional vector, and $y$ is 1 or -1 to represent the two different classes. The hyperplane is

$$w^T x + b = 0$$

So the classification function is

$$f(x) = w^T x + b$$

and

$$y = \begin{cases} 1, & f(x) > 0 \\ -1, & f(x) < 0 \end{cases}$$

Page 99: Bioinformatics: Introduction and Methods

SVM - mathematics

The confidence of a classification can be measured by the functional margin, which is $|f(x)|$, and whether the classification is right can be determined by the consistency of the signs of $f(x)$ and $y$. For a correctly classified point, $|f(x)| = y f(x)$. So the functional margin is:

$$\hat{\gamma} = y (w^T x + b) = y f(x)$$

The functional margin of a hyperplane is measured by

$$\hat{\gamma} = \min_i \hat{\gamma}_i$$

However, the functional margin can be scaled even if the hyperplane remains the same, for example, when $w$ and $b$ are changed into $2w$ and $2b$.

Page 100: Bioinformatics: Introduction and Methods

SVM - mathematics

An intuitive measurement can be obtained using the distance $r$ from the point to the hyperplane, which is called the geometrical margin:

$$r = \frac{f(x)}{\|w\|}, \qquad \tilde{\gamma} = y r = \frac{\hat{\gamma}}{\|w\|}$$

In this maximum margin classifier, we want to maximize $\tilde{\gamma}$. Because the functional margin is scalable, we can assume $\hat{\gamma} = 1$ without influencing the optimal result.
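The difference between the two margins is easy to see numerically. In the sketch below the hyperplane and the data point are made-up values; doubling w and b doubles the functional margin but leaves the geometrical margin unchanged.

```python
import math


def functional_margin(w, b, x, y):
    """y * (w . x + b): positive when the point is classified correctly."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def geometric_margin(w, b, x, y):
    """Functional margin divided by the norm of w: signed distance to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm_w


# Made-up hyperplane and data point.
w, b = [3.0, 4.0], -5.0
x, y = [3.0, 4.0], 1

w2, b2 = [2 * wi for wi in w], 2 * b  # same hyperplane, rescaled parameters

print(functional_margin(w, b, x, y), functional_margin(w2, b2, x, y))
print(geometric_margin(w, b, x, y), geometric_margin(w2, b2, x, y))
```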

Page 101: Bioinformatics: Introduction and Methods

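SVM - mathematics

So the objective function is

$$\max \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n$$

which is equivalent to

$$\min \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n$$

This is an optimization model with constraints, and it can be solved by quadratic programming.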

Page 102: Bioinformatics: Introduction and Methods

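SVM - mathematics

We can also solve this by Lagrange multipliers:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$$

Setting the partial derivatives to zero gives

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$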

Page 103: Bioinformatics: Introduction and Methods

SVM - mathematics

Finally the classification function can be rewritten as

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b$$

Page 104: Bioinformatics: Introduction and Methods

SVM - kernel

The linear learning machine has very limited ability in practice because of the complexity of the real world, which needs a more flexible hypothesis space. We can use a function ϕ to map x to a higher-dimensional space, in which the points can be linearly separable.

Page 105: Bioinformatics: Introduction and Methods

kernel

So the classification function can be extended as

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$$

Here we get the kernel function:

$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$

Page 106: Bioinformatics: Introduction and Methods

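kernel

Take the points in the picture for example: the two classes can be separated by a circle

$$a_1 X_1 + a_2 X_1^2 + a_3 X_2 + a_4 X_2^2 + a_5 X_1 X_2 + a_6 = 0$$

Then we can construct a 5-dimensional space, where

$$Z_1 = X_1, \quad Z_2 = X_1^2, \quad Z_3 = X_2, \quad Z_4 = X_2^2, \quad Z_5 = X_1 X_2$$

So the hyperplane in the new feature space is

$$\sum_{i=1}^{5} a_i Z_i + a_6 = 0$$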
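The sketch below illustrates the idea with an explicit feature map. It assumes the mapping Z = (X1, X1², X2, X2², X1·X2) and uses a unit circle centered at the origin as the class boundary; both the coefficient values and the test points are made up for illustration. In the 5-dimensional space the circle becomes a plain linear function of Z.

```python
def feature_map(x1: float, x2: float) -> tuple:
    """Map a 2-D point to the assumed 5-D feature space (Z1..Z5)."""
    return (x1, x1 ** 2, x2, x2 ** 2, x1 * x2)


# Unit circle x1^2 + x2^2 - 1 = 0 written as a hyperplane in Z-space:
# coefficients a1..a5 for (Z1, Z2, Z3, Z4, Z5), plus the constant a6.
A = (0.0, 1.0, 0.0, 1.0, 0.0)
A6 = -1.0


def decision(x1: float, x2: float) -> float:
    """Linear decision function in Z-space; its sign tells inside from outside."""
    z = feature_map(x1, x2)
    return sum(a * zi for a, zi in zip(A, z)) + A6


print(decision(0.2, 0.1))  # negative: inside the circle
print(decision(2.0, 0.0))  # positive: outside the circle
```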

Page 107: Bioinformatics: Introduction and Methods

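Kernel function

Linear kernel:

$$K(x_1, x_2) = \langle x_1, x_2 \rangle$$

Polynomial kernel:

$$K(x_1, x_2) = \left( \langle x_1, x_2 \rangle + R \right)^d$$

Gauss kernel:

$$K(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2\sigma^2} \right)$$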
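The three kernels can be written as short functions. This sketch uses the standard textbook forms; the parameter names and default values (offset r, degree d, width sigma) are assumptions, since the slide's own formulas did not survive extraction.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x1, x2):
    return dot(x1, x2)


def polynomial_kernel(x1, x2, r=1.0, d=2):
    """(<x1, x2> + r)^d; r and d are free parameters."""
    return (dot(x1, x2) + r) ** d


def gauss_kernel(x1, x2, sigma=1.0):
    """exp(-||x1 - x2||^2 / (2 sigma^2)); equals 1 when the points coincide."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2 * sigma ** 2))


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v))
print(gauss_kernel(u, v))
```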

Page 108: Bioinformatics: Introduction and Methods

SVM - example

[Figure: example classification with a linear kernel and with a Gauss kernel.]

Page 109: Bioinformatics: Introduction and Methods

Applications

SVM has been used successfully in many real-world problems:
• Bioinformatics (mutation classification, cancer classification)
• Text (and hypertext) categorization
• Image classification (different types of sub-problems)
• Hand-written character recognition

Page 110: Bioinformatics: Introduction and Methods

Pros and Cons

Pros
• With support vectors, the maximum-margin hyperplane is relatively stable.
• SVMs often produce very accurate classifiers because subtle and complex decision boundaries can be obtained.

Cons
• Compared with other methods, even the fastest training algorithms for support vector machines are slow when applied in the nonlinear setting.

Page 111: Bioinformatics: Introduction and Methods

Unit 6: Comparative Protein Structure Modeling of Genes And Genomes

Le Zhang, Ph. D. Computer Science Department Southwest University

Page 112: Bioinformatics: Introduction and Methods

Outline

• What is comparative protein structure modeling?
• Why could we do comparative modeling?
• Why is comparative modeling important?
• How to do comparative modeling?
  • Fold assignment and template selection
  • Target – template alignment
  • Model building
  • Model evaluation
• The application of comparative modeling
• Comparative modeling in structural genomics

Page 113: Bioinformatics: Introduction and Methods

1. What Is Comparative Protein Structure Modeling?

• Comparative protein structure modeling predicts the three‐ dimensional structure for a given protein sequence of unknown structure (target) on the basis of sequence similarity to proteins of known structure (the templates).

Page 114: Bioinformatics: Introduction and Methods

2. Why Could We Do Comparative Modeling?

• Small changes in the protein sequence usually result in small changes in its 3D structure. If similarity between two proteins is detectable at the sequence level, structural similarity can usually be assumed.

• The number of unique structural folds that proteins adopt is limited, and the number of experimentally determined new structures is increasing exponentially.

Page 115: Bioinformatics: Introduction and Methods

3. Why Is Comparative Modeling Important?

• It is an efficient way to obtain useful information about the proteins of interest.

• Designing mutants to test hypotheses about a protein’s function

• Identifying active and binding sites

• Identifying, designing and improving ligands for a given binding site

• Modeling substrate specificity

• Predicting antigenic epitopes

• Simulating protein–protein docking

• Inferring function from a calculated electrostatic potential around the protein

• Facilitating molecular replacement in x-ray structure determination

• Refining models based on NMR constraints

• Testing and improving a sequence-structure alignment

• Confirming a remote structural relationship

• Rationalizing known experimental observations

Page 116: Bioinformatics: Introduction and Methods

4. How To Do Comparative Modeling?

• Fold assignment and template selection

• Target – template alignment

• Model Building

• Model evaluation

Page 117: Bioinformatics: Introduction and Methods

4.1 Fold Assignment And Template Selection

• Three main classes of protein comparison methods:

1. Comparing the target sequence with each of the database sequences independently. Programs: BLAST, FASTA, etc.

2. Using multiple sequence comparisons to improve the sensitivity of the search. Programs: PSI-BLAST, etc.

*especially useful when the sequence identity is below 25%

3. Threading or 3D template matching methods.

*especially useful when there are no sequences clearly related to the modeling target.

Page 118: Bioinformatics: Introduction and Methods

4.1 Fold Assignment And Template Selection

• Template selection :

Factors to consider: higher sequence similarity to the target, the protein family, the quality of the template structure, and conditions such as solvent, pH, and ligands.

• Potential problems:

Distantly related proteins used as templates (i.e., less than 25% sequence identity) may produce an unreliable model.

Page 119: Bioinformatics: Introduction and Methods

4.1 Fold Assignment And Template Selection

• The databases and Programs you may use in this step:

[Table not reproduced in the transcript. Table footnotes: S = server, P = program; some of the sites are mirrored on additional computers; commercial vendors: (a) MolSoft Inc., San Diego, (b) Molecular Simulations Inc., San Diego, (c) Tripos Inc., St Louis, (d) ProCeryon Biosciences Inc., New York.]

Page 120: Bioinformatics: Introduction and Methods

4.2 Target – Template Alignment

• Once templates have been selected, a specialized method should be used to align the target sequence with the template structures. Programs: CLUSTAL, etc.

• The alignment becomes difficult in the “twilight zone” of less than 30% sequence identity. (Only 20% of the residues are likely to be correctly aligned when two proteins share 30% sequence identity.)

BLOSUM62 is built from sequence blocks clustered at 62% identity; BLOSUM45 and BLOSUM80 are built analogously at ~45% and ~80%.
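Sequence identity, the quantity behind the 25% and 30% thresholds, can be computed from a pairwise alignment as sketched below. Several definitions of the denominator are in use; this sketch counts only columns where neither sequence has a gap, which is an assumption. The two aligned sequences are made up for illustration.

```python
TWILIGHT_ZONE_THRESHOLD = 30.0  # percent identity, from the slide


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues among ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def in_twilight_zone(identity: float) -> bool:
    """True when the alignment is likely to be unreliable."""
    return identity < TWILIGHT_ZONE_THRESHOLD


# Made-up pairwise alignment with one gap in each sequence.
target = "MKT-AYIAKQR"
template = "MKSLAY-AKNR"
identity = percent_identity(target, template)
print(f"{identity:.1f}% identity, twilight zone: {in_twilight_zone(identity)}")
```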

Page 121: Bioinformatics: Introduction and Methods

4.2 Target – Template Alignment

• In difficult cases, it is frequently beneficial to rely on multiple structure and sequence information. The information from structures helps to avoid gaps in secondary structure elements, in buried regions, or between two residues that are far in space.

• Potential problems: Although you can use the methods aforementioned, misalignment may occur especially when the target‐template sequence identity decreases below 30%.

Page 122: Bioinformatics: Introduction and Methods

4.2 Target – Template Alignment

• Programs and World Wide Web servers you may use in this step:

[Table not reproduced in this transcript. Footnotes: (a) S = server, P = program. (b) Some of the sites are mirrored on additional computers. (c) Vendors: MolSoft Inc., San Diego; Molecular Simulations Inc., San Diego; Tripos Inc., St Louis; ProCeryon Biosciences Inc., New York.]

Page 123: Bioinformatics: Introduction and Methods

4.3 Model Building

• Three classes of methods can be used to construct a 3D model:

1. Modeling by Assembly of Rigid Bodies

Assemble a model from a small number of rigid bodies obtained from aligned protein structures.

2. Modeling by Segment Matching or Coordinate Reconstruction

Use a subset of atomic positions from template structures as "guiding" positions, then identify and assemble short, all-atom segments that fit these guiding positions.

3. Modeling by Satisfaction of Spatial Restraints

Generate many constraints or restraints on the structure of the target sequence, using its alignment to related protein structures as a guide.
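The third class can be illustrated with a toy example: given target distances between pairs of points, move the points until the distances are satisfied. This is only a sketch under strong assumptions. The restraints here are plain harmonic distance terms, the optimizer is simple gradient descent, and the function names are hypothetical. Real restraint-based modeling programs use many more restraint types and more capable optimizers.

```python
import math


def restraint_violation(coords, restraints):
    """Sum of squared deviations from the target distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - target) ** 2
        for i, j, target in restraints
    )


def satisfy_restraints(coords, restraints, step=0.05, iterations=2000):
    """Reduce restraint violations by plain gradient descent on 3D coordinates."""
    coords = [list(p) for p in coords]
    for _ in range(iterations):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i, j, target in restraints:
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction undefined for coincident points
            factor = 2.0 * (d - target) / d
            for k in range(3):
                diff = coords[i][k] - coords[j][k]
                grads[i][k] += factor * diff
                grads[j][k] -= factor * diff
        for point, grad in zip(coords, grads):
            for k in range(3):
                point[k] -= step * grad[k]
    return coords


# Three points restrained to be 1.0 apart from each other (toy values).
start = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.5, 1.5, 0.0)]
restraints = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0)]
final = satisfy_restraints(start, restraints)
print(restraint_violation(start, restraints), restraint_violation(final, restraints))
```

In comparative modeling the target distances would be derived from the template structures through the alignment; here they are set by hand.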

Page 124: Bioinformatics: Introduction and Methods

4.3 Model Building

• Programs and World Wide Web servers you may use in this step:

[Table not reproduced in this transcript. Footnotes: (a) S = server, P = program. (b) Some of the sites are mirrored on additional computers. (c) Vendors: MolSoft Inc., San Diego; Molecular Simulations Inc., San Diego; Tripos Inc., St Louis; ProCeryon Biosciences Inc., New York.]

Page 125: Bioinformatics: Introduction and Methods

4.3.1 Loop Modeling

• Loops often determine the functional specificity of a given protein framework. They contribute to active and binding sites.

• Loop modeling can be seen as a mini protein folding problem, but loops are generally too short to provide sufficient information about their local fold.

• Three methods:

1) Ab initio methods

2) Database search techniques

3) Combinations of both

Page 126: Bioinformatics: Introduction and Methods

4.3.2 Sidechain Modeling

• Side chain conformations are predicted from similar structures and from steric or energetic considerations.

• Disulfide bridges are modeled using structural information from proteins in general and from equivalent disulfide bridges in related structures.

• Two effects on sidechain conformation:

1) The coupling between the main chain and side chains

2) The continuous nature of the distributions of side-chain dihedral angles

• Three different side-chain prediction methods:

1) The packing of backbone-dependent rotamers

2) The self-consistent mean-field approach to positioning rotamers based on their van der Waals interactions

3) The segment-matching method of Levitt
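Side-chain conformations are usually described by dihedral angles such as χ1, which for most residues is defined by the atoms N, Cα, Cβ and Cγ. The following sketch shows how a dihedral angle can be computed from four atomic positions. The helper names are hypothetical, the coordinates in the usage lines are invented, and sign conventions should be checked against whatever software the values are compared with.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) about the p1-p2 bond for four points."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy geometry: the central bond lies along the z axis.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # cis arrangement
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))  # trans arrangement
```

Rotamer libraries discretize the continuous distribution of such angles into a small set of preferred values.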

Page 127: Bioinformatics: Introduction and Methods

4.3.3 Potential Problems

• According to a survey of the accuracy of three side-chain modeling methods, only approximately 50% of χ1 angles and 35% of both χ1 and χ2 angles are predicted correctly.

• Segments of the target sequence that have no equivalent region in the template structure (i.e., insertions or loops) are the most difficult regions to model, especially when the insertion is more than 9 residues long.

• In some correctly aligned segments of a model, the template is locally different (<3 Å) from the target, resulting in errors in that region.

• As the sequences diverge, the packing of side chains in the protein core may change.

Page 128: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

• Typical errors in comparative models :

1. Errors in side-chain packing

2. Distortions and shifts in correctly aligned regions

3. Errors in regions without a template

4. Errors due to misalignments

5. Incorrect templates

Page 129: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

• Typical errors in comparative models: [figure not reproduced in this transcript]

Page 130: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

The criteria of evaluation

Having the correct fold or not

The target‐template sequences similarity

Distributions of many spatial features

The environment

Having good stereochemistry or not

Page 131: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

1) Having the correct fold or not

A model will have the correct fold if the correct template is picked and if that template is aligned at least approximately correctly with the target sequence.

The fold of a model can be assessed by a high sequence similarity with the closest template, an energy-based Z-score, or by conservation of the key functional or structural residues in the target sequence.

2) The target-template sequence similarity

Sequence identity above 30% is a relatively good predictor of the expected accuracy.
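The energy-based Z-score mentioned under criterion 1 compares the energy of the model with the energies of a reference set of alternative structures. A minimal sketch follows. The choice of reference set (for example, decoys or the same sequence threaded onto other folds) is an assumption that depends on the evaluation program, and the numbers are invented.

```python
import statistics


def energy_z_score(model_energy, reference_energies):
    """Z-score of the model energy relative to a reference distribution."""
    mean = statistics.mean(reference_energies)
    spread = statistics.pstdev(reference_energies)  # population standard deviation
    if spread == 0:
        raise ValueError("reference energies have zero spread")
    return (model_energy - mean) / spread


# Invented energies in arbitrary units; lower means more favorable.
reference = [-10.0, -12.0, -8.0, -11.0, -9.0]
print(energy_z_score(-17.0, reference))
```

With a lower-is-better energy, a strongly negative Z-score suggests that the model is much more favorable than the reference structures, which supports a correct fold.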

Page 132: Bioinformatics: Introduction and Methods

Average model accuracy as a function of the template‐target sequences similarity

4.4 Model Evaluation

EDN: human eosinophil neurotoxin, a ribonuclease with 3 α-helices and 2 three-stranded antiparallel β-sheets arranged in a single domain.

CRABPI: mouse cellular retinoic acid binding protein I, a single-domain protein composed of interacting α-helices packed at the edge of two orthogonal, 4- and 6-stranded antiparallel β-sheets. For the CRABPI model, 90% of Cα atoms superpose within 3.5 Å of their counterparts in the X-ray structure; the rms error is 1.31 Å.

NM23H2: human nucleoside diphosphate kinase, a single-domain protein consisting of a central 4-stranded antiparallel β-sheet surrounded by 8 α-helices. For the NM23H2 model, all but one of the Cα atoms superpose within 3.5 Å of the X-ray structure; the rms difference is 0.41 Å.

Page 133: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

Average model accuracy as a function of the template-target sequence similarity

[Figure not reproduced in this transcript. Solid line: sample models. Dotted line: corresponding actual structures.]

Percentage structure overlap is defined as the fraction of equivalent residues. Two residues are equivalent when their Cα atoms are within 3.5 Å of each other upon rigid-body, least-squares superposition of the two structures.
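The overlap definition above translates directly into code. The sketch below assumes the two sets of Cα coordinates have already been superposed by a rigid-body, least-squares fit and are listed in the same residue order. The superposition itself is not shown, and the function names and coordinates are hypothetical.

```python
import math


def structure_overlap(model_ca, native_ca, cutoff=3.5):
    """Percentage of residue pairs whose Cα atoms lie within `cutoff` Å."""
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    within = sum(1 for a, b in zip(model_ca, native_ca) if math.dist(a, b) <= cutoff)
    return 100.0 * within / len(model_ca)


def rmsd(model_ca, native_ca):
    """Root-mean-square deviation between paired, pre-superposed coordinates."""
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(model_ca, native_ca))
    return math.sqrt(total / len(model_ca))


# Invented coordinates in Å: pair deviations are 1, 0, 4 and 0.
model = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
native = [(0, 0, 1), (1, 0, 0), (2, 4, 0), (3, 0, 0)]
print(structure_overlap(model, native), rmsd(model, native))
```

The two measures complement each other: one badly placed residue can dominate the RMSD while changing the overlap percentage only slightly.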

Page 134: Bioinformatics: Introduction and Methods

3) The environment

Example: some calcium‐binding proteins undergo large conformational changes when bound to calcium. If a calcium‐free template is used to model the calcium‐bound state of the target, it is likely that the model will be incorrect.

4) Having good stereochemistry or not

Including bond lengths, bond angles, peptide bond and side‐chain ring planarities, chirality, main‐chain and side‐chain torsion angles, and clashes between nonbonded pairs of atoms.

5) Distributions of many spatial features

Such features include packing, formation of a hydrophobic core, residue and atomic solvent accessibilities, spatial distribution of charged groups, distribution of atom-atom distances, atomic volumes, and main-chain hydrogen bonding.

4.4 Model Evaluation

Page 135: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

• There are also methods for testing 3D models that implicitly take into account many of the criteria listed above. These methods are based on 3D profiles and statistical potentials of mean force.

• A physics‐based approach to deriving energy functions has been tested for use in protein structure evaluation (1999).

Page 136: Bioinformatics: Introduction and Methods

4.4 Model Evaluation

• Programs and World Wide Web servers you may use in this step:

[Table not reproduced in this transcript. Footnotes: (a) S = server, P = program. (b) Some of the sites are mirrored on additional computers. (c) Vendors: MolSoft Inc., San Diego; Molecular Simulations Inc., San Diego; Tripos Inc., St Louis; ProCeryon Biosciences Inc., New York.]

Page 137: Bioinformatics: Introduction and Methods

5. The Application of Comparative Modeling

• Three levels of model accuracy and some of the corresponding applications

Low accuracy: <30% sequence identity; less than 50% of the Cα atoms within 3.5 Å of their correct positions

Middle accuracy: 30-50% sequence identity; 85% of the Cα atoms within 3.5 Å of their correct positions

High accuracy: >50% sequence identity; accuracy approaches that of low resolution X-ray structures or medium resolution NMR structures

For scale: rw (van der Waals radius) of a C atom = 1.70 Å
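The three levels can be written as a small lookup. This is only a sketch: the function name is hypothetical, and assigning exactly 30% and exactly 50% identity to the middle level is an assumption, since the slides give the ranges as <30%, 30-50% and >50%. The uses listed are those named on the application slides.

```python
def accuracy_level(identity_percent: float) -> str:
    """Map target-template sequence identity (%) to an expected accuracy level."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    if identity_percent < 30.0:
        return "low"
    if identity_percent <= 50.0:
        return "middle"
    return "high"


# Typical uses per level, as listed on the application slides.
TYPICAL_USE = {
    "low": "confirm or reject a match between remotely related proteins",
    "middle": "refine functional predictions; design site-directed mutants",
    "high": "dock small ligands or whole proteins onto the model",
}

for identity in (22.0, 41.0, 78.0):
    level = accuracy_level(identity)
    print(identity, level, TYPICAL_USE[level])
```

Sequence identity is only a predictor of the expected accuracy, so a model near a boundary should still be evaluated individually.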

Page 138: Bioinformatics: Introduction and Methods

5. The Application of Comparative Modeling

• Applications 1: low accuracy models

• <30% sequence identity, having the correct fold

• Less than 50% of their Cα atoms within 3.5 Å of their correct positions

• Use: To confirm or reject a match between remotely related proteins

Page 139: Bioinformatics: Introduction and Methods

5. The Application of Comparative Modeling

• Applications 2: middle accuracy models

• 30-50% sequence identity

• 85% of their Cα atoms within 3.5 Å of their correct positions

• Use: Refining sequence-based functional predictions, constructing site-directed mutants with altered or destroyed binding capacity, and other problems

Page 140: Bioinformatics: Introduction and Methods

5. The Application of Comparative Modeling

• Applications 3: high accuracy models

• >50% sequence identity

• The average accuracy of these models approaches that of low resolution X-ray structures (3 Å resolution) or medium resolution NMR structures (10 distance restraints per residue)

• Use: For docking of small ligands or whole proteins onto a given protein.

Page 141: Bioinformatics: Introduction and Methods

6. Comparative modeling in structural genomics

• The aim of structural genomics is to determine or accurately predict the 3D structure of all the proteins encoded in the genomes.

• This aim will be achieved by a focused, large‐scale determination of protein structures by X‐ray crystallography and NMR spectroscopy, combined efficiently with accurate protein structure modeling techniques.

Page 142: Bioinformatics: Introduction and Methods

6. Comparative modeling in structural genomics

• For comparative modeling to contribute to structural genomics, automation of all the steps in the modeling process is essential.

• The automation of large‐scale comparative modeling involves assembling a software pipeline that consists of modules for fold assignment, template selection, target–template alignment, model generation, and model evaluation.

Page 143: Bioinformatics: Introduction and Methods

• Two examples of large‐scale comparative modeling for complete genomes:

SWISS-MODEL: The sequences encoded in the E. coli genome have been used to build models for 10-15% of the proteins using the SWISS-MODEL web server.

MODPIPE: MODPIPE produced models for five prokaryotic and eukaryotic genomes. This calculation resulted in models for substantial segments of 17.2%, 18.1%, 19.2%, 20.4%, and 15.7% of all proteins in the genomes of Saccharomyces cerevisiae (6218 proteins in the genome), Escherichia coli (4290 proteins), Mycoplasma genitalium (468 proteins), Caenorhabditis elegans (7299 proteins, incomplete), and Methanococcus jannaschii (1735 proteins), respectively.
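The MODPIPE percentages can be turned into approximate protein counts with simple arithmetic. The sketch below assumes that the percentages are listed in the same order as the genomes. Because the published percentages are rounded, the counts are estimates only.

```python
# (genome, proteins in genome, percent of proteins with a model), from the slide.
MODPIPE_COVERAGE = [
    ("Saccharomyces cerevisiae", 6218, 17.2),
    ("Escherichia coli", 4290, 18.1),
    ("Mycoplasma genitalium", 468, 19.2),
    ("Caenorhabditis elegans", 7299, 20.4),
    ("Methanococcus jannaschii", 1735, 15.7),
]


def modeled_count(n_proteins: int, percent: float) -> int:
    """Approximate number of proteins with a model, rounded to an integer."""
    return round(n_proteins * percent / 100.0)


for genome, n_proteins, percent in MODPIPE_COVERAGE:
    print(f"{genome}: about {modeled_count(n_proteins, percent)} of {n_proteins} proteins")
```

For example, 17.2% of the 6218 yeast proteins corresponds to roughly 1069 proteins with at least one modeled segment.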

6. Comparative modeling in structural genomics

Page 144: Bioinformatics: Introduction and Methods

• Large‐scale comparative modeling will extend opportunities to tackle a myriad of problems by providing many protein models for many genomes.

Protein evolution

Drug design

A facile comparison of ligand binding requirements and substitutions in and around important residues ...

A specific example:

The selection of a target protein for drug development

6. Comparative modeling in structural genomics

Page 145: Bioinformatics: Introduction and Methods

7. Conclusion

• Over the past few years, there has been a gradual increase in both the accuracy of comparative models and the fraction of protein sequences that can be modeled with useful accuracy.

• Further advances are necessary in recognizing weak sequence-structure similarities, aligning sequences with structures, modeling of rigid body shifts, distortions, loops and side chains, as well as detecting errors in a model.

• It is currently possible to model with useful accuracy significant parts of approximately one third of all known protein sequences.

• A major new challenge for comparative modeling is its integration with the torrents of data from genome sequencing projects as well as from functional and structural genomics.

Page 146: Bioinformatics: Introduction and Methods

References

• Martí-Renom M A, Stuart A C, Fiser A, et al. Comparative protein structure modeling of genes and genomes[J]. Annual Review of Biophysics and Biomolecular Structure, 2000, 29(1): 291-325.

• Šali A, Potterton L, Yuan F, et al. Evaluation of comparative protein modeling by MODELLER[J]. Proteins: Structure, Function, and Bioinformatics, 1995, 23(3): 318-326.

• Fiser A, Do R K G, Šali A. Modeling of loops in protein structures[J]. Protein Science, 2000, 9(9): 1753-1773.

• Sánchez R, Šali A. Comparative protein structure modeling in genomics[J]. Journal of Computational Physics, 1999, 151(1): 388-401.
