Computational Proteomics: Structure/Function Prediction & the Protein Interactome

Jaime Carbonell ([email protected]), with Betty Cheng, Yan Liu, Eric Xing, Yanjun Qi, Judith Klein-Seetharaman, and Oznur Tastan

Carnegie Mellon University, Pittsburgh, PA, USA

December, 2008

© 2003, Jaime Carbonell

Simplified View of Biology

Nobelprize.org

Protein sequence

Protein structure


Primary Sequence: MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA

3D Structure

Folding

Complex function within network of proteins

Normal

PROTEINS: Sequence → Structure → Function

(Borrowed from: Judith Klein-Seetharaman)


Primary Sequence: MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA

3D Structure

Folding

Complex function within network of proteins

Disease

PROTEINS: Sequence → Structure → Function


Motivation: Protein Structure and Function Prediction

• Ultimate goal: Sequence → Function
– … and Function → Sequence (drug design, …)
– Potential active binding sites are a good start, but how about stability, external accessibility, energetics, …

• Intermediate goal: Sequence → Structure
– Only 1.2% of proteins have been structurally resolved
– What-if analysis (precursor of mutagenesis experiments)

• Machine Learning & Language Technologies methods
– Power tools to model and predict structure & function
– Computational biology challenges are starting to drive new research in Machine Learning & Language Technologies


OUTLINE

• Motivation: sequence → structure → function

• Vocabulary-based classification approaches (Betty Cheng, Jaime Carbonell, Judith Klein-Seetharaman)
– GPCR subfamily classification
– Protein-protein coupling specificity

• Solving the “Folding Problem”: Machine Learning Approaches to Structure Prediction (Yan Liu, Jaime Carbonell, et al.)
– Tertiary folds: β-helix prediction via segmented CRFs
– Quaternary folds: viral adhesin and capsid complexes

• Conclusions and future directions


GPCR Superfamily: G-Protein Coupled Receptors

• Transmembrane protein

• Target of 60% of drugs (Moller, 2002)

• Involved in cancer, cardiovascular disease, Alzheimer’s and Parkinson’s diseases, stroke, diabetes, and inflammatory and respiratory diseases

[Figure: GPCR topology in the membrane, showing the N-terminus, extracellular loops, transmembrane helices I-VII, intracellular loops, and the C-terminus]


Protein Family & Subfamily Classification (applied to GPCRs)

Subfamily classification based on pharmaceutical properties


Comparative Study – Karchin et al., 2002

[Figure: classifiers ordered by complexity]

Complex: Support Vector Machines, Neural Nets, Clustering

Hidden Markov Models

K-Nearest Neighbours, BLAST

Simple: Decision Trees, Naïve Bayes

SVM is the best for subfamily classification (Karchin et al., 2002)

Traditionally, hidden Markov models, k-nearest neighbours and BLAST have been used.

Recently, more complicated classifiers have been used.

Karchin et al. (2002) studied a range of classifiers of varied complexity in GPCR subfamily classification.

But what about those simple classifiers at the other end of the scale?

Hypothesis: Bio-vocabulary selection is crucial for subfamily classification (and protein-protein interaction prediction)


Study “segments” with different vocabulary

AA, chemical groups, properties of AA


Computing Chi-Square

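$$\chi^2(x) = \sum_{c \in C} \frac{[e(c,x) - o(c,x)]^2}{e(c,x)}, \qquad e(c,x) = \frac{n_c \, t_x}{N}$$

where

– o(c,x): observed # of sequences in class c with feature x

– e(c,x): expected # of sequences in class c with feature x

– N: total # of sequences

– t_x: # of sequences with feature x

– n_c: # of sequences in class c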
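The chi-square score can be sketched in a few lines of Python. This is a minimal illustration, assuming each feature is a substring (n-gram) whose presence in a sequence is tested with `in`; the function name `chi_square_scores` and the toy sequences and labels are hypothetical, not taken from the study.

```python
from collections import Counter

def chi_square_scores(sequences, labels, features):
    """Score each feature x by sum over classes c of (e(c,x) - o(c,x))^2 / e(c,x)."""
    total = len(sequences)                 # N: total number of sequences
    class_sizes = Counter(labels)          # n_c: number of sequences in class c
    scores = {}
    for x in features:
        has_x = [x in seq for seq in sequences]
        t_x = sum(has_x)                   # t_x: number of sequences with feature x
        # o(c, x): observed number of sequences in class c that contain x
        observed = Counter(c for c, present in zip(labels, has_x) if present)
        score = 0.0
        for c, n_c in class_sizes.items():
            expected = n_c * t_x / total   # e(c, x) = n_c * t_x / N
            if expected > 0:
                score += (expected - observed[c]) ** 2 / expected
        scores[x] = score
    return scores

# Hypothetical toy data: "MN" separates the classes, "M" much less so.
toy_sequences = ["MNGT", "MNGS", "AAYM", "AAYF"]
toy_labels = ["A", "A", "B", "B"]
scores = chi_square_scores(toy_sequences, toy_labels, ["MN", "M"])
```

Features would then be ranked by score and the top-scoring ones kept as the vocabulary.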


Level I Subfamily Optimization

[Figure: accuracy vs. number of features for Decision Trees and Naïve Bayes, using binary features and n-gram counts]


Level I Subfamily Results

Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 5500-7700 | Binary | 93.0 %
Naïve Bayes | 3300-6900 | N-gram counts | 90.6 %
Naïve Bayes | All (9702) | N-gram counts | 90.0 %
SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4 %
BLAST | | Local sequence alignment | 83.3 %
Decision Tree | 900-2800 | Binary | 77.3 %
Decision Tree | 700-5600 | N-gram counts | 77.3 %
Decision Tree | All (9723) | N-gram counts | 77.2 %
SAM-T2K HMM | | An HMM model built for each protein subfamily | 69.9 %
kernNN | 9 per match state in the HMM | | 64.0 %


Level II Subfamily Results

Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 8100 | Binary | 92.4 %
SVM | 9 per match state in the HMM | | 86.3 %
Naïve Bayes | 5600 | N-gram counts | 84.2 %
SVMtree | 9 per match state in the HMM | | 82.9 %
Naïve Bayes | All (9702) | N-gram counts | 81.9 %
BLAST | | Local sequence alignment | 74.5 %
Decision Tree | 1200 | N-gram counts | 70.8 %
Decision Tree | 2300 | Binary | 70.2 %
SAM-T2K HMM | | An HMM model built for each protein subfamily | 70.0 %
Decision Tree | All (9723) | N-gram counts | 66.0 %
kernNN | 9 per match state in the HMM | | 51.0 %

Helix 3 and 7 known to be important for signal transduction

Top 20 selected “words” for Class B GPCRs. They correlate with identified motifs.

Loop 1 is suspected common binding site


Generalization to Other Superfamilies: Nuclear Receptors

Dataset | Feature Type | # of Features | Validation Accuracy | Testing Accuracy
Family | Binary | 1500-4200 | 96.96% | 94.53%
Family | N-gram counts | 400-4900 | 95.75% | 91.79%
Level I Subfamily | Binary | 1500-3100 | 98.09% | 97.77%
Level I Subfamily | N-gram counts | 500-1100 | 93.95% | 91.40%
Level II Subfamily | Binary | 1500-2100 | 95.32% | 93.62%
Level II Subfamily | N-gram counts | 3100-5600 | 86.39% | 85.54%


G-Protein Coupling Specificity Problem

• Predict which one or more families of G-proteins a GPCR can couple with, given the GPCR sequence

• Locate regions in the GPCR sequence where the majority of coupling specificity information lies

G-Protein Family | Function
Gs | Activates adenylyl cyclase
Gi/o | Inhibits adenylyl cyclase
Gq/11 | Activates phospholipase C
G12/13 | Unknown


N-gram Based Component

• Extract n-grams from all possible reading frames

• Use a set of binary k-NN, one for each G-protein family to predict whether the receptor couples to the family

• Predict coupling if k-NN outputs a probability higher than trained threshold

[Flowchart: Test Sequence (MGNASNDSQSEDCETRQWLPPGESPAI …) → counts of all n-grams → K-NN Classifier → Pr(coupling to family C) ≥ threshold? Yes: predict coupling to family C; No: predict no coupling to family C]
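The n-gram component can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the distance measure or the n-gram lengths, so cosine similarity over counts and n = 1 to 3 are assumptions, the coupling probability is taken to be the fraction of the k nearest neighbours that couple, and all function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(seq, n_values=(1, 2, 3)):
    """Count n-grams starting at every offset, i.e. in all reading frames."""
    counts = Counter()
    for n in n_values:
        for start in range(len(seq) - n + 1):
            counts[seq[start:start + n]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two count vectors (assumed similarity measure)."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_coupling_probability(test_seq, training, k):
    """training: list of (sequence, couples_to_family) pairs for one G-protein family."""
    query = ngram_counts(test_seq)
    ranked = sorted(training, key=lambda item: cosine(query, ngram_counts(item[0])), reverse=True)
    nearest = ranked[:k]
    return sum(1 for _, couples in nearest if couples) / len(nearest)

def predict_coupling(test_seq, training, k, threshold):
    """Binary decision for one family: probability at or above the trained threshold."""
    return knn_coupling_probability(test_seq, training, k) >= threshold

# Hypothetical toy training set for one family.
toy_training = [("MGNASND", True), ("MGNASNE", True), ("KKKKKKK", False)]
toy_probability = knn_coupling_probability("MGNASNDS", toy_training, k=2)
```

One such binary classifier would be trained per G-protein family, each with its own threshold.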


Alignment-Based Component

• A set of binary k-NN, one for each G-protein family to predict whether the receptor couples to the family

• Predict coupling if more than x% of retrieved sequences couple to the family

• 2 parameters:

– Number of neighbours, K
– Threshold x%


[Flowchart: Test Sequence → BLAST → K1 most similar sequences (e.g. MDNTSNDSQSENREEPLWLPSGESPAIS …, MDNFLNDSKLMEDCKSRQWLLSGESPAI …, MNESYRCQTSTWVERGSSATMGAVLFG …) → x% of the K1 sequences couple to family C? Yes / No]




Our Hybrid Method: Combining Alignment and N-grams

[Flowchart: Test Sequence → BLAST K-NN (x% = 100%) → Yes / No → N-gram K-NN → Yes / No → Predict no coupling to family C]


Evaluation Metrics & Dataset

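$$\text{Recall} = \frac{A}{A + C}, \qquad \text{Precision} = \frac{A}{A + B}, \qquad \text{Accuracy} = \frac{A + D}{A + B + C + D}, \qquad F_1 = \frac{2PR}{P + R}$$

where P is precision, R is recall, and A-D are the cells of the confusion matrix:

Predict \ Truth | Couplings | Non-Couplings
Couplings | A | B
Non-Couplings | C | D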

(Cao et al., 2003)

81.3% training set

Same test set
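As a quick check of the metric definitions, here is a small Python sketch that computes them from the four confusion-matrix cells A, B, C and D. The function name `coupling_metrics` and the example counts are made up for illustration and are not results from the dataset.

```python
def coupling_metrics(a, b, c, d):
    """a: predicted and true couplings, b: predicted couplings that are not true,
    c: true couplings that were missed, d: correctly predicted non-couplings."""
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical counts, chosen only to exercise the formulas.
precision, recall, accuracy, f1 = coupling_metrics(a=8, b=2, c=4, d=6)
```

F1 is the harmonic mean of precision and recall, so it can equivalently be written as 2A / (2A + B + C).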


Results on Cao et al. Dataset

Method | N-gram Threshold | Prec | Recall | F1
Hybrid | 0.66 | 0.698 | 0.952 | 0.805
N-gram | 0.34 | 0.658 | 0.794 | 0.719
Cao et al. | | 0.577 | 0.889 | 0.700

Method | Max | Prec | Recall | F1
Whole Seq Alignment | F1 | 0.779 | 0.841 | 0.809
Hybrid | F1 | 0.775 | 0.873 | 0.821
Whole Seq Alignment | Precision | 0.793 | 0.730 | 0.760
Hybrid | Precision | 0.803 | 0.778 | 0.790

• Suggests n-grams contain information not found in alignment

• Hybrid method outperformed Cao et al. in precision, recall and F1

• Suggests alignment contains information not found in n-grams


Feature Selection of N-grams

• Pre-processing step to remove noisy or redundant features that may confuse classifier

• Many feature selection algorithms available

• Chi-square was used because of success in GPCR subfamily classification


[Flowchart: Test Sequence → counts of all n-grams → Chi-Square Feature Selection → selected n-gram counts → K-NN Classifier → Pr(coupling to family C) ≥ threshold? Yes / No]




IC Domain Combination Analysis

IC | Prec | Rec | F1 | Acc
1 | 0.782 | 0.703 | 0.739 | 0.796
2 | 0.820 | 0.799 | 0.808 | 0.845
3 | 0.661 | 0.721 | 0.682 | 0.730
4 | 0.632 | 0.755 | 0.670 | 0.694
1, 2 | 0.820 | 0.805 | 0.811 | 0.847
1, 3 | 0.799 | 0.765 | 0.780 | 0.825
1, 4 | 0.780 | 0.755 | 0.765 | 0.807
2, 3 | 0.837 | 0.825 | 0.828 | 0.861
2, 4 | 0.828 | 0.816 | 0.821 | 0.853
3, 4 | 0.773 | 0.807 | 0.788 | 0.821
1, 2, 3 | 0.822 | 0.814 | 0.816 | 0.850
1, 2, 4 | 0.807 | 0.809 | 0.807 | 0.843
1, 3, 4 | 0.792 | 0.807 | 0.797 | 0.832
2, 3, 4 | 0.839 | 0.820 | 0.828 | 0.861
1, 2, 3, 4 | 0.824 | 0.813 | 0.817 | 0.853

• Of the 4 domains, 2nd domain yielded best F1 followed by 1st, 3rd and 4th domains

• Most information in IC1 already found in IC2


Tertiary Protein Fold Prediction

• Protein function is strongly modulated by structure

• Predicting folds, domains and other regular structures requires modeling local and long-distance interactions in low-homology sequences
– Long distance: not addressed by n-grams, HMMs, etc.
– Low homology: not addressed by BLAST algorithms

• We focus on minimal mathematical structural modeling
– Segmented conditional random fields
– Layered graphical models
– Fully trainable to recognize new instances of structures

• First acid test: β-helix super-secondary structural prediction (with data and guidance from Prof. J. King at MIT)


Protein Structure Determination

• Lab experiments: time, cost, uncertainty, …
– X-ray crystallography (months to crystallize, uncertain outcome); Nobel Prize, Kendrew & Perutz, 1962
– NMR spectroscopy (only works for small proteins or domains); Nobel Prize, Kurt Wüthrich, 2002

• The gap between sequence and structure necessitates computational methods of protein structure determination
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)

[Figure: example structures 1MBN and 1BUS]

Predicting Protein Structures

• Protein Structure is a key determinant of protein function

• Resolving protein structures experimentally in vitro by crystallography is very expensive; NMR can only resolve very small proteins

• The gap between the known protein sequences and structures:

– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)

– Therefore we need to predict structures in silico


Predicting Tertiary Folds

• Super-secondary structures

– Common protein domains and scaffolding patterns such as regular combinations of β-sheets and/or α-helices

• Our task

– Given a protein sequence, predict super-secondary structures and their components (e.g. β-helices and the location of each rung therein)

• Examples:
– Parallel right-handed β-helix
– Leucine-rich repeats


Parallel Right-handed β-Helix

• Structure
– A regular super-secondary structure with an elongated helix whose successive rungs are composed of β-strands
– Highly conserved T2 turn

• Computational importance
– Long-range interactions
– Repeat patterns

• Biological importance
– Functions such as the bacterial infection of plants, binding the O-antigen, etc.


Conditional Random Fields

• Hidden Markov model (HMM) [Rabiner, 1989]

$$P(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} P(x_i \mid y_i)\, P(y_i \mid y_{i-1})$$

• Conditional random fields (CRFs) [Lafferty et al., 2001]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_0} \exp\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, i, y_{i-1}, y_i) \Big)$$

– Model conditional probability directly (discriminative models, directly optimizable)
– Allow arbitrary dependencies in the observations
– Adaptive to different loss functions and regularizers
– Promising results in multiple applications
– But need to scale up (computationally) and extend to long-distance dependencies
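To make the role of the normalizer concrete, the following Python sketch computes log Z for a linear-chain CRF with the forward recursion and uses it to score a label sequence. It is a generic textbook-style illustration, not the authors' implementation: the callable `score(i, y_prev, y)` stands in for the weighted feature sum at position i, and the labels and potentials in `toy_score` are hypothetical.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def crf_log_z(n_positions, labels, score):
    """log Z via the forward recursion; y_prev is None at the first position."""
    alpha = {y: score(0, None, y) for y in labels}
    for i in range(1, n_positions):
        alpha = {y: logsumexp([alpha[yp] + score(i, yp, y) for yp in labels])
                 for y in labels}
    return logsumexp(list(alpha.values()))

def crf_log_prob(y_seq, labels, score):
    """log P(y | x): unnormalized path score minus log Z."""
    path = sum(score(i, y_seq[i - 1] if i > 0 else None, y)
               for i, y in enumerate(y_seq))
    return path - crf_log_z(len(y_seq), labels, score)

def toy_score(i, y_prev, y):
    """Hypothetical potentials: reward staying in a state, and 'B' at even positions."""
    stay = 1.0 if y_prev == y else 0.0
    bias = 0.5 if (y == "B" and i % 2 == 0) else 0.0
    return stay + bias

LABELS = ("B", "T")
# Probabilities of all 2^3 label sequences of length 3 should sum to one.
total_prob = sum(math.exp(crf_log_prob(seq, LABELS, toy_score))
                 for seq in itertools.product(LABELS, repeat=3))
```

The recursion costs O(N·|labels|²), which is why the chain-structured case is tractable while graphs with long-range edges need further approximation.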


Our Solution: Conditional Graphical Models

• Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}

• Feature definition

– Node feature

$$f_k(w_i, \mathbf{x}) = f'_k(\mathbf{x}, p_i, q_i)\, I(s_i = s',\ q_i - p_i + 1 = d')$$

– Local interaction feature

$$f_k(w_{i-1}, w_i, \mathbf{x}) = I(s_{i-1} = s,\ s_i = s',\ p_i = q_{i-1} + 1)$$

– Long-range interaction feature

$$f_k(w_i, w_j, \mathbf{x}) = g'_k(\mathbf{x}, p_i, q_i, p_j, q_j)\, I(s_i = s,\ s_j = s')$$

[Figure labels: long-range dependency, local dependency]


Linked Segmentation CRF

• Node: secondary structure elements and/or simple fold

• Edges: Local interactions and long-range inter-chain and intra-chain interactions

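• L-SCRF: conditional probability of y given x is defined as

$$P(\mathbf{y}_1, \ldots, \mathbf{y}_R \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big( \sum_k \lambda_k f_k(\mathbf{x}_i, y_{i,j}) \Big) \prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big( \sum_l \mu_l\, g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b}) \Big)$$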

Joint Labels


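Linked Segmentation CRF (II)

• Classification:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, Y_c)$$

• Training: learn the model parameters λ

– Minimizing regularized negative log loss

$$L = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, \mathbf{y}_c) - \log Z + \Omega(\|\lambda\|)$$

– Iterative search algorithms by seeking the direction whose empirical values agree with the expectation

$$\frac{\partial L}{\partial \lambda_k} = \sum_{c \in C_G} \Big( f_k(\mathbf{x}, \mathbf{y}_c) - E_{p(\mathbf{y} \mid \mathbf{x})}\big[ f_k(\mathbf{x}, Y_c) \big] \Big) + \frac{\partial \Omega(\|\lambda\|)}{\partial \lambda_k} = 0$$

• Complex graphs result in huge computational complexity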


Model Roadmap

Conditional random fields [Lafferty et al., 2001]

Segmentation CRFs (Liu & Carbonell 2005)

Chain graph model (Liu, Xing & Carbonell, 2006)

Linked segmentation CRFs (Liu & Carbonell, 2007)

Long-range

Trade-off between local and long-range

Inter-chain long-range

Semi-Markov CRFs [Sarawagi & Cohen, 2005]

Beyond Markov dependencies

Generalized discriminative graphical models


Tertiary Fold Recognition: β-Helix fold

• Histogram and ranks for known β-helices against PDB-minus dataset

Chain graph model reduces the actual running time of the SCRF model by a factor of around 50


Fold Alignment Prediction: β-Helix

• Predicted alignment for known β -helices on cross-family validation


Discovery of New Potential β-helices

• Run structural predictor seeking potential β-helices from Uniprot (structurally unresolved) databases

– Full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html

• Verification on 3 proteins with later experimentally resolved structures from different organisms

– 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase

– 1PXZ: The Major Allergen From Cedar Pollen

– GP14 of Shigella bacteriophage as a β-helix protein

– No single false positive!



Predicting Quaternary Folds

• Triple beta-spirals [van Raaij et al. Nature 1999]

– Virus fibers in adenovirus, reovirus and PRD1

• Double barrel trimer [Benson et al, 2004]

– Coat protein of adenovirus, PRD1, STIV, PBCV


Features for Protein Fold Recognition


Experiment Results: Quaternary Fold Recognition

[Figure panels: Double barrel-trimers; Triple beta-spirals]


Experiment Results: Alignment Prediction

Triple beta-spirals

Four states: B1, B2, T1 and T2

Correct Alignment:

B1: i – o B2: a - h

Predicted Alignment

B1 B2


Experiment Results: Discovering New Membership Proteins

• Predicted membership proteins of triple beta-spirals can be accessed at

http://www.cs.cmu.edu/~yanliu/swissprot_list.xls

• Membership proteins of double barrel-trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions


Conclusions & Challenges for Protein Structure/Function Prediction

• Methods from modern Machine Learning and Language Technologies really work in Computational Proteomics

– Family/subfamily/sub-subfamily predictions
– Protein-protein interactions (GPCRs ↔ G-proteins)
– Accurate tertiary & quaternary fold structural predictions

• Next generation of model sophistication…

• Addressing new challenges

– Structure → Function: structural predictions combined with binding-site & specificity analysis

– Predictive Inversion: Function → Structure → Sequence for new hyper-specific drug design (anti-viral, oncology)


Proteins and Interactions

• Every function in the living cell depends on proteins

• Proteins are made of a linear sequence of amino acids and folded into unique 3D structures

• Proteins can bind to other proteins physically

– Enables them to carry out diverse cellular functions


Protein-Protein Interaction (PPI) Network

• PPIs play key roles in many biological systems

• A complete PPI network (naturally a graph)

– Critical for analyzing protein functions & understanding the cell

– Essential for diseases studies & drug discoveries


PPI Biological Experiments

• Small-scale PPI experiments
– One protein or several proteins at a time
– Small amount of available data
– Expensive and slow lab process

• Large-scale PPI experiments
– Hundreds / thousands of proteins at a time
– Noisy and incomplete data
– Little overlap among different sets

A large portion of the PPIs is still missing or noisy!


Learning of PPI Networks

• Goal I: Pairwise PPI (links of PPI graph)
– Most pairwise protein-protein interactions have not been identified or are noisy
– → Missing link prediction!

• Goal II: “Complex” (important groups)
– Proteins often interact stably and perform functions together as one unit (“complex”)
– Most complexes have not been discovered
– → Important group detection!

[Figure: PPI network, with link prediction yielding pairwise interactions and group detection yielding protein complexes]


Goal I: Missing Link Prediction


PPI Network


Related Biological Data

• Overall, four categories:

– Direct high-throughput experimental data: Two-hybrid screens (Y2H) and mass spectrometry (MS)

– Indirect high throughput data: Gene expression, protein-DNA binding, etc.

– Functional annotation data: Gene ontology annotation, MIPS annotation, etc.

– Sequence based data sources: Domain information, gene fusion, homology based PPIs, etc.

direct

Indirect

Utilize implicit evidence and available direct experimental results together


Related Data Evidence

Relational Evidence Between Proteins

1 Synthetic lethal

Attribute Evidence of Each Protein

Expression

Structure

Sequence

Annotation

……

……

Relation expanding1


Feature Vector for (Pairwise) Pairs

– For data representing protein-protein pairs, use directly

– For data representing single protein (gene), calculate the (biologically meaningful) similarity between two proteins for each evidence

Synthetic lethal: 1……

Sequence SimilarityGeneExp CorrelationCoeff…

Pair A-B: fea1, fea2, fea3, …….

Sequence: mtaaqaagee…

GeneExp: 233.94, 162.85, ...

….

Sequence: mrpsgtagaa…

GeneExp: 109.4, 975.3, ...

…

Protein B Protein A

Pair A-B
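The pair encoding described above can be sketched in Python. This is an assumption-laden illustration: Pearson correlation is used as the similarity for gene-expression profiles (the slide only says "CorrelationCoeff"), sequence similarity is omitted because it would come from an external alignment tool, and `pair_features`, the record layout, and the example proteins are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u and sd_v else 0.0

def pair_features(protein_a, protein_b, pair_evidence):
    """Build one feature vector for the pair A-B.

    pair_evidence maps frozenset({name_a, name_b}) to relational evidence
    (e.g. a synthetic-lethal flag), which is used directly; per-protein
    attributes are turned into a similarity between the two proteins."""
    key = frozenset((protein_a["name"], protein_b["name"]))
    return [
        pair_evidence.get(key, 0),
        pearson(protein_a["gene_exp"], protein_b["gene_exp"]),
    ]

# Hypothetical records with made-up expression values.
prot_a = {"name": "A", "gene_exp": [1.0, 2.0, 3.0]}
prot_b = {"name": "B", "gene_exp": [2.0, 4.0, 6.0]}
features_ab = pair_features(prot_a, prot_b, {frozenset(("A", "B")): 1})
```

Because the key is an unordered set, the pair A-B and the pair B-A map to the same feature vector.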


Problem Setting

• For each protein-protein pair:
– Target function: interacts or not?
– Treat as a binary classification task

• Feature set
– Features are heterogeneous
– Most features are noisy
– Most features have missing values

• Reference set:
– Small-scale PPI set as positive training (hundreds / thousands)
– No negative set (non-interacting pairs) available
– Highly skewed class distribution
» Many more non-interacting pairs than interacting pairs
» Estimated: 1 out of ~600 in yeast; 1 out of ~1000 in human


PPI Inference via ML Methods

• Jansen, R., et al., Science 2003: Bayes classifier

• Lee, I., et al., Science 2004: Sum of log-likelihood ratios

• Zhang, L., et al., BMC Bioinformatics 2004: Decision tree

• Bader, J., et al., Nature Biotech 2004: Logistic regression

• Ben-Hur, A., et al., ISMB 2005: Kernel method

• Rhodes, D.R., et al., Nature Biotech 2005: Naïve Bayes

Present focus: Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006


Predicting Pairwise PPIs

– Prediction target (three types)
» Physical interaction
» Co-complex relationship
» Pathway co-membership inference

– Feature encoding
» (1) “detailed” style, and (2) “summary” style
» Feature importance varies

– Classification methods
» Random Forest & Support Vector Machine

Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006

Details in the paper


Human Membrane Receptors

[Figure: ligands, Type I and Type II (GPCR) receptors, other membrane proteins, and signal transduction cascades, across the extracellular, transmembrane and cytoplasmic regions]

PPI Predictions for Human Membrane Receptors

• A combined approach

– Binary classification

– Global graph analysis

– Biological feedback & validation

Y. Qi, et al 2008


• Random Forest Classifier
– A collection of independent decision trees (ensemble classifier)
– Each tree is grown on a bootstrap sample of the training set
– Within each tree's training, for each node, the split is chosen from a random subset of the attributes

Binary Classification

[Figure: example decision tree with nodes GeneExpress, TAP, Y2H, GOProcess, HMS_PCI, GeneOccur, GOLocalization, ProteinExpress, Domain and SynExpress, and Y / N branches]

• Robust to noisy features
• Can handle different types of features
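The two sources of randomness in a random forest can be sketched in plain Python. This is a deliberately simplified illustration, not the classifier used in the study: each "tree" is a depth-1 stump over binary attributes that predicts the value of its chosen attribute, and the attribute names and toy rows are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, candidate_attrs):
    """Pick the candidate attribute whose 0/1 value disagrees least with the label."""
    def errors(attr):
        return sum(1 for row, y in zip(rows, labels) if row[attr] != y)
    return min(candidate_attrs, key=errors)

def train_forest(rows, labels, n_trees, attrs_per_split, seed=0):
    rng = random.Random(seed)
    attrs = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) training examples with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random subset of attributes considered for this split.
        subset = rng.sample(attrs, attrs_per_split)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], subset))
    return forest

def predict(forest, row):
    """Majority vote over the stumps; each stump votes its attribute's value."""
    votes = Counter(row[attr] for attr in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: Y2H equals the label, TAP is always wrong, GeneExpress is constant.
toy_rows = [{"Y2H": y, "TAP": 1 - y, "GeneExpress": 1} for y in (1, 0, 1, 0, 1, 0)]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_forest = train_forest(toy_rows, toy_labels, n_trees=15, attrs_per_split=3)
toy_prediction = predict(toy_forest, {"Y2H": 1, "TAP": 0, "GeneExpress": 1})
```

A full implementation would grow deep trees and redraw the attribute subset at every node; the voting step is the same.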


• Compare classifiers
• Receptor PPI (sub-network) to general human PPI prediction

Classifier Comparison

(27 features extracted from 8 different data sources, modified with biological feedback)

Global Graph Analysis

• Degree distribution / Hub analysis / Disease checking

• Graph modules analysis (from bi-clustering study)

• Protein-family based graph patterns (receptors / receptors subclasses / ligands / etc )



Global Graph Analysis

Network analysis reveals interesting features of the human membrane receptor PPI graph


For instance:

• Two types of receptors (GPCR and non-GPCR (Type I))

• GPCRs are less densely connected than non-GPCRs (green: non-GPCR receptors; blue: GPCRs)


Experimental Validation

• Five predictions were chosen for experiments and three were verified

– EGFR with HCK (pull-down assay)

– EGFR with Dynamin-2 (pull-down assay)

– RHO with CXCL11 (functional assays, fluorescence spectroscopy, docking)

– Experiments @ U.Pitt School of Medicine

Y. Qi, et al 2008

Details in the paper


Motivation

• Current situation of PPI task

– Only a small positive (interacting) set available

– No negative (not interacting) set available

– Highly skewed class distribution
» Many more non-interacting pairs than interacting pairs

– The cost for misclassifying an interacting pair is higher than for a non-interacting pair

– Accuracy measure is not appropriate here

• Try to handle this task with ranking

– Rank the known positive pairs as high as possible

– At the same time, have the ability to rank the unknown positive pairs as high as possible


Split Features into Multi-View

• Overall, four feature groups:

– P: Direct high-throughput experimental data: Two-hybrid screens (Y2H) and mass spectrometry (MS)

– E: Indirect high throughput data: Gene expression, protein-DNA binding, etc.

– F: Functional annotation data: Gene ontology annotation, MIPS annotation, etc.

– S: Sequence based data sources: Domain information, gene fusion, homology based PPIs, etc.

Direct

Genomic

Functional

Sequence

Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, BMC Bioinformatics 2007


Mixture of Feature Experts (MFE)

• Make protein interaction prediction by

– Weighted voting from the four roughly homogeneous feature categories

– Treat each feature group as a prediction expert

– The weights are also dependent on the input example

• Hidden variable, M modulates the choice of expert

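[Figure: experts P, F, S and E each predict “Interact?”; the hidden variable M selects among them]

$$p(Y \mid X) = \sum_{M} p(Y \mid X, M)\, p(M \mid X)$$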


Mixture of Four Feature Experts

• Parameters are trained using EM

• Experts and root gate use logistic regression (ridge estimator)

Expert P: Direct PPI high-throughput experiment data

Expert F: Function annotation of proteins

Expert S: Sequence- or structure-based evidence

Expert E: Indirect high-throughput experimental data

$$p(y^{(n)} \mid x^{(n)}) = \sum_{i=1}^{4} p(m_i^{(n)} = 1 \mid x^{(n)}, v) \cdot p(y^{(n)} \mid x^{(n)}, m_i^{(n)} = 1, w_i)$$

with parameters (w_i, v).


Mixture of Four Feature Experts

• Handling missing value

– Add additional feature column for each feature having low feature coverage

– MFE uses present / absent information when weighting different feature groups

• The posterior weight for expert i in predicting pair n

– The weight can be used to indicate the importance of that feature view ( expert ) for this specific pair

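$$h_i^{(n)} = P(m_i^{(n)} = 1 \mid y^{(n)}, x^{(n)}, v^t, w_i^t) = \frac{P(m_i^{(n)} = 1 \mid x^{(n)}, v^t)\; p(y^{(n)} \mid x^{(n)}, m_i^{(n)} = 1, w_i^t)}{\sum_{j=1}^{4} P(m_j^{(n)} = 1 \mid x^{(n)}, v^t)\; p(y^{(n)} \mid x^{(n)}, m_j^{(n)} = 1, w_j^t)}$$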
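The mixture prediction and the posterior expert weights can be sketched together in Python. This is a minimal illustration under assumptions: the gate is modeled as a softmax over one weight vector per expert (the slides only say the root gate uses logistic regression), each expert is a logistic regression on its own feature view, and every function name, weight and feature value below is hypothetical.

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_probs(x, v):
    """P(m_i = 1 | x, v): softmax over one gate weight vector per expert (assumed form)."""
    scores = [dot(v_i, x) for v_i in v]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expert_probs(x_views, w, y):
    """p(y | x, m_i = 1, w_i): logistic regression on expert i's own feature view."""
    probs = []
    for x_i, w_i in zip(x_views, w):
        p_interact = sigmoid(dot(w_i, x_i))
        probs.append(p_interact if y == 1 else 1.0 - p_interact)
    return probs

def mfe_predict(x, x_views, v, w, y=1):
    """p(y | x): gate-weighted sum of the expert predictions."""
    return sum(g * e for g, e in zip(gate_probs(x, v), expert_probs(x_views, w, y)))

def posterior_weights(x, x_views, v, w, y):
    """h_i: posterior responsibility of each expert for this pair (the E-step quantity)."""
    joint = [g * e for g, e in zip(gate_probs(x, v), expert_probs(x_views, w, y))]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical pair with four one-dimensional views (P, E, F, S).
views = [[1.0], [0.2], [2.0], [-0.5]]
x_all = [1.0, 0.2, 2.0, -0.5]
gate_v = [[0.0] * 4 for _ in range(4)]          # uniform gate
expert_w = [[0.5], [0.1], [1.5], [0.3]]
p_interact = mfe_predict(x_all, views, gate_v, expert_w, y=1)
h = posterior_weights(x_all, views, gate_v, expert_w, y=1)
```

The expert with the largest posterior weight is the feature view that contributes most to the prediction for that specific pair.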


Performance

• 162 features for yeast physical PPI prediction task

• Features extracted in “detail” encoding

• Under “detail” encoding, the ranking method is almost the same as RF (not shown)


Functional Expert Dominates

Figure: The frequency at which each of the four experts has maximum contribution among validated and predicted pairs

300 candidate protein pairs

51 predicted interactions

33 validated already

18 newly predicted


Protein Complex

• Proteins form associations with multiple protein binding partners stably (termed “complex”)

• Complex members interact with part of the group and work together as a unit

• Identification of these important sub-structures is essential to understand activities in the cell

Group detection within the PPI network


Identify Complex in PPI Graph

• PPI network as a weighted undirected graph

– Edge weights derived from supervised PPI predictions:

• Previous work

– Unsupervised graph clustering style

– All rely on the assumption that complexes correspond to the dense regions of the network

• Related facts

– Many other possible topological structures

– A small number of complexes available from reliable experiments

– Complexes also have functional /biological properties (like weight / size / …)


Possible topological structures

Edge weight color coded

• Make use of the small number of known complexes → supervised
• Model the possible topological structures → subgraph statistics
• Model the biological properties of complexes → subgraph features


Properties of Subgraph

• Subgraph properties as features in BN

– Various topological properties from graph

– Biological attributes of complexes

No. Sub-Graph Property

1 Vertex Size

2 Graph Density

3 Edge Weight Ave / Var

4 Node degree Ave / Max

5 Degree Correlation Ave / Max

6 Clustering Coefficient Ave / Max

7 Topological Coefficient Ave / Max

8 First Two Eigenvalues

9 Fraction of Edge Weight > Certain Cutoff

10 Complex Member Protein Size Ave / Max

11 Complex Member Protein Weight Ave / Max

5/14/2008
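A few of the topological properties in the table can be computed directly from a weighted edge list. The Python sketch below covers vertex size, density, edge-weight average and variance, node-degree average and maximum, and the fraction of edge weights above a cutoff; the function name, the edge representation (a dict keyed by frozenset pairs) and the cutoff of 0.5 are assumptions for illustration.

```python
def subgraph_properties(nodes, weighted_edges, cutoff=0.5):
    """weighted_edges maps frozenset({u, v}) to an edge weight in the full graph."""
    nodes = set(nodes)
    # Keep only edges with both endpoints inside the candidate subgraph.
    inside = {e: w for e, w in weighted_edges.items() if len(e) == 2 and e <= nodes}
    n = len(nodes)
    degree = {v: 0 for v in nodes}
    for edge in inside:
        for v in edge:
            degree[v] += 1
    weights = list(inside.values())
    mean_w = sum(weights) / len(weights) if weights else 0.0
    var_w = sum((w - mean_w) ** 2 for w in weights) / len(weights) if weights else 0.0
    return {
        "vertex_size": n,
        "density": 2 * len(inside) / (n * (n - 1)) if n > 1 else 0.0,
        "edge_weight_ave": mean_w,
        "edge_weight_var": var_w,
        "degree_ave": sum(degree.values()) / n if n else 0.0,
        "degree_max": max(degree.values()) if degree else 0,
        "fraction_above_cutoff": (sum(w > cutoff for w in weights) / len(weights)
                                  if weights else 0.0),
    }

# Hypothetical toy graph: a triangle a-b-c plus an edge leaving the subgraph.
toy_edges = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.6,
    frozenset(("a", "c")): 0.3,
    frozenset(("c", "d")): 0.8,
}
props = subgraph_properties(["a", "b", "c"], toy_edges)
```

Continuous values such as density would be discretized before being used as nodes in the Bayesian network.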


Model Complex Probabilistically

• Bayesian Network (BN)

– C : If this subgraph is a complex (1) or not (0)

– N : Number of nodes in subgraph

– Xi : Properties of subgraph

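[Figure: Bayesian network with class node C, size node N, and property nodes X_1, …, X_m]

$$L = \log \frac{p(c = 1 \mid n, x_1, x_2, \ldots, x_m)}{p(c = 0 \mid n, x_1, x_2, \ldots, x_m)}$$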

Assume a probabilistic model (Bayesian Network) for representing complex sub-graphs


Model Complex Probabilistically

• BN parameters trained with MLE

– Trained from known complexes and random sampled non-complexes

– Discretize continuous features

– Bayesian Prior to smooth the multinomial parameters

• Evaluate candidate subgraphs with the log ratio score L

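$$L = \log \frac{p(c = 1 \mid n, x_1, x_2, \ldots, x_m)}{p(c = 0 \mid n, x_1, x_2, \ldots, x_m)} = \log \frac{p(c = 1)\, p(n \mid c = 1) \prod_{k=1}^{m} p(x_k \mid n, c = 1)}{p(c = 0)\, p(n \mid c = 0) \prod_{k=1}^{m} p(x_k \mid n, c = 0)}$$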
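The log ratio score factorizes into a prior term, a size term and one term per subgraph property, which the following Python sketch evaluates from lookup tables. The table layout, the discretized values ("small", "high") and the probabilities are hypothetical placeholders; in practice they would be the smoothed multinomial estimates learned from known complexes and sampled non-complexes.

```python
import math

def log_ratio_score(n, xs, prior, size_given_c, feature_given_nc):
    """L = log[p(c=1) p(n|c=1) prod_k p(x_k|n,c=1)] - log[the same with c=0].

    prior[c]                      -> p(c)
    size_given_c[c][n]            -> p(n | c)
    feature_given_nc[c][k][(n, x)] -> p(x_k = x | n, c)
    """
    def log_joint(c):
        total = math.log(prior[c]) + math.log(size_given_c[c][n])
        for k, x_k in enumerate(xs):
            total += math.log(feature_given_nc[c][k][(n, x_k)])
        return total
    return log_joint(1) - log_joint(0)

# Hypothetical tables with one discretized property (e.g. density).
toy_prior = {1: 0.5, 0: 0.5}
toy_size = {1: {"small": 0.5}, 0: {"small": 0.5}}
toy_feature = {
    1: [{("small", "high"): 0.8}],
    0: [{("small", "high"): 0.2}],
}
toy_score = log_ratio_score("small", ["high"], toy_prior, toy_size, toy_feature)
```

A positive score means the subgraph looks more like a complex than a non-complex under the model, so candidate subgraphs can be ranked by L.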


Experimental Setup

• Positive training data:

– Set1: MIPS Yeast complex catalog: a curated set of ~100 protein complexes

– Set2: TAP05 Yeast complex catalog: a reliable experimental set of ~130 complexes

– Complex size (number of nodes) follows a power law

• Negative training data

– Generate from randomly selected nodes in the graph

– Size distribution follows the same power law as the positive complexes


Evaluation

• Train-Test style (Set1 & Set2)

• Precision / Recall / F1 measures

• A cluster “detects” a complex if

$$\frac{C}{A + C} > p \quad \& \quad \frac{C}{B + C} > p$$

A: number of proteins only in the cluster
B: number of proteins only in the complex
C: number of proteins shared

Overlapping threshold p set to 50%

[Figure: Venn diagram of a detected cluster (A) and a known complex (B) with overlap C]
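The detection criterion is easy to state as a set computation. The Python sketch below assumes strict inequalities against the threshold p and treats a cluster with no shared proteins as not detecting anything; the function name and the protein identifiers are hypothetical.

```python
def detects(cluster, known_complex, p=0.5):
    """True if the cluster detects the complex: C/(A+C) > p and C/(B+C) > p."""
    cluster, known_complex = set(cluster), set(known_complex)
    shared = len(cluster & known_complex)          # C: proteins in both
    only_cluster = len(cluster - known_complex)    # A: proteins only in the cluster
    only_complex = len(known_complex - cluster)    # B: proteins only in the complex
    if shared == 0:
        return False
    return (shared / (only_cluster + shared) > p
            and shared / (only_complex + shared) > p)

# Hypothetical examples: three of four proteins shared, then one of two shared.
good_match = detects({"a", "b", "c", "d"}, {"b", "c", "d", "e"})
weak_match = detects({"a", "b"}, {"b", "c", "d"})
```

Requiring both ratios to exceed p prevents a very large cluster from trivially "detecting" every small complex it happens to contain.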


Performance Comparison

• On yeast predicted PPI graph (~2000 nodes)

• Compare to a popular complex detection package: MCODE (search for highly interconnected regions)

• Compare to local search relying on density evidence only

• Compared to local search with complex score from SVM (also supervised)

Methods | Precision | Recall | F1
Density | 0.180 | 0.462 | 0.253
MCODE | 0.219 | 0.075 | 0.111
SVM | 0.211 | 0.377 | 0.269
BN | 0.266 | 0.513 | 0.346


Learning PPI Networks

[Figure: overview of PPI network learning tasks (PPI network, protein complex, pathway, function implication, domain/motif interactions, human PPI, HIV-human PPI) annotated with publication venues: PSB 05, PROTEINS 06, BMC Bioinfo 07, CCR 08, ISMB 08, Genome Biology 08; some items marked “Revise 08”, “Revise” or “Prepare”]


Inter species interactome

What are the interacting proteins between two organisms?


HIV-1 host protein interactions

HIV-1 depends on the cellular machinery in every aspect of its life cycle.

[Figure: HIV-1 life cycle stages: fusion, reverse transcription, maturation, budding, transcription]

Peterlin and Trono, Nature Rev Immunol 2003.


HIV-1 host protein interactions

Human protein

HIV protein


FIN

Questions ?
