Computational Proteomics: Structure/Function Prediction
& the Protein Interactome
Jaime Carbonell, with Betty Cheng, Yan Liu, Eric Xing, Yanjun Qi, Judith Klein-Seetharaman, and Oznur Tastan
Carnegie Mellon University, Pittsburgh PA, USA
December, 2008
© 2003, Jaime Carbonell2
Simplified View of Biology
Nobelprize.org
Protein sequence
Protein structure
Primary Sequence
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
3D Structure
Folding
Complex function within network of proteins
Normal
PROTEINS: Sequence → Structure → Function
(Borrowed from: Judith Klein-Seetharaman)
Primary Sequence
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
3D Structure
Folding
Complex function within network of proteins
Disease
PROTEINS: Sequence → Structure → Function
Motivation: Protein Structure and Function Prediction
• Ultimate goal: Sequence → Function
– …and Function → Sequence (drug design, …)
– Potential active binding sites are a good start, but how about stability, external accessibility, energetics, …
• Intermediate goal: Sequence → Structure
– Only 1.2% of proteins have been structurally resolved
– What-if analysis (precursor of mutagenesis experiments)
• Machine Learning & Language Technologies methods
– Power tools to model and predict structure & function
– Computational biology challenges are starting to drive new research in Machine Learning & Language Technologies
OUTLINE
• Motivation: sequence → structure → function
• Vocabulary-based classification approaches (Betty Cheng, Jaime Carbonell, Judith Klein-Seetharaman)
– GPCR subfamily classification
– Protein-protein coupling specificity
• Solving the “Folding Problem”: Machine Learning Approaches to Structure Prediction (Yan Liu, Jaime Carbonell, et al.)
– Tertiary folds: β-helix prediction via segmented CRFs
– Quaternary folds: viral adhesin and capsid complexes
• Conclusions and future directions
GPCR Super-family: G-Protein Coupled Receptors
• Transmembrane protein
• Target of 60% of drugs (Moller, 2002)
• Involved in cancer, cardiovascular disease, Alzheimer’s and Parkinson’s diseases, stroke, diabetes, and inflammatory and respiratory diseases
[Figure: GPCR topology in the membrane, showing transmembrane helices I to VII, the N-terminus, the C-terminus, the extracellular loops and the intracellular loops]
Protein Family & Subfamily Classification (applied to GPCRs)
Subfamily classification based on pharmaceutical properties
Comparative Study – Karchin et al., 2002
[Figure: classifiers arranged on a scale from complex to simple]
Complex: Support Vector Machines, Neural Nets, Clustering
Hidden Markov Models
K-Nearest Neighbours, BLAST
Simple: Decision Trees, Naïve Bayes
SVM is the best for subfamily classification (Karchin et al., 2002)
Traditionally, hidden Markov models, k-nearest neighbours and BLAST have been used.
Recently, more complicated classifiers have been used.
Karchin et al. (2002) studied a range of classifiers of varied complexity in GPCR subfamily classification.
But what about those simple classifiers at the other end of the scale?
Hypothesis: Bio-vocabulary selection is crucial for sub-family
classification (and protein-protein interaction prediction)
Study “segments” with different vocabulary
AA, chemical groups, properties of AA
Computing Chi-Square
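\[ \chi^2(x) = \sum_{c \in C} \frac{\left[ o(c,x) - e(c,x) \right]^2}{e(c,x)}, \qquad e(c,x) = \frac{n_c \, t_x}{N} \]
where
– o(c,x): observed # of sequences in class c with feature x
– e(c,x): expected # of sequences in class c with feature x
– N: total # of sequences
– t_x: # of sequences with feature x
– n_c: # of sequences in class c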
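The chi-square score for a feature x sums, over classes c, the squared gap between the observed count o(c,x) and the expected count e(c,x) = n_c t_x / N, divided by e(c,x). A minimal sketch of that scoring follows. The function names, the use of bigrams as features, and the toy sequences are illustrative assumptions, not the setup used in the study.

```python
from collections import Counter

def chi_square_scores(sequences, labels, features_of):
    """Score each feature x by sum_c (o(c,x) - e(c,x))^2 / e(c,x)."""
    n_total = len(sequences)        # N: total number of sequences
    n_class = Counter(labels)       # n_c: sequences per class
    t_feat = Counter()              # t_x: sequences containing feature x
    observed = Counter()            # o(c, x): sequences of class c containing x
    for seq, c in zip(sequences, labels):
        for x in features_of(seq):  # features_of returns a set (binary presence)
            t_feat[x] += 1
            observed[(c, x)] += 1
    scores = {}
    for x, t_x in t_feat.items():
        score = 0.0
        for c, n_c in n_class.items():
            expected = n_c * t_x / n_total   # e(c, x)
            score += (observed[(c, x)] - expected) ** 2 / expected
        scores[x] = score
    return scores

def bigrams(seq):
    """Hypothetical feature extractor: the set of 2-grams in a sequence."""
    return {seq[i:i + 2] for i in range(len(seq) - 1)}

# Toy data (made up): two sequences per class.
toy_seqs = ["MGNA", "MGNS", "KLLV", "KLLA"]
toy_labels = ["A", "A", "B", "B"]
scores = chi_square_scores(toy_seqs, toy_labels, bigrams)
top_features = sorted(scores, key=scores.get, reverse=True)[:4]
```

Features that occur in every sequence of one class and in none of the other (here `MG`, `GN`, `KL`, `LL`) receive the highest scores and would be kept first when selecting a fixed number of features.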
Level I Subfamily Optimization
[Figure: accuracy vs. number of features for Decision Trees and Naïve Bayes, with binary features and n-gram counts]
Level I Subfamily Results
Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 5500-7700 | Binary | 93.0 %
Naïve Bayes | 3300-6900 | N-gram counts | 90.6 %
Naïve Bayes | All (9702) | N-gram counts | 90.0 %
SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4 %
BLAST | - | Local sequence alignment | 83.3 %
Decision Tree | 900-2800 | Binary | 77.3 %
Decision Tree | 700-5600 | N-gram counts | 77.3 %
Decision Tree | All (9723) | N-gram counts | 77.2 %
SAM-T2K HMM | - | A HMM model built for each protein subfamily | 69.9 %
kernNN | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 64.0 %
Level II Subfamily Results
Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 8100 | Binary | 92.4 %
SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 86.3 %
Naïve Bayes | 5600 | N-gram counts | 84.2 %
SVMtree | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 82.9 %
Naïve Bayes | All (9702) | N-gram counts | 81.9 %
BLAST | - | Local sequence alignment | 74.5 %
Decision Tree | 1200 | N-gram counts | 70.8 %
Decision Tree | 2300 | Binary | 70.2 %
SAM-T2K HMM | - | A HMM model built for each protein subfamily | 70.0 %
Decision Tree | All (9723) | N-gram counts | 66.0 %
kernNN | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 51.0 %
Top 20 selected “words” for Class B GPCRs. They correlate with identified motifs.
– Helices 3 and 7 are known to be important for signal transduction
– Loop 1 is a suspected common binding site
Generalization to Other Superfamilies: Nuclear Receptors
Dataset | Feature Type | # of Features | Validation Accuracy | Testing Accuracy
Family | Binary | 1500-4200 | 96.96% | 94.53%
Family | N-gram counts | 400-4900 | 95.75% | 91.79%
Level I Subfamily | Binary | 1500-3100 | 98.09% | 97.77%
Level I Subfamily | N-gram counts | 500-1100 | 93.95% | 91.40%
Level II Subfamily | Binary | 1500-2100 | 95.32% | 93.62%
Level II Subfamily | N-gram counts | 3100-5600 | 86.39% | 85.54%
G-Protein Coupling Specificity Problem
• Predict which one or more families of G-proteins a GPCR can couple with, given the GPCR sequence
• Locate regions in the GPCR sequence where the majority of coupling specificity information lies
G-Protein Family | Function
Gs | Activates adenylyl cyclase
Gi/o | Inhibits adenylyl cyclase
Gq/11 | Activates phospholipase C
G12/13 | Unknown
N-gram Based Component
• Extract n-grams from all possible reading frames
• Use a set of binary k-NN classifiers, one for each G-protein family, to predict whether the receptor couples to the family
• Predict coupling if the k-NN outputs a probability at or above the trained threshold
[Flowchart: test sequence (MGNASNDSQSEDCETRQWLPPGESPAI …) → counts of all n-grams → K-NN classifier → Pr(coupling to family C) ≥ threshold? Yes: predict coupling to family C. No: predict no coupling to family C.]
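The n-gram component can be sketched as follows: count n-grams at every offset of the sequence (so every reading frame is covered), find the k most similar training receptors, and predict coupling when the fraction of coupling neighbours reaches a threshold. Cosine similarity, the n-gram lengths 1 to 3, k = 3, the use of the neighbour fraction as the probability, and the toy sequences are all assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter

def ngram_counts(seq, n_values=(1, 2, 3)):
    """Count n-grams starting at every position, i.e. in all reading frames."""
    counts = Counter()
    for n in n_values:
        for start in range(len(seq) - n + 1):
            counts[seq[start:start + n]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed metric)."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_coupling_probability(test_seq, training, k=3):
    """training: list of (sequence, couples_to_family) pairs for one G-protein family."""
    test_vec = ngram_counts(test_seq)
    ranked = sorted(training, key=lambda item: cosine(test_vec, ngram_counts(item[0])), reverse=True)
    nearest = ranked[:k]
    return sum(1 for _, couples in nearest if couples) / len(nearest)

def predict_coupling(test_seq, training, threshold, k=3):
    """Binary decision for one family: probability >= trained threshold."""
    return knn_coupling_probability(test_seq, training, k) >= threshold

# Toy training set (made-up sequences and labels).
training = [
    ("MGNASNDSQ", True), ("MGNTSNDSQ", True), ("MDNASNESQ", True),
    ("KLLVVAAPW", False), ("KLLVIAAPW", False),
]
probability = knn_coupling_probability("MGNASNDSE", training, k=3)
decision = predict_coupling("MGNASNDSE", training, threshold=0.5, k=3)
```

One such classifier would be trained per G-protein family, each with its own threshold, so a receptor can be predicted to couple to more than one family.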
Alignment-Based Component
• A set of binary k-NN classifiers, one for each G-protein family, to predict whether the receptor couples to the family
• Predict coupling if more than x% of the retrieved sequences couple to the family
• 2 parameters:
– Number of neighbours, K
– Threshold x%
[Flowchart: test sequence (MGNASNDSQSEDCETRQWLPPGESPAI …) → BLAST → K1 most similar sequences (e.g. MDNTSNDSQSENREEPLWLPSGESPAIS …, MDNFLNDSKLMEDCKSRQWLLSGESPAI …, MNESYRCQTSTWVERGSSATMGAVLFG …) → x% of the K1 sequences couple to family C? Yes: predict coupling to family C. No: predict no coupling to family C.]
Our Hybrid Method: Combining Alignment and N-grams
[Flowchart: test sequence (MGNASNDSQSEDCETRQWLPPGESPAI …) → BLAST K-NN with x% = 100%. Yes: predict coupling to family C. No: → N-gram K-NN. Yes: predict coupling to family C. No: predict no coupling to family C.]
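Read as a cascade, the hybrid method first asks the alignment-based k-NN with x% = 100%, and falls back to the n-gram k-NN only when the alignment vote is not unanimous. A minimal sketch of that decision logic is below. The function and argument names are hypothetical, and treating "x% = 100%" as "all retrieved neighbours couple" is an interpretation of the flowchart.

```python
def hybrid_predict(alignment_fraction, ngram_probability, ngram_threshold):
    """Cascade of the two components for one G-protein family.

    alignment_fraction: share of the K most similar BLAST hits that couple to the family.
    ngram_probability: probability output by the n-gram k-NN for the same family.
    ngram_threshold: trained threshold of the n-gram component.
    """
    if alignment_fraction >= 1.0:   # x% = 100%: every retrieved neighbour couples
        return True
    # Otherwise defer to the n-gram component.
    return ngram_probability >= ngram_threshold

# Example calls; 0.66 is the hybrid n-gram threshold reported in the results table.
unanimous = hybrid_predict(1.0, 0.10, 0.66)
ngram_yes = hybrid_predict(0.8, 0.70, 0.66)
ngram_no = hybrid_predict(0.8, 0.50, 0.66)
```

The cascade uses alignment only where it is most reliable (unanimous neighbours) and leaves every other case to the n-gram evidence.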
Evaluation Metrics & Dataset
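\[ \text{Recall} = \frac{A}{A+C}, \qquad \text{Precision} = \frac{A}{A+B}, \qquad \text{Accuracy} = \frac{A+D}{A+B+C+D}, \qquad F_1 = \frac{2PR}{P+R} \]
(P = Precision, R = Recall)

Predict \ Truth | Couplings | Non-Couplings
Couplings | A | B
Non-Couplings | C | D

Dataset (Cao et al., 2003):
– 81.3% training set
– Same test set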
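With A, B, C and D defined as in the confusion table (A: predicted and true couplings, B: predicted couplings that are non-couplings, C: missed couplings, D: predicted and true non-couplings), the four metrics can be computed directly. The sketch below uses made-up counts; only the formulas come from the slide.

```python
def coupling_metrics(a, b, c, d):
    """Return (precision, recall, accuracy, f1) from confusion-table counts."""
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Made-up counts for illustration.
precision, recall, accuracy, f1 = coupling_metrics(a=8, b=2, c=2, d=8)

# Consistency check against a reported row (precision 0.698, recall 0.952).
hybrid_f1 = f1_from(0.698, 0.952)
```

Plugging the reported hybrid precision and recall into `f1_from` gives about 0.805, which agrees with the F1 listed for the hybrid method.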
Results on Cao et al. Dataset
Method | N-gram Threshold | Prec | Recall | F1
Hybrid | 0.66 | 0.698 | 0.952 | 0.805
N-gram | 0.34 | 0.658 | 0.794 | 0.719
Cao et al. | - | 0.577 | 0.889 | 0.700

Method | Max | Prec | Recall | F1
Whole Seq Alignment | F1 | 0.779 | 0.841 | 0.809
Hybrid | F1 | 0.775 | 0.873 | 0.821
Whole Seq Alignment | Precision | 0.793 | 0.730 | 0.760
Hybrid | Precision | 0.803 | 0.778 | 0.790

• Suggests n-grams contain information not found in alignment
• Hybrid method outperformed Cao et al. in precision, recall and F1
• Suggests alignment contains information not found in n-grams
Feature Selection of N-grams
• Pre-processing step to remove noisy or redundant features that may confuse classifier
• Many feature selection algorithms available
• Chi-square was used because of success in GPCR subfamily classification
[Flowchart: test sequence (MGNASNDSQSEDCETRQWLPPGESPAI …) → counts of all n-grams → chi-square feature selection → selected n-gram counts → K-NN classifier → Pr(coupling to family C) ≥ threshold? Yes: predict coupling to family C. No: predict no coupling to family C.]
IC Domain Combination Analysis
IC domains | Prec | Rec | F1 | Acc
1 | 0.782 | 0.703 | 0.739 | 0.796
2 | 0.820 | 0.799 | 0.808 | 0.845
3 | 0.661 | 0.721 | 0.682 | 0.730
4 | 0.632 | 0.755 | 0.670 | 0.694
1, 2 | 0.820 | 0.805 | 0.811 | 0.847
1, 3 | 0.799 | 0.765 | 0.780 | 0.825
1, 4 | 0.780 | 0.755 | 0.765 | 0.807
2, 3 | 0.837 | 0.825 | 0.828 | 0.861
2, 4 | 0.828 | 0.816 | 0.821 | 0.853
3, 4 | 0.773 | 0.807 | 0.788 | 0.821
1, 2, 3 | 0.822 | 0.814 | 0.816 | 0.850
1, 2, 4 | 0.807 | 0.809 | 0.807 | 0.843
1, 3, 4 | 0.792 | 0.807 | 0.797 | 0.832
2, 3, 4 | 0.839 | 0.820 | 0.828 | 0.861
1, 2, 3, 4 | 0.824 | 0.813 | 0.817 | 0.853
• Of the 4 domains, 2nd domain yielded best F1 followed by 1st, 3rd and 4th domains
• Most information in IC1 already found in IC2
Tertiary Protein Fold Prediction
• Protein function strongly modulated by structure
• Predicting folds, domains and other regular structures requires modeling local and long-distance interactions in low-homology sequences
– Long distance: not addressed by n-grams, HMMs, etc.
– Low homology: not addressed by BLAST algorithms
• We focus on minimal mathematical structural modeling
– Segmented conditional random fields
– Layered graphical models
– Fully trainable to recognize new instances of structures
• First acid test: β-helix super-secondary structural prediction (with data and guidance from Prof. J. King at MIT)
Protein Structure Determination
• Lab experiments: time, cost, uncertainty, …
– X-ray crystallography (months to crystallize, uncertain outcome); Nobel Prize, Kendrew & Perutz, 1962
– NMR spectroscopy (only works for small proteins or domains); Nobel Prize, Kurt Wüthrich, 2002
• The gap between sequence and structure necessitates computational methods of protein structure determination
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
[Figures: PDB structures 1MBN and 1BUS]
Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Resolving protein structures experimentally in vitro by crystallography is very expensive; NMR can only resolve very small proteins
• The gap between the known protein sequences and structures:
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
– Therefore we need to predict structures in silico
Predicting Tertiary Folds
• Super-secondary structures
– Common protein domains and scaffolding patterns such as regular combinations of β-sheets and/or α-helices
• Our task
– Given a protein sequence, predict super-secondary structures and their components (e.g. β-helices and the location of each rung therein)
• Examples:
– Parallel right-handed β-helix
– Leucine-rich repeats
Parallel Right-handed β-Helix
• Structure
– A regular super-secondary structure with an elongated helix whose successive rungs are composed of β-strands
– Highly conserved T2 turn
• Computational importance
– Long-range interactions
– Repeat patterns
• Biological importance
– Functions such as bacterial infection of plants, binding the O-antigen, etc.
Conditional Random Fields
• Hidden Markov model (HMM) [Rabiner, 1989]
\[ P(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} P(x_i \mid y_i)\, P(y_i \mid y_{i-1}) \]
• Conditional random fields (CRFs) [Lafferty et al., 2001]
\[ P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_0} \exp\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, i, y_{i-1}, y_i) \Big) \]
– Model conditional probability directly (discriminative models, directly optimizable)
– Allow arbitrary dependencies in observation
– Adaptive to different loss functions and regularizers
– Promising results in multiple applications
– But, need to scale up (computationally) and extend to long-distance dependencies
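A linear-chain CRF scores a label sequence y by summing weighted features f_k(x, i, y_{i-1}, y_i) over positions and normalises by a partition function Z. The sketch below computes log Z with the forward recursion and checks that the resulting probabilities sum to one. The two labels, the two toy features and the weights are invented for illustration and are not the features used for protein structure.

```python
import math
from itertools import product

LABELS = ("H", "C")  # hypothetical label set

def features(x, i, y_prev, y):
    """Toy feature vector (K = 2) for position i; y_prev is None at i = 0."""
    f_emit = 1.0 if (x[i] == "L") == (y == "H") else 0.0  # residue/label agreement
    f_stay = 1.0 if y_prev == y else 0.0                  # label persistence
    return (f_emit, f_stay)

def local_score(x, i, y_prev, y, lam):
    """sum_k lambda_k * f_k(x, i, y_prev, y)."""
    return sum(l * f for l, f in zip(lam, features(x, i, y_prev, y)))

def sequence_score(x, y, lam):
    """Unnormalised log-score of a full label sequence."""
    return sum(local_score(x, i, y[i - 1] if i > 0 else None, y[i], lam)
               for i in range(len(x)))

def log_partition(x, lam):
    """Forward algorithm: log Z in O(N * |labels|^2) instead of enumerating all y."""
    alpha = {y: local_score(x, 0, None, y, lam) for y in LABELS}
    for i in range(1, len(x)):
        alpha = {
            y: math.log(sum(math.exp(alpha[yp] + local_score(x, i, yp, y, lam))
                            for yp in LABELS))
            for y in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))

def probability(x, y, lam):
    """P(y | x) = exp(score(x, y)) / Z."""
    return math.exp(sequence_score(x, y, lam) - log_partition(x, lam))

x = "LLAG"
lam = (1.5, 0.5)
all_labelings = list(product(LABELS, repeat=len(x)))
total = sum(probability(x, y, lam) for y in all_labelings)
best = max(all_labelings, key=lambda y: probability(x, y, lam))
```

Brute-force enumeration is used here only to verify the forward recursion; on real sequences the enumeration is infeasible, which is one reason the scaling concern on the slide matters.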
Our Solution: Conditional Graphical Models
• Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
• Feature definition
– Node feature
\[ f_k(w_i, \mathbf{x}) = f'_k(\mathbf{x}, p_i, q_i)\, I(s_i = s',\; q_i - p_i + 1 = d') \]
– Local interaction feature
\[ f_k(w_{i-1}, w_i, \mathbf{x}) = I(s_{i-1} = s,\; s_i = s',\; p_i = q_{i-1} + 1) \]
– Long-range interaction feature
\[ f_k(w_i, w_j, \mathbf{x}) = g'_k(\mathbf{x}, p_i, q_i, p_j, q_j)\, I(s_i = s,\; s_j = s') \]
[Figure labels: long-range dependency, local dependency]
Linked Segmentation CRF
• Node: secondary structure elements and/or simple fold
• Edges: Local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: conditional probability of y given x is defined as
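\[ P(\mathbf{y}_1, \ldots, \mathbf{y}_R \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big( \sum_k \lambda_k f_k(\mathbf{x}_i, y_{i,j}) \Big) \prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big( \sum_l \lambda_l\, g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b}) \Big) \]
[Figure label: joint labels]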
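Linked Segmentation CRF (II)
• Classification:
\[ \mathbf{y}^* = \arg\max_{\mathbf{y}} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}_c, \mathbf{y}_c) \]
• Training: learn the model parameters λ
– Minimizing regularized negative log loss
\[ L = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}_c, \mathbf{y}_c) - \log Z + \text{(regularization term)} \]
– Iterative search algorithms by seeking the direction whose empirical values agree with the expectation
\[ \frac{\partial L}{\partial \lambda_k} = \sum_{c \in C_G} \Big( f_k(\mathbf{x}_c, \mathbf{y}_c) - E_{p(\mathbf{y} \mid \mathbf{x})}\big[ f_k(\mathbf{x}_c, \mathbf{y}_c) \big] \Big) + \text{(regularization gradient)} = 0 \]
• Complex graphs result in huge computational complexity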
Model Roadmap
Conditional random fields [Lafferty et al., 2001]
Segmentation CRFs (Liu & Carbonell 2005)
Chain graph model (Liu, Xing & Carbonell, 2006)
Linked segmentation CRFs (Liu & Carbonell, 2007)
Long-range
Trade-off between local and long-range
Inter-chain long-range
Semi-Markov CRFs [Sarawagi & Cohen, 2005]
Beyond Markov dependencies
Generalized discriminative graphical models
Tertiary Fold Recognition: β-Helix fold
• Histogram and ranks for known β-helices against PDB-minus dataset
The chain graph model reduces the actual running time of the SCRF model by a factor of about 50
Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation
Discovery of New Potential β-helices
• Run structural predictor seeking potential β-helices from Uniprot (structurally unresolved) databases
– Full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
• Verification on 3 proteins with later experimentally resolved structures from different organisms
– 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
– 1PXZ: The Major Allergen From Cedar Pollen
– GP14 of Shigella bacteriophage as a β-helix protein
– Not a single false positive!
Predicting Quaternary Folds
• Triple beta-spirals [van Raaij et al. Nature 1999]
– Virus fibers in adenovirus, reovirus and PRD1
• Double barrel trimer [Benson et al, 2004]
– Coat protein of adenovirus, PRD1, STIV, PBCV
Features for Protein Fold Recognition
Experiment Results: Quaternary Fold Recognition
[Figure panels: double barrel-trimers; triple beta-spirals]
Experiment Results: Alignment Prediction
Triple beta-spirals
Four states: B1, B2, T1 and T2
Correct Alignment:
B1: i – o B2: a - h
Predicted Alignment
B1 B2
Experiment Results: Discovering New Membership Proteins
• Predicted membership proteins of triple beta-spirals can be accessed at
http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
• Membership proteins of double barrel-trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions
Conclusions & Challenges for Protein Structure/Function Prediction
• Methods from modern Machine Learning and Language Technologies really work in Computational Proteomics
– Family/subfamily/sub-subfamily predictions
– Protein-protein interactions (GPCRs ↔ G-proteins)
– Accurate tertiary & quaternary fold structural predictions
• Next generation of model sophistication…
• Addressing new challenges
– Structure → Function: structural predictions combined with binding-site & specificity analysis
– Predictive inversion: Function → Structure → Sequence for new hyper-specific drug design (anti-viral, oncology)
Proteins and Interactions
• Every function in the living cell depends on proteins
• Proteins are made of a linear sequence of amino acids and folded into unique 3D structures
• Proteins can bind to other proteins physically
– Enables them to carry out diverse cellular functions
Protein-Protein Interaction (PPI) Network
• PPIs play key roles in many biological systems
• A complete PPI network (naturally a graph)
– Critical for analyzing protein functions & understanding the cell
– Essential for disease studies & drug discovery
PPI Biological Experiments
• Small-scale PPI experiments
– One protein or several proteins at a time
– Small amount of available data
– Expensive and slow lab process
• Large-scale PPI experiments
– Hundreds / thousands of proteins at a time
– Noisy and incomplete data
– Little overlap among different sets
→ A large portion of the PPIs is still missing or noisy!
Learning of PPI Networks
• Goal I: Pairwise PPI (links of the PPI graph)
– Most pairwise protein-protein interactions have not been identified, or are noisy
– → Missing link prediction!
• Goal II: “Complex” (important groups)
– Proteins often interact stably and perform functions together as one unit (“complex”)
– Most complexes have not been discovered
– → Important group detection!
[Figure: PPI network; pairwise interactions via link prediction; protein complexes via group detection]
Goal I: Missing Link Prediction
Pairwise Interactions
PPI Network
Related Biological Data
• Overall, four categories:
– Direct high-throughput experimental data: Two-hybrid screens (Y2H) and mass spectrometry (MS)
– Indirect high throughput data: Gene expression, protein-DNA binding, etc.
– Functional annotation data: Gene ontology annotation, MIPS annotation, etc.
– Sequence based data sources: Domain information, gene fusion, homology based PPIs, etc.
[Figure labels: direct evidence, indirect evidence]
Utilize implicit evidence and available direct experimental results together
Related Data Evidence
[Figure: two kinds of evidence]
– Relational evidence between proteins (e.g. synthetic lethal), with relation expanding
– Attribute evidence of each protein: expression, structure, sequence, annotation, …
Feature Vector for (Pairwise) Pairs
– For data representing protein-protein pairs, use directly
– For data representing single protein (gene), calculate the (biologically meaningful) similarity between two proteins for each evidence
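[Figure: Proteins A and B each have attribute data (e.g. Sequence: mtaaqaagee…, GeneExp: 233.94, 162.85, ... for one; Sequence: mrpsgtagaa…, GeneExp: 109.4, 975.3, ... for the other). Pair A-B is represented as a feature vector fea1, fea2, fea3, … built from pair-level evidence (e.g. synthetic lethal: 1) and per-protein similarities (sequence similarity, gene expression correlation coefficient, …)]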
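Building a pair feature vector means using pair-level evidence directly and turning per-protein attributes into a similarity between the two proteins. The sketch below uses Pearson correlation for gene expression and a k-mer Jaccard overlap as a simple stand-in for sequence similarity. The stand-in, the dictionary layout, and the toy values are assumptions for illustration; the slides do not say how each similarity was computed.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u and sd_v else 0.0

def kmer_jaccard(seq_a, seq_b, k=2):
    """Stand-in sequence similarity: Jaccard overlap of k-mer sets (assumed)."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0

def pair_features(protein_a, protein_b, synthetic_lethal):
    """Feature vector for the pair A-B: [sequence sim, expression corr, relational flag]."""
    return [
        kmer_jaccard(protein_a["sequence"], protein_b["sequence"]),
        pearson(protein_a["gene_exp"], protein_b["gene_exp"]),
        1.0 if synthetic_lethal else 0.0,   # pair-level evidence is used directly
    ]

# Toy proteins (made-up values).
protein_a = {"sequence": "mtaaqaagee", "gene_exp": [1.0, 2.0, 3.0, 4.0]}
protein_b = {"sequence": "mrpsgtagaa", "gene_exp": [2.0, 4.0, 6.0, 8.0]}
vector = pair_features(protein_a, protein_b, synthetic_lethal=True)
```

Each additional evidence source would add one more entry to the vector, with missing values handled separately by the classifier.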
Problem Setting
• For each protein-protein pair:
– Target function: interacts or not?
– Treat as a binary classification task
• Feature set
– Features are heterogeneous
– Most features are noisy
– Most features have missing values
• Reference set:
– Small-scale PPI set as positive training (hundreds to thousands)
– No negative set (non-interacting pairs) available
– Highly skewed class distribution
» Many more non-interacting pairs than interacting pairs
» Estimated: 1 out of ~600 in yeast; 1 out of ~1000 in human
PPI Inference via ML Methods
• Jansen, R., et al., Science 2003
– Bayes Classifier
• Lee, I., et al., Science 2004
– Sum of Log-likelihood Ratio
• Zhang, L., et al., BMC Bioinformatics 2004
– Decision Tree
• Bader, J., et al., Nature Biotech 2004
– Logistic Regression
• Ben-Hur, A., et al., ISMB 2005
– Kernel Method
• Rhodes, D.R., et al., Nature Biotech 2005
– Naïve Bayes
Present focus: Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006
Predicting Pairwise PPIs
– Prediction target (three types)
» Physical interaction
» Co-complex relationship
» Pathway co-membership inference
– Feature encoding
» (1) “detailed” style, and (2) “summary” style
» Feature importance varies
– Classification methods
» Random Forest & Support Vector Machine
Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006
Details in the paper
Human Membrane Receptors
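[Figure: human membrane receptors of Type I and Type II (GPCR), spanning the extracellular, transmembrane and cytoplasmic regions, alongside other membrane proteins; ligands bind outside the cell and signal transduction cascades follow inside]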
PPI Predictions for Human Membrane Receptors
• A combined approach
– Binary classification
– Global graph analysis
– Biological feedback & validation
Y. Qi, et al 2008
Binary Classification
• Random Forest Classifier
– A collection of independent decision trees (ensemble classifier)
– Each tree is grown on a bootstrap sample of the training set
– Within each tree’s training, for each node, the split is chosen from a randomly sampled subset of the attributes
• Robust to noisy features
• Can handle different types of features
[Figure: example decision trees splitting on features such as GeneExpress, TAP, Y2H, GOProcess, HMS-PCI, GeneOccur, GOLocalization, ProteinExpress, Domain and SynExpress]
Classifier Comparison
• Compare classifiers
• Receptor PPI (sub-network) to general human PPI prediction
(27 features extracted from 8 different data sources, modified with biological feedback)
Global Graph Analysis
• Degree distribution / Hub analysis / Disease checking
• Graph modules analysis (from bi-clustering study)
• Protein-family based graph patterns (receptors / receptors subclasses / ligands / etc )
Global Graph Analysis
Network analysis reveals interesting features of the human membrane receptor PPI graph
For instance:
• Two types of receptors (GPCR and non-GPCR (Type I))
• GPCRs less densely connected than non-GPCRs (Green: non-GPCR receptors; blue: GPCR)
Experimental Validation
• Five predictions were chosen for experiments and three were verified
– EGFR with HCK (pull-down assay)
– EGFR with Dynamin-2 (pull-down assay)
– RHO with CXCL11 (functional assays, fluorescence spectroscopy, docking)
– Experiments @ U.Pitt School of Medicine
Y. Qi, et al 2008
Details in the paper
Motivation
• Current situation of PPI task
– Only a small positive (interacting) set available
– No negative (not interacting) set available
– Highly skewed class distribution
» Many more non-interacting pairs than interacting pairs
– The cost for misclassifying an interacting pair is higher than for a non-interacting pair
– Accuracy measure is not appropriate here
• Try to handle this task with ranking
– Rank the known positive pairs as high as possible
– At the same time, have the ability to rank the unknown positive pairs as high as possible
Split Features into Multi-View
• Overall, four feature groups:
– P (Direct): Direct high-throughput experimental data: Two-hybrid screens (Y2H) and mass spectrometry (MS)
– E (Genomic): Indirect high-throughput data: Gene expression, protein-DNA binding, etc.
– F (Functional): Functional annotation data: Gene ontology annotation, MIPS annotation, etc.
– S (Sequence): Sequence-based data sources: Domain information, gene fusion, homology-based PPIs, etc.
Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, BMC Bioinformatics 2007
Mixture of Feature Experts (MFE)
• Make protein interaction prediction by
– Weighted voting from the four roughly homogeneous feature categories
– Treat each feature group as a prediction expert
– The weights are also dependent on the input example
• Hidden variable, M modulates the choice of expert
[Figure: experts P, E, F and S feed the prediction “Interact?”, gated by the hidden variable M]
\[ p(Y \mid X) = \sum_{M} p(Y \mid X, M)\, p(M \mid X) \]
Mixture of Four Feature Experts
• Parameters are trained using EM
• Experts and root gate use logistic regression (ridge estimator)
– Expert P: direct PPI high-throughput experimental data
– Expert E: indirect high-throughput experimental data
– Expert F: function annotation of proteins
– Expert S: sequence- or structure-based evidence
\[ p(y^{(n)} \mid x^{(n)}) = \sum_{i=1}^{4} p(m_i^{(n)} = 1 \mid x^{(n)}, v)\; p(y^{(n)} \mid x^{(n)}, m_i^{(n)} = 1, w_i) \]
with parameters (w_i, v)
Mixture of Four Feature Experts
• Handling missing value
– Add additional feature column for each feature having low feature coverage
– MFE uses present / absent information when weighting different feature groups
• The posterior weight for expert i in predicting pair n
– The weight can be used to indicate the importance of that feature view ( expert ) for this specific pair
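\[ h_i^{(n)} = P(m_i^{(n)} = 1 \mid y^{(n)}, x^{(n)}, v^t, w^t) = \frac{P(m_i^{(n)} = 1 \mid x^{(n)}, v^t)\; p(y^{(n)} \mid x^{(n)}, m_i^{(n)} = 1, w_i^t)}{\sum_{j=1}^{4} P(m_j^{(n)} = 1 \mid x^{(n)}, v^t)\; p(y^{(n)} \mid x^{(n)}, m_j^{(n)} = 1, w_j^t)} \]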
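In a mixture of feature experts, the prediction for a pair is the gate-weighted sum of the experts' outputs, and the posterior weight of expert i for a labelled pair is its gate weight times its likelihood, normalised over the four experts. The sketch below assumes logistic-regression experts and a softmax (multinomial logistic) gate; the softmax form, the function names and the numbers are illustrative assumptions, and EM training is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Numerically stable softmax over the gate scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mfe_predict(x_views, x_gate, expert_w, gate_v):
    """Return (p(y=1|x), gate weights, expert outputs).

    x_views[i]: features seen by expert i (its own feature group).
    x_gate: features seen by the root gate.
    """
    gate = softmax([dot(v, x_gate) for v in gate_v])                    # p(m_i = 1 | x, v)
    expert = [sigmoid(dot(w, xv)) for w, xv in zip(expert_w, x_views)]  # p(y=1 | x, m_i=1, w_i)
    return sum(g * e for g, e in zip(gate, expert)), gate, expert

def posterior_weights(y, gate, expert):
    """h_i = gate_i * p(y | expert i) / sum_j gate_j * p(y | expert j)."""
    likelihood = [e if y == 1 else 1.0 - e for e in expert]
    joint = [g * l for g, l in zip(gate, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

# With all-zero parameters the gate is uniform and every expert outputs 0.5.
p_interact, gate, expert = mfe_predict(
    x_views=[[1.0]] * 4, x_gate=[1.0],
    expert_w=[[0.0]] * 4, gate_v=[[0.0]] * 4,
)
# Posterior weights for an interacting pair when the experts disagree.
h = posterior_weights(1, [0.25] * 4, [0.9, 0.5, 0.5, 0.1])
```

In the second call the first expert explains the positive label best, so it receives the largest posterior weight (0.45), which is how the weights indicate the most informative feature view for a given pair.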
Performance
• 162 features for yeast physical PPI prediction task
• Features extracted in “detail” encoding
• Under “detail” encoding, the ranking method is almost the same as RF (not shown)
Functional Expert Dominates
Figure: The frequency at which each of the four experts has maximum contribution among validated and predicted pairs
300 candidate protein pairs
51 predicted interactions
33 validated already
18 newly predicted
Protein Complex
• Proteins form associations with multiple protein binding partners stably (termed “complex”)
• Complex members interact with part of the group and work together as a unit
• Identification of these important sub-structures is essential to understand activities in the cell
Group detection within the PPI network
Identify Complex in PPI Graph
• PPI network as a weighted undirected graph
– Edge weights derived from supervised PPI predictions:
• Previous work
– Unsupervised graph clustering style
– All rely on the assumption that complexes correspond to the dense regions of the network
• Related facts
– Many other possible topological structures
– A small number of complexes available from reliable experiments
– Complexes also have functional /biological properties (like weight / size / …)
Possible topological structures
Edge weight color coded
• Make use of the small number of known complexes → supervised
• Model the possible topological structures → subgraph statistics
• Model the biological properties of complexes → subgraph features
Properties of Subgraph
• Subgraph properties as features in BN
– Various topological properties from graph
– Biological attributes of complexes
No. Sub-Graph Property
1 Vertex Size
2 Graph Density
3 Edge Weight Ave / Var
4 Node degree Ave / Max
5 Degree Correlation Ave / Max
6 Clustering Coefficient Ave / Max
7 Topological Coefficient Ave / Max
8 First Two Eigenvalues
9 Fraction of Edge Weight > Certain Cutoff
10 Complex Member Protein Size Ave / Max
11 Complex Member Protein Weight Ave / Max
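Several of the topological properties in the table can be computed directly from the weighted edges that fall inside a candidate subgraph. The sketch below covers vertex size, graph density, edge weight average and variance, and node degree average and maximum. The edge representation (a dictionary keyed by `frozenset` pairs) and the toy graph are assumptions; the remaining properties are not shown.

```python
def subgraph_properties(nodes, weighted_edges):
    """Compute a few subgraph features.

    weighted_edges: dict mapping frozenset({u, v}) -> edge weight for the whole graph.
    """
    nodes = set(nodes)
    # Keep only edges with both endpoints inside the candidate subgraph.
    inside = {edge: w for edge, w in weighted_edges.items() if edge <= nodes}
    n = len(nodes)
    degree = {v: 0 for v in nodes}
    for edge in inside:
        for v in edge:
            degree[v] += 1
    weights = list(inside.values())
    mean_w = sum(weights) / len(weights) if weights else 0.0
    var_w = sum((w - mean_w) ** 2 for w in weights) / len(weights) if weights else 0.0
    return {
        "vertex_size": n,
        "density": 2 * len(inside) / (n * (n - 1)) if n > 1 else 0.0,
        "edge_weight_ave": mean_w,
        "edge_weight_var": var_w,
        "degree_ave": sum(degree.values()) / n if n else 0.0,
        "degree_max": max(degree.values()) if degree else 0,
    }

# Toy graph (made-up weights): a-b, b-c inside the candidate; c-d leaves it.
edges = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.7,
    frozenset({"c", "d"}): 0.2,
}
props = subgraph_properties({"a", "b", "c"}, edges)
```

Continuous values like these would then be discretized before being used as features in the Bayesian network.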
Model Complex Probabilistically
• Bayesian Network (BN)
– C : If this subgraph is a complex (1) or not (0)
– N : Number of nodes in subgraph
– Xi : Properties of subgraph
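[Figure: Bayesian network over the nodes C, N and X_1, …, X_m]
\[ L = \log \frac{p(c = 1 \mid n, x_1, x_2, \ldots, x_m)}{p(c = 0 \mid n, x_1, x_2, \ldots, x_m)} \]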
Assume a probabilistic model (Bayesian Network) for representing complex sub-graphs
Model Complex Probabilistically
• BN parameters trained with MLE
– Trained from known complexes and randomly sampled non-complexes
– Discretize continuous features
– Bayesian Prior to smooth the multinomial parameters
• Evaluate candidate subgraphs with the log ratio score L
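\[ L = \log \frac{p(c = 1 \mid n, x_1, x_2, \ldots, x_m)}{p(c = 0 \mid n, x_1, x_2, \ldots, x_m)} = \log \frac{p(c = 1)\, p(n \mid c = 1) \prod_{k=1}^{m} p(x_k \mid n, c = 1)}{p(c = 0)\, p(n \mid c = 0) \prod_{k=1}^{m} p(x_k \mid n, c = 0)} \]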
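Under the factorisation p(c) p(n | c) ∏_k p(x_k | n, c), the log ratio score L is a difference of two sums of log probabilities. A minimal sketch of scoring one candidate subgraph from already-estimated, discretized probability tables follows. The table layout, the bin names and the probabilities are hypothetical; estimation with a smoothing prior is not shown.

```python
import math

def log_ratio_score(n_bin, x_bins, prior, p_n, p_x):
    """L = log [p(c=1) p(n|c=1) prod_k p(x_k|n,c=1)] - log [same with c=0].

    prior[c]: p(c)
    p_n[c][n_bin]: p(n | c) for the discretized subgraph size
    p_x[c][k][(n_bin, x_bin)]: p(x_k | n, c) for the k-th discretized property
    """
    log_joint = {}
    for c in (0, 1):
        total = math.log(prior[c]) + math.log(p_n[c][n_bin])
        for k, x_bin in enumerate(x_bins):
            total += math.log(p_x[c][k][(n_bin, x_bin)])
        log_joint[c] = total
    return log_joint[1] - log_joint[0]

# Toy tables (made-up numbers): one size bin, one property ("density") with two bins.
prior = {0: 0.5, 1: 0.5}
p_n = {0: {"small": 0.5}, 1: {"small": 0.5}}
p_x = {
    1: [{("small", "dense"): 0.8, ("small", "sparse"): 0.2}],
    0: [{("small", "dense"): 0.2, ("small", "sparse"): 0.8}],
}
score_dense = log_ratio_score("small", ["dense"], prior, p_n, p_x)
score_sparse = log_ratio_score("small", ["sparse"], prior, p_n, p_x)
```

A positive score means the candidate looks more like the known complexes than like the random non-complexes, so candidates can be ranked or thresholded on L.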
Experimental Setup
• Positive training data:
– Set1: MIPS Yeast complex catalog: a curated set of ~100 protein complexes
– Set2: TAP05 Yeast complex catalog: a reliable experimental set of ~130 complexes
– Complex size (nodes’ num.) follows a power law
• Negative training data
– Generate from randomly selected nodes in the graph
– Size distribution follows the same power law as the positive complexes
Evaluation
• Train-Test style (Set1 & Set2)
• Precision / Recall / F1 measures
• A cluster “detects” a complex if
\[ \frac{C}{A + C} > p \quad \& \quad \frac{C}{B + C} > p \]
– A: number of proteins only in cluster
– B: number of proteins only in complex
– C: number of proteins shared
• Overlapping threshold p set as 50%
[Figure: Venn diagram of the detected cluster and the known complex, with regions A, C and B]
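The detection criterion compares the shared proteins C against both the cluster size (A + C) and the complex size (B + C). A minimal sketch is below; the strict inequality is an assumption, since the comparison operator did not survive in the transcript, and the protein names are made up.

```python
def detects(cluster, known_complex, p=0.5):
    """True if the cluster overlaps the complex enough on both sides."""
    cluster, known_complex = set(cluster), set(known_complex)
    shared = len(cluster & known_complex)          # C
    only_cluster = len(cluster - known_complex)    # A
    only_complex = len(known_complex - cluster)    # B
    if shared == 0:
        return False
    return (shared / (only_cluster + shared) > p
            and shared / (only_complex + shared) > p)

# Two of three proteins shared on each side: 2/3 > 0.5, so the complex is detected.
hit = detects({"a", "b", "c"}, {"b", "c", "d"})
# Only one of four cluster proteins is shared: 1/4 <= 0.5, so it is not.
miss = detects({"a", "b", "c", "d"}, {"d", "e"})
```

Requiring both ratios to pass prevents a very large cluster from trivially "detecting" every complex it happens to contain.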
Performance Comparison
• On yeast predicted PPI graph (~2000 nodes)
• Compare to a popular complex detection package: MCODE (search for highly interconnected regions)
• Compare to local search relying on density evidence only
• Compare to local search with complex score from SVM (also supervised)
Methods | Precision | Recall | F1
Density | 0.180 | 0.462 | 0.253
MCODE | 0.219 | 0.075 | 0.111
SVM | 0.211 | 0.377 | 0.269
BN | 0.266 | 0.513 | 0.346
Learning PPI Networks
[Figure: overview of PPI network learning tasks (pairwise interactions, pathway, function implication, protein complex, domain/motif interactions, human PPI, HIV-human PPI) annotated with venues PSB 05, PROTEINS 06, BMC Bioinfo 07, CCR 08, ISMB 08 and Genome Biology 08, and with the status labels “Revise 08”, “Revise” and “Prepare”]
Inter species interactome
What are the interacting proteins between two organisms?
HIV-1 host protein interactions
HIV-1 depends on the cellular machinery in every aspect of its life cycle.
[Figure: HIV-1 life cycle stages: fusion, reverse transcription, transcription, budding, maturation]
Peterlin and Trono, Nature Reviews Immunology 2003.
HIV-1 host protein interactions
Human protein
HIV protein
FIN
Questions ?