__________________________________________________________________________________________________ 10/18/2013 GCBA 815
Tools and Algorithms in Bioinformatics GCBA815, Fall 2013
Week 6: Interaction Network Analysis
(Cytoscape)
Babu Guda Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
Background:
• Gene products (RNA/proteins) rarely work alone; most often they interact with other gene products to accomplish a task
• Most of the cellular processes are regulated by protein-protein or DNA/RNA-protein complexes
• Impaired protein interactions can be causative factors for diseases or metabolic abnormalities
• Guilt by association: The unknown function of a protein can be inferred based on the proteins it interacts with, if those proteins have a known function
• The field of protein-protein interactions (PPIs) is advancing rapidly on many fronts of biomedical research.
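The guilt-by-association idea above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the protein names, the annotations and the majority-vote rule are hypothetical, and real methods usually weight the evidence for each interaction.

```python
from collections import Counter

def predict_function(protein, interactions, known_functions):
    """Guess a function by majority vote among annotated interaction partners."""
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    # Only partners with a known function get a vote.
    votes = Counter(known_functions[p] for p in partners if p in known_functions)
    if not votes:
        return None  # no annotated partners, so nothing can be inferred
    return votes.most_common(1)[0][0]

# Hypothetical toy data: X is the uncharacterized protein.
interactions = [("P1", "X"), ("P2", "X"), ("X", "P3"), ("P3", "P4")]
known_functions = {"P1": "DNA repair", "P2": "DNA repair", "P3": "transport"}

prediction = predict_function("X", interactions, known_functions)
print(prediction)
```

Two of the three annotated partners of X are DNA repair proteins, so that annotation wins the vote.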
Yeast interactome (Wuchty et al., 2003; Barabasi and Oltvai, 2004)
Red: lethal
Green: non-lethal
Orange: slow growth
Yellow: Unknown
Human interactome (Stelzl et al., 2005; reproduced from Cell)
• Screened 25 million candidate PPIs
• Found 3186 PPIs among 1705 proteins
• Mapped 195 disease proteins to new partners
• Provided functional annotation for 342 uncharacterized human proteins
Light blue: known proteins
Orange: Disease proteins
Yellow: Uncharacterized
Types of protein-protein interactions
• Permanent and transient interactions
• Direct (physical) and indirect (sharing a common partner) interactions
• Homo and hetero interactions
• Interchain and intrachain interactions
• Binary interactions and complexes
• Spoke and matrix models to expand binary interactions from complexes
• Interologs (PPIs shared across species)
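The spoke and matrix models listed above differ in which binary pairs they derive from a purified complex. A minimal sketch, assuming a complex is given as one bait plus its co-purified preys (the names A to D are hypothetical):

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Spoke model: the bait is paired with each prey, preys are not paired with each other."""
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}

def matrix_model(members):
    """Matrix model: every member of the complex is paired with every other member."""
    return {tuple(sorted(pair)) for pair in combinations(set(members), 2)}

# Hypothetical complex: bait A pulled down with preys B, C and D.
bait, preys = "A", ["B", "C", "D"]
spoke = spoke_model(bait, preys)
matrix = matrix_model([bait] + preys)
print(len(spoke), len(matrix))
```

For a complex of n members the spoke model yields n - 1 pairs and the matrix model n(n - 1)/2, so the matrix expansion tends to include more pairs that are not direct physical contacts.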
Experimental methods for identifying PPIs
Adapted from PLoS Computational Biology, Shoemaker and Panchenko, 2007, 3:E42
Computational methods for predicting PPIs
Adapted from PLoS Computational Biology, Shoemaker and Panchenko, 2007, 3:E43
PPI databases and their characteristics
Interaction database | URL | Comments on the source and type of data covered
BIND | http://www.bind.ca/Action | All major model species covered.
BioGRID | http://www.thebiogrid.org | Mixture of in vivo, in vitro and Y2H interactions from different sources. Major species covered are yeast, Drosophila, C. elegans and human.
DIP | http://dip.doe-mbi.ucla.edu/ | Mostly Y2H studies, all major model species covered.
HPRD | http://www.hprd.org | Only human, manually curated from the literature.
IntAct | http://www.ebi.ac.uk/intact | Mainly literature-curated. All major model species covered.
MINT | http://mint.bio.uniroma2.it/mint | Both experimental and literature-based. Major species covered are yeast, Drosophila and human.
OPHID | http://ophid.utoronto.ca/ophid | Only human (experimental and predicted).
PRISM | http://gordion.hpc.eng.ku.edu.tr/prism | Predicted interactions based on interacting surfaces in X-ray crystal structures.
STRING | http://string.embl.de | Mostly predicted interactions based on multiple criteria.
Domain-Domain Interactions (DDIs)
PPIs → DDIs
PPIs
Protein domain databases and InterPro
• PROSITE: A database of protein profiles and patterns
• PRODOM: PROtein DOMain database, built from UniProt
• PRINTS: A compendium of protein fingerprints
• PFAM: Protein families database of alignments and HMMs
• TIGRfams: Protein families based on HMMs
• SMART: Simple Modular Architecture Research Tool
• BLOCKS: Blocks WWW server, obtained from PROSITE
• PANTHER: Protein Analysis Through Evolutionary Relationships
• CATH: Class, Architecture, Topology and Homologous superfamily
• SCOP: Structural Classification of Proteins
• Superfamily: Structural and functional protein annotation
• Gene3D: Domain architecture classification
• INTERPRO: Integrated resource of protein domains and functional sites
Domain architecture of the human EGF protein family (Pal and Guda, 2006, BMC Evolutionary Biology 6:91)
Functional architecture of EGF Receptor protein
Significance of studying DDIs
• Protein-protein interaction (PPI) data is available as binary data, i.e., an interaction is either ‘found’ or ‘not found’.
• About 70% of eukaryotic proteins are multi-domain proteins. In these cases, it is difficult to know which domains actually participate in each interaction.
• Studying interactions at the domain level is vital for understanding the functional significance of PPIs.
• Experimental determination of all DDIs is tedious, so computational methods can be used to infer DDIs from PPIs and thereby complement experimental investigations.
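To see why a binary PPI is ambiguous at the domain level, the sketch below enumerates every domain pair that could explain one interaction. The protein names and their domain assignments are hypothetical; in practice they would come from a domain-mapping step such as an InterPro lookup.

```python
from itertools import product

def candidate_domain_pairs(domains_a, domains_b):
    """All unordered domain pairs that could mediate an interaction between two proteins."""
    return {tuple(sorted(pair)) for pair in product(domains_a, domains_b)}

# Hypothetical domain architectures for two interacting proteins.
protein_domains = {
    "ProtA": ["SH2", "SH3", "Kinase"],
    "ProtB": ["SH3", "PH"],
}

pairs = candidate_domain_pairs(protein_domains["ProtA"], protein_domains["ProtB"])
for pair in sorted(pairs):
    print(pair)
```

One observed PPI between a three-domain and a two-domain protein already gives six candidate DDIs, which is why additional scoring evidence is needed to rank them.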
GR Riggs et al, 2003, EMBO 22:1158
Flow diagram showing the steps involved in method development (Guda et al., 2009, PLoS ONE, 4:e5096)
STEPS
• Datasets: positive and negative PPIs
• Domain mapping
• Scoring features and scoring function
• Testing and validation
Experimental PPI datasets (Pint) from 5 source databases
Interaction database | Number of PPIs | Comments on the source and type of data covered
BIND | 80,378 | All major model species covered.
DIP | 53,778 | Mostly Y2H studies, all major model species covered.
HPRD | 34,367 | Only human, manually curated from the literature.
IntAct | 125,792 | Mainly literature-curated. All major model species covered.
MINT | 115,383 | Both experimental and literature-based. Major species covered are yeast, Drosophila and human.
Combined unique set | 209,165 | A non-redundant set of PPIs corresponding to 70,769 unique proteins was obtained by combining the above 5 datasets.
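Combining several databases into a non-redundant set mostly comes down to treating A-B and B-A as the same interaction. A minimal sketch with hypothetical identifiers follows; it assumes all sources already use the same protein identifiers, which in practice requires an ID-mapping step first.

```python
def normalize(pair):
    """Order the two identifiers so that (A, B) and (B, A) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

def merge_ppi_sets(*datasets):
    """Union of all datasets with direction-independent duplicates removed."""
    merged = set()
    for dataset in datasets:
        merged.update(normalize(pair) for pair in dataset)
    return merged

# Hypothetical toy datasets standing in for three source databases.
db1 = [("P1", "P2"), ("P2", "P3")]
db2 = [("P2", "P1"), ("P3", "P4")]
db3 = [("P4", "P3"), ("P5", "P5")]  # includes a homo-interaction

combined = merge_ppi_sets(db1, db2, db3)
proteins = {protein for pair in combined for protein in pair}
print(len(combined), len(proteins))
```

The six input records collapse to four unique interactions among five proteins, the same kind of reduction as in the table above, where the five source counts shrink to 209,165 unique PPIs.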
Domain Mapping
• Domain definitions were obtained from the InterPro database, which integrates 10 distinct domain databases such as Pfam, PROSITE, SMART and Superfamily. Out of 15,064 domains in the InterPro database, 10,389 (~70%) were used in this study.
Positive DDI dataset for validation:
• About 4000 known DDIs were used from the iPfam database.
• iPfam was created based on domain-domain contacts in solved protein structures and complexes. This dataset has been used extensively as a ‘gold standard’ for validating computational prediction methods.
Five orthogonal scoring features used in this study
1. Probabilistic – Ratio of expected frequency of occurrence
2. Evolutionary – Co-occurrence in Rosetta stone proteins
3. Evidence-based – Observed in multiple species
4. Spatial – Co-localized in the same subcellular location
5. Functional – Semantic similarity of GO annotation
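\[
\mathrm{FinalScore}(d_{ij}) = \sum_{k=1}^{5} \mathrm{Score}(S_k)_{ij}
\]
Here d_ij denotes the domain pair (i, j) and S_k the k-th of the five scoring features.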
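The final score is a plain sum over the five features, which can be sketched as below. The feature keys and the numeric values are hypothetical, and the slides do not say how each feature is scaled, so this only illustrates the summation itself.

```python
# Hypothetical short names for the five scoring features listed above.
FEATURES = ("probabilistic", "evolutionary", "evidence", "spatial", "functional")

def final_score(feature_scores):
    """Sum the five per-feature scores for one domain pair."""
    # A missing feature raises KeyError rather than silently counting as zero.
    return sum(feature_scores[name] for name in FEATURES)

# Hypothetical per-feature scores for a single domain pair d_ij.
scores = {
    "probabilistic": 0.5,
    "evolutionary": 0.25,
    "evidence": 1.0,
    "spatial": 0.0,
    "functional": 0.75,
}

total = final_score(scores)
print(total)
```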
Validation using positive and negative datasets
Validation of the method using ROC curves
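As a reminder of what an ROC curve plots, the sketch below sweeps a score threshold over a labelled set and records the false positive rate (FPR) and true positive rate (TPR) at each step. The scores and labels are hypothetical, and the code assumes both positive and negative examples are present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the score threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predictions: label 1 = known DDI, label 0 = negative example.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Reading off the TPR at a fixed FPR such as 1%, 5% or 10% is a common way to report a single operating point of the curve.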
Comparison of predictive performance against existing methods
Domain-Domain Interaction Network of Breast Cancer Proteins
BRCT-1% FPR
BRCT-5% FPR
BRCT-10% FPR
Using Graph theory to study Biological Interaction Networks
• Graphs are data structures that provide the framework to represent biological networks
• Nodes or vertices (singular: vertex) are the building blocks of graphs
• An edge is the connection between two nodes
• A leaf node is a terminal node in a graph that is connected at only one end
• Degree is a node attribute that describes how many connections a node has to other nodes
• Both nodes and edges can have weights that show their relative importance in a graph; used for quantitative modeling of networks
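These definitions map directly onto a small adjacency-set representation. The sketch below is for an undirected, unweighted graph with hypothetical node names; it is meant only to make the terms concrete.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected graph stored as node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def degree(graph, node):
    """Number of distinct neighbours of a node (0 if the node is absent)."""
    return len(graph.get(node, ()))

def leaf_nodes(graph):
    """Nodes connected at only one end, i.e. nodes of degree 1."""
    return sorted(node for node in graph if len(graph[node]) == 1)

# Hypothetical network with four nodes and four edges.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]
graph = build_graph(edges)

print(degree(graph, "A"))
print(leaf_nodes(graph))
```

Node A has degree 3, and B is the only leaf because it connects to the rest of the graph through a single edge.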
Basic Network terminology
• Directed and undirected graphs
• Cyclical and linear graphs
• Complete and incomplete graphs
• Hub nodes
• Subgraphs
• Graph centrality
• Shortest path
• Graph density
• Power law distribution
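Three of these terms (shortest path, graph density and hub nodes) are easy to compute by hand on a small undirected graph. The sketch below uses a hypothetical five-node network; the degree cutoff for calling a node a hub is an arbitrary choice, not a standard value.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: a shortest path in an unweighted graph, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

def density(graph):
    """Fraction of all possible undirected edges that are present."""
    n = len(graph)
    if n < 2:
        return 0.0
    m = sum(len(neighbors) for neighbors in graph.values()) / 2
    return 2 * m / (n * (n - 1))

def hubs(graph, min_degree):
    """Nodes whose degree reaches a chosen cutoff."""
    return sorted(node for node, nb in graph.items() if len(nb) >= min_degree)

# Hypothetical undirected network: node -> set of neighbours.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}

print(shortest_path(graph, "B", "E"))
print(density(graph))
print(hubs(graph, min_degree=3))
```

A complete graph has density 1; this one has 5 of the 10 possible edges, so its density is 0.5.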
Cytoscape
• http://www.cytoscape.org
• Interaction network visualization and analysis software, first published in 2003 by Trey Ideker’s group
• Open-source tool with active developer support
• Cytoscape version 2.8
• Cytoscape version 3.0 is newly released with new features
• Available for all platforms (Mac, PC, Linux)
• Contains an extensive collection of plugins to analyze a variety of datasets from biology, social sciences and the semantic web
• Integrated with other tools such as the R package
What can you do with Cytoscape?
• Network Visualization
  • Load molecular and genetic interaction datasets
    • Protein-protein interaction data
    • Protein-DNA interactions
    • KEGG pathways
    • Expression datasets
• Network Analysis (mostly using plugins)
  • Analyze networks
    • Network properties (degree distribution etc.)
    • Annotation-based filtering (subcellular mapping)
    • Node and edge attribute analysis
How to use Cytoscape?
• Register and download the software from Cytoscape
  • http://www.cytoscape.org
• Install on your local computer (PC/Mac/Linux)
• Locate the folder (Program Files) where the files are stored
• Use example datasets
  • .sif files are network input files
    • A pp B or A pd C
  • Node or edge attribute files
  • .cys files are Cytoscape session files (contain info on network, attribute and session option data)
  • Other formats: Text, Excel, GML, XGMML, SBML, BioPAX, PSI-MI files
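A .sif line such as "A pp B" reads as source node, interaction type, target node, where pp conventionally marks a protein-protein and pd a protein-DNA interaction. The sketch below reads such lines outside Cytoscape. It assumes whitespace-delimited fields and node names without spaces; the function name parse_sif is hypothetical.

```python
def parse_sif(text):
    """Parse simple SIF text into a set of nodes and a list of (source, type, target) edges."""
    nodes, edges = set(), []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = fields[0]
        nodes.add(source)
        if len(fields) >= 3:
            interaction = fields[1]
            # A SIF line may list several targets after the interaction type.
            for target in fields[2:]:
                nodes.add(target)
                edges.append((source, interaction, target))
    return nodes, edges

# Hypothetical file contents; the last line declares an unconnected node.
sif_text = """A pp B
A pd C
D
"""

nodes, edges = parse_sif(sif_text)
print(sorted(nodes))
print(edges)
```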