Wiley Series on
Bioinformatics: Computational Techniques and Engineering
Bioinformatics and computational biology involve the comprehensive application of mathematics, statistics, science, and computer science to the understanding of living systems. Research and development in these areas require cooperation among specialists from the fields of biology, computer science, mathematics, statistics, physics, and related sciences. The objective of this book series is to provide timely treatments of the different aspects of bioinformatics spanning theory, new and established techniques, technologies and tools, and application domains. This series emphasizes algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.
Series Editors: Professor Yi Pan and Professor Albert Y. Zomaya
Knowledge Discovery in Bioinformatics: Techniques, Methods and Applications / Xiaohua Hu & Yi Pan
Grid Computing for Bioinformatics and Computational Biology / Albert Zomaya & El-Ghazali Talbi
Analysis of Biological Networks / Björn H. Junker & Falk Schreiber
Bioinformatics Algorithms: Techniques and Applications / Ion Mandoiu & Alexander Zelikovsky
Machine Learning in Bioinformatics / Yanqing Zhang & Jagath C. Rajapakse
Biomolecular Networks / Luonan Chen, Rui-Sheng Wang, & Xiang-Sun Zhang
Computational Systems Biology / Huma Lodhi
Computational Intelligence and Pattern Analysis in Biology Informatics / Ujjwal Maulik, Sanghamitra Bandyopadhyay, & Jason T. Wang
Mathematics of Bioinformatics: Theory, Practice, and Applications / Matthew He & Sergey Petoukhov
Introduction to Protein Structure Prediction: Methods and Algorithms / Huzefa Rangwala & George Karypis
Data Management of Protein Interaction Networks / Mario Cannataro & Pietro Hiram Guzzi
DATA MANAGEMENT OF PROTEIN INTERACTION NETWORKS
MARIO CANNATARO
PIETRO HIRAM GUZZI
Department of Experimental Medicine and Clinic, University Magna Graecia of Catanzaro
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Cannataro, Mario, 1964-
Data management of protein interaction networks / Mario Cannataro, Pietro Hiram Guzzi.
p. cm. – (Wiley series in bioinformatics ; 17)
ISBN 978-0-470-77040-5 (hardback)
1. Protein-protein interaction–Information resources. 2. Information resources management. I. Guzzi, Pietro Hiram, 1980- II. Title.
QP551.C346 2012
025.06'572644–dc22
2011010581
Printed in the United States of America
eISBN: 9781118103715
oISBN: 9781118103746
ePub: 9781118103739
MOBI: 9781118103722
10 9 8 7 6 5 4 3 2 1
To Angela, Francesco, and Matteo.
M.C.
To my sister, my mother, my father, and those who are close to me.
P.H.G.
CONTENTS
LIST OF FIGURES
LIST OF TABLES
FOREWORD
PREFACE
ACKNOWLEDGMENTS
INTRODUCTION
ACRONYMS
1 INTERACTOMICS
1.1 Interactomics and Omics Sciences
1.2 Genomics and Proteomics
1.3 Representation and Management of Protein Interaction Data
1.4 Analysis of Protein Interaction Networks
1.5 Visualization of Protein Interaction Networks
1.6 Models for Biological Networks
1.7 Flow of Information in Interactomics
1.8 Applications of Interactomics in Biology and Medicine
1.9 Summary
2 TECHNOLOGIES FOR DISCOVERING PROTEIN INTERACTIONS
2.1 Introduction
2.2 Techniques Investigating Physical Interactions
2.3 Technologies Investigating Kinetic Dynamics
2.4 Summary
3 GRAPH THEORY AND APPLICATIONS
3.1 Introduction
3.2 Graph Data Structures
3.3 Graph-Based Problems and Algorithms
3.4 Summary
4 PROTEIN-TO-PROTEIN INTERACTION DATA
4.1 Introduction
4.2 HUPO PSI-MI
4.3 Summary
5 PROTEIN-TO-PROTEIN INTERACTION DATABASES
5.1 Introduction
5.2 Databases of Experimentally Determined Interactions
5.3 Databases of Predicted Interactions
5.4 Metadatabases: Integration of PPI Databases
5.5 Summary
6 MODELS FOR PROTEIN INTERACTION NETWORKS
6.1 Introduction
6.2 Random Graph Model
6.3 Scale-Free Model
6.4 Geometric Random Graph Model
6.5 Stickiness Index (STICKY) Model
6.6 Degree-Weighted Model
6.7 Network Scoring Models
6.8 Summary
7 ALGORITHMS ANALYZING FEATURES OF PROTEIN INTERACTION NETWORKS
7.1 Introduction
7.2 Analysis of Protein Interaction Networks through Centrality Measures
7.3 Extraction of Network Motifs
7.4 Individuation of Protein Complexes
7.5 Summary
8 ALGORITHMS COMPARING PROTEIN INTERACTION NETWORKS
8.1 Introduction
8.2 Local Alignment Algorithms
8.3 Global Alignment Algorithms
8.4 Summary
9 ONTOLOGY-BASED ANALYSIS OF PROTEIN INTERACTION NETWORKS
9.1 Definition of Ontology
9.2 Languages for Modeling Ontologies
9.3 Biomedical Ontologies
9.4 Ontology-Based Analysis of Protein Interaction Data
9.5 Semantic Similarity Measures of Proteins
9.6 The Gene Ontology Annotation Database (GOA)
9.7 FussiMeg and ProteinOn
9.8 Summary
10 VISUALIZATION OF PROTEIN INTERACTION NETWORKS
10.1 Introduction
10.2 Cytoscape
10.3 CytoMCL
10.4 NAViGaTOR
10.5 BioLayout Express3D
10.6 Medusa
10.7 ProViz
10.8 Ondex
10.9 PIVOT
10.10 Pajek
10.11 Graphviz
10.12 GraphCrunch
10.13 VisANT
10.14 PIANA
10.15 Osprey
10.16 cPATH
10.17 PATIKA
10.18 Summary
11 CASE STUDIES IN BIOLOGY AND BIOINFORMATICS
11.1 Analysis of an Interaction Network from Proteomic Data
11.2 Experimental Comparison of Two Interaction Networks
11.3 Ontology-Based Management of PIN (OntoPIN)
11.4 Ontology-Based Prediction of Protein Complexes
12 FUTURE TRENDS
REFERENCES
INDEX
LIST OF FIGURES
1.1 Fragment of the yeast PPI network showing interacting partners of the MCM1 protein. Data are extracted from the MINT database.
1.2 Flow of information in interactomics from wet-lab experiments to knowledge.
3.1 Modeling friendship relations using graphs. The graph shows friendships among four people: Joey, Johnny, Tommy, and Dede. Joey is a friend of Dede, Tommy, and Johnny; and Dede is a friend of Johnny, Joey, and Tommy.
3.2 Example of a graph modeling protein interactions. The graph represents four proteins: A, B, C, and D and the interactions (A, B), (B, C), (B, D), and (C, D).
3.3 (a) Undirected and (b) directed graphs.
3.4 Bipartite graph. Red and yellow colors represent, respectively, the V1 and V2 sets.
3.5 Undirected graph modeling a simple network.
3.6 Graph and its representation as an edge list. Since the graph is undirected, edges are compared only once at a time.
3.7 Graph and its incidence matrix.
3.8 Graph and its adjacency matrix.
3.9 Centrality measures.
3.10 Node degree as centrality measure. Node colors represent the node degree. Bright colors indicate nodes with a low value of node degree.
3.11 Closeness as centrality measure. Bright colors indicate nodes with a low closeness centrality value.
3.12 Betweenness as centrality measure. Bright colors indicate nodes with a low centrality value.
3.13 Comparison of graph traversal algorithms.
4.1 Schema of the PSI-MI XML 2.5 file format. The root of a document is represented by an entry set element that contains one or more entries, a self-contained container describing all the interactions, and the related metadata.
4.2 Protein interaction extracted from the MIPS database encoded in the HUPO PSI-MI XML 2.5 format.
4.3 Interaction list section of the PSI-MI code relative to the protein id 3807.
4.4 Workflow of data within the IMEx consortium. Partners of IMEx separately produce their data. Then they make available all the data following the IMEx rules. Finally, the end user can retrieve such data by using a single interface available through the IMEx web server.
5.1 Snapshot of the DIP database showing the BRCA1 protein in humans and its interacting partners. DIP presents results in a graphic format showing the graph constituted by the BRCA1 protein (in red) and its interactors. Users can also navigate through web links retrieving functional information about BRCA1.
5.2 Snapshot of the BIND database showing the BRCA1 protein in humans and its interacting partners. BIND presents results in a simple tabular format. Users can also use the interaction viewer based on Cytoscape to graphically explore the interactions.
5.3 Snapshot of the MINT database showing the BRCA1 protein in humans and its interacting partners. MINT presents results both in tabular format, on the left, and in a graphic format, on the right, showing the graph constituted by the BRCA1 protein (in red) and its interactors through an embedded viewer.
5.4 Interacting partners of the YAL035W yeast protein obtained by querying the MIPS database. The resulting interaction network can be visualized through an integrated visualizer (as shown in the box on the right).
5.5 Snapshot of the IntAct database showing the BRCA1 protein in humans and its interacting partners.
5.6 Snapshot of the BioGRID database showing the BRCA1 protein in humans and its interacting partners. BioGRID presents results in a graphic format. Tables may be sorted or collapsed.
5.7 Process of prediction of protein–protein interactions. Starting from an existing data set, the algorithms merge existing data and biological knowledge, for example, coded in biological ontologies. The result of such a process is the accumulation of new data stored in derived databases.
5.8 Snapshot of the I2D result page showing the BRCA1 protein in humans and its interacting partners. I2D presents results in a tabular format as the default. Results may be rendered as a graph by using NAViGaTOR.
5.9 Snapshot of the IntNetDB database showing the TP53 protein in humans and its interacting partners. IntNetDB presents results in a tabular format as the default. Users can visualize the graph constituted by the query protein and its interacting partners or can download it as a vectorial image.
5.10 Visualization in STRING: The network represents the BRCA1 query protein (represented as a red node) and its interacting partners. Nodes are colored because they are directly linked to the query protein. Edges, that is, predicted functional links, consist of up to eight lines. Each color represents different evidence for that interaction.
5.11 HAPPI database is created by extracting protein interaction data from HPRD, BIND, MINT, STRING, and OPHID. Once collected, data are integrated using database integration techniques, into a unified data model. Finally data are scored by applying a unified scoring model and annotations are also computationally derived.
5.12 Results visualization in HAPPI. Page contains the BRCA1 query protein and its interacting partners. For each interacting partner the source of interaction and the score of confidence are also reported. Other information can be obtained by browsing the hyperlinks.
5.13 Process of creation of the APID database. The key point of the integration is the unification of all the protein identifiers using the common accepted Uniprot codes. Finally, each interaction is annotated by the calculation of parameters that indicate the reliability of the interaction itself.
5.14 Visualization of results in APID. Figure represents the BRCA1 query protein and its interacting partners (stored in a table in the background). APID also enables the visualization of the corresponding network. Nodes are colored because they are directly linked to the query protein.
5.15 Visualization of results in MiMI. The table represents the BRCA1 query protein and its interacting partners.
5.16 Process of creation of UniHi. UniHi focuses on human protein interactions. Data are extracted from main databases of both predicted and experimental interactions.
5.17 Visualization of results in UniHi. Results are presented in a tabular way as default, and users can also visualize a graph. The graph (in the upper right corner) represents the BRCA1 query protein (represented as a red node) and its interacting partners. Nodes are colored because they are directly linked to the query protein.
7.1 Examples of network motifs. Linear paths are indicated in (a), (b), (c), and (d). Cliques are indicated in (f), (g), and (j). Stars and loops are indicated, respectively, as (e) and (h) and (i).
7.2 Workflow of extraction of network motifs.
7.3 Motifs considered in power graph analysis.
7.4 Fragment of human PPI network showing BRCA1 interacting partners extracted from MINT database.
7.5 Fragment of a PPI network showing the structure of protein complexes. Red and yellow nodes highlight two dense subregions that may represent protein complexes.
7.6 Simulation of the evolution of flow in a network as performed by MCL.
7.7 Workflow of the execution of the prediction of a protein complex through clustering of the input network.
7.8 Three possible ways to combine interactions. Let us consider four proteins and a single bait (Y), which is identified together with the previous ones. Figure depicts three ways to assign interactions to proteins.
7.9 Workflow of the execution of a prediction in ProCope.
7.10 GUI of the IMPRECO tool.
8.1 Process of alignment of two graphs. In this case pairs of correspondent nodes are (v1, u1), (v2, u2), (v3, u3), (v11, u11), and (v4, u9) (correspondences are evidenced by red dotted lines) so the alignment graph Al contains five nodes and the relative edges.
8.2 Home page of the PathBLAST web server.
8.3 Home page of the NetworkBLAST web server.
9.1 Workflow of enrichment analysis.
9.2 Example of GOA.
10.1 Graphical user interface of Cytoscape. The main window is used to visualize the network. The box on the bottom depicts the annotations of the nodes while the boxes on the left offer to the users a set of functionalities (e.g., node selection).
10.2 Graphical user interface of CytoMCL. The main window, fully integrated into Cytoscape, is used to select the algorithm parameters. The box on the left depicts an extracted subnetwork that is visualized through Cytoscape.
11.1 Workflow of analysis of a PIN reconstructed from a proteomic experiment.
11.2 Comparative analysis of two interaction networks.
11.3 Architecture of the annotated database.
11.4 Localization of interacting proteins.
12.1 Overall snapshot of PPI data management.
LIST OF TABLES
2.1 Description of Protein Microarrays
4.1 Current Partners of the IMEx Consortium
5.1 DIP Database Information
5.2 BIND Database Information
5.3 MINT Database Information
5.4 IntAct Database Information
5.5 BIOGRID Database Information
5.6 I2D Database Information
5.7 IntNetDB Database Information
5.8 STRING Database Information
5.9 HAPPI Database Information
5.10 APID Database Information
5.11 MiMI Database Information
5.12 UNIHI Database Information
6.1 Comparison of Random Graph and Scale-Free Models
11.1 Localization of Proteins
FOREWORD
The management and analysis of protein–protein interactions (PPI) are fundamental to the understanding of cellular organizations, processes, and functions. It has been observed that proteins seldom act as single isolated species in the performance of their functions; rather, proteins involved in the same cellular processes often interact with each other. Therefore, the functions of uncharacterized proteins can be predicted through comparison with the interactions of similar known proteins. A detailed examination of a protein–protein interaction network can thus yield significant new insights into protein functions. Traditionally, each laboratory experiment observed only a few protein interactions and yielded a data set of very limited size. Recent large-scale investigations of protein–protein interactions using such techniques as two-hybrid systems, mass spectrometry, and protein microarrays have enriched the available protein interaction data and facilitated the construction of integrated protein–protein interaction networks. Many protein interaction databases are now available. The resulting large volume of protein–protein interaction data has posed a challenge to experimental investigation. Consequently, computational analysis of the networks has become a necessary tool for the determination of functionally associated proteins.
In 2009, I published a book titled Protein Interaction Networks—Computational Analysis (Cambridge University Press), which gave an introduction to the cutting-edge computational approaches to