DATA MANAGEMENT OF PROTEIN INTERACTION NETWORKS

Wiley Series on

Bioinformatics: Computational Techniques and Engineering

Bioinformatics and computational biology involve the comprehensive application of mathematics, statistics, science, and computer science to the understanding of living systems. Research and development in these areas require cooperation among specialists from the fields of biology, computer science, mathematics, statistics, physics, and related sciences. The objective of this book series is to provide timely treatments of the different aspects of bioinformatics spanning theory, new and established techniques, technologies and tools, and application domains. This series emphasizes algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Series Editors: Professor Yi Pan and Professor Albert Y. Zomaya

Knowledge Discovery in Bioinformatics: Techniques, Methods and Applications / Xiaohua Hu & Yi Pan

Grid Computing for Bioinformatics and Computational Biology / Albert Zomaya & El-Ghazali Talbi

Analysis of Biological Networks / Björn H. Junker & Falk Schreiber

Bioinformatics Algorithms: Techniques and Applications / Ion Mandoiu & Alexander Zelikovsky

Machine Learning in Bioinformatics / Yanqing Zhang & Jagath C. Rajapakse

Biomolecular Networks / Luonan Chen, Rui-Sheng Wang, & Xiang-Sun Zhang

Computational Systems Biology / Huma Lodhi

Computational Intelligence and Pattern Analysis in Biology Informatics / Ujjwal Maulik, Sanghamitra Bandyopadhyay, & Jason T. Wang

Mathematics of Bioinformatics: Theory, Practice, and Applications / Matthew He & Sergey Petoukhov

Introduction to Protein Structure Prediction: Methods and Algorithms / Huzefa Rangwala & George Karypis

Data Management of Protein Interaction Networks / Mario Cannataro & Pietro Hiram Guzzi

DATA MANAGEMENT OF PROTEIN INTERACTION NETWORKS

MARIO CANNATARO
PIETRO HIRAM GUZZI
Department of Experimental Medicine and Clinic
University Magna Graecia of Catanzaro

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Cannataro, Mario, 1964-
Data management of protein interaction networks / Mario Cannataro, Pietro Hiram Guzzi.
p. cm. – (Wiley series in bioinformatics ; 17)
ISBN 978-0-470-77040-5 (hardback)
1. Protein-protein interaction–Information resources. 2. Information resources management. I. Guzzi, Pietro Hiram, 1980- II. Title.
QP551.C346 2012
025.06'572644–dc22
2011010581

Printed in the United States of America

eISBN: 9781118103715
oISBN: 9781118103746
ePub: 9781118103739
MOBI: 9781118103722

10 9 8 7 6 5 4 3 2 1

To Angela, Francesco, and Matteo.

M.C.

To my sister, my mother, my father, and those who are close to me.

P.H.G.

CONTENTS

LIST OF FIGURES
LIST OF TABLES
FOREWORD
PREFACE
ACKNOWLEDGMENTS
INTRODUCTION
ACRONYMS

1 INTERACTOMICS
1.1 Interactomics and Omics Sciences
1.2 Genomics and Proteomics
1.3 Representation and Management of Protein Interaction Data
1.4 Analysis of Protein Interaction Networks
1.5 Visualization of Protein Interaction Networks
1.6 Models for Biological Networks
1.7 Flow of Information in Interactomics
1.8 Applications of Interactomics in Biology and Medicine
1.9 Summary

2 TECHNOLOGIES FOR DISCOVERING PROTEIN INTERACTIONS
2.1 Introduction
2.2 Techniques Investigating Physical Interactions
2.3 Technologies Investigating Kinetic Dynamics
2.4 Summary

3 GRAPH THEORY AND APPLICATIONS
3.1 Introduction
3.2 Graph Data Structures
3.3 Graph-Based Problems and Algorithms
3.4 Summary

4 PROTEIN-TO-PROTEIN INTERACTION DATA
4.1 Introduction
4.2 HUPO PSI-MI
4.3 Summary

5 PROTEIN-TO-PROTEIN INTERACTION DATABASES
5.1 Introduction
5.2 Databases of Experimentally Determined Interactions
5.3 Databases of Predicted Interactions
5.4 Metadatabases: Integration of PPI Databases
5.5 Summary

6 MODELS FOR PROTEIN INTERACTION NETWORKS
6.1 Introduction
6.2 Random Graph Model
6.3 Scale-Free Model
6.4 Geometric Random Graph Model
6.5 Stickiness Index (STICKY) Model
6.6 Degree-Weighted Model
6.7 Network Scoring Models
6.8 Summary

7 ALGORITHMS ANALYZING FEATURES OF PROTEIN INTERACTION NETWORKS
7.1 Introduction
7.2 Analysis of Protein Interaction Networks through Centrality Measures
7.3 Extraction of Network Motifs
7.4 Individuation of Protein Complexes
7.5 Summary

8 ALGORITHMS COMPARING PROTEIN INTERACTION NETWORKS
8.1 Introduction
8.2 Local Alignment Algorithms
8.3 Global Alignment Algorithms
8.4 Summary

9 ONTOLOGY-BASED ANALYSIS OF PROTEIN INTERACTION NETWORKS
9.1 Definition of Ontology
9.2 Languages for Modeling Ontologies
9.3 Biomedical Ontologies
9.4 Ontology-Based Analysis of Protein Interaction Data
9.5 Semantic Similarity Measures of Proteins
9.6 The Gene Ontology Annotation Database (GOA)
9.7 FussiMeg and ProteinOn
9.8 Summary

10 VISUALIZATION OF PROTEIN INTERACTION NETWORKS
10.1 Introduction
10.2 Cytoscape
10.3 CytoMCL
10.4 NAViGaTOR
10.5 BioLayout Express3D
10.6 Medusa
10.7 ProViz
10.8 Ondex
10.9 PIVOT
10.10 Pajek
10.11 Graphviz
10.12 GraphCrunch
10.13 VisANT
10.14 PIANA
10.15 Osprey
10.16 cPATH
10.17 PATIKA
10.18 Summary

11 CASE STUDIES IN BIOLOGY AND BIOINFORMATICS
11.1 Analysis of an Interaction Network from Proteomic Data
11.2 Experimental Comparison of Two Interaction Networks
11.3 Ontology-Based Management of PIN (OntoPIN)
11.4 Ontology-Based Prediction of Protein Complexes

12 FUTURE TRENDS

REFERENCES

INDEX

LIST OF FIGURES

1.1 Fragment of the yeast PPI network showing interacting partners of the MCM1 protein. Data are extracted from the MINT database.

1.2 Flow of information in interactomics from wet-lab experiments to knowledge.

3.1 Modeling friendship relations using graphs. The graph shows friendships among four people: Joey, Johnny, Tommy, and Dede. Joey is a friend of Dede, Tommy, and Johnny; and Dede is a friend of Johnny, Joey, and Tommy.

3.2 Example of a graph modeling protein interactions. The graph represents four proteins: A, B, C, and D and the interactions (A, B), (B, C), (B, D), and (C, D).

3.3 (a) Undirected and (b) directed graphs.

3.4 Bipartite graph. Red and yellow colors represent, respectively, the V1 and V2 sets.

3.5 Undirected graph modeling a simple network.

3.6 Graph and its representation as an edge list. Since the graph is undirected, edges are compared only once a time.

3.7 Graph and its incidence matrix.

3.8 Graph and its adjacency matrix.

3.9 Centrality measures.

3.10 Node degree as centrality measure. Node colors represent the node degree. Bright colors indicate nodes with a low value of node degree.

3.11 Closeness as centrality measure. Bright colors indicate nodes with a low closeness centrality value.

3.12 Betweenness as centrality measure. Bright colors indicate nodes with a low centrality value.

3.13 Comparison of graph traversal algorithms.

4.1 Schema of the PSI-MI XML 2.5 file format. The root of a document is represented by an entry set element that contains one or more entries, a self-contained container describing all the interactions, and the related metadata.

4.2 Protein interaction extracted from the MIPS database encoded in the HUPO PSI-MI XML 2.5 format.

4.3 Interaction list section of the PSI-MI code relative to the protein id 3807.

4.4 Workflow of data within the IMEx consortium. Partners of IMEx separately produce their data. Then they make available all the data following the IMEx rules. Finally, the end user can retrieve such data by using a single interface available through the IMEx web server.

5.1 Snapshot of the DIP database showing the BRCA1 protein in humans and its interacting partners. DIP presents results in a graphic format showing the graph constituted by the BRCA1 protein (in red) and its interactors. Users can also navigate through web links retrieving functional information about BRCA1.

5.2 Snapshot of the BIND database showing the BRCA1 protein in humans and its interacting partners. BIND presents results in a simple tabular format. Users can also use the interaction viewer based on Cytoscape to graphically explore the interactions.

5.3 Snapshot of the MINT database showing the BRCA1 protein in humans and its interacting partners. MINT presents results both in tabular format, on the left, and in a graphic format, on the right, showing the graph constituted by the BRCA1 protein (in red) and its interactors through an embedded viewer.

5.4 Interacting partners of the YAL035W yeast protein obtained by querying the MIPS database. The resulting interaction network can be visualized through an integrated visualizer (as shown in the box on the right).

5.5 Snapshot of the IntAct database showing the BRCA1 protein in humans and its interacting partners.

5.6 Snapshot of the BioGRID database showing the BRCA1 protein in humans and its interacting partners. BioGRID presents results in a graphic format. Tables may be sorted or collapsed.

5.7 Process of prediction of protein–protein interactions. Starting from an existing data set, the algorithms merge existing data and biological knowledge, for example, coded in biological ontologies. The result of such a process is the accumulation of new data stored in derived databases.

5.8 Snapshot of the I2D result page showing the BRCA1 protein in humans and its interacting partners. I2D presents results in a tabular format as the default. Results may be rendered as a graph by using NAViGaTOR.

5.9 Snapshot of the IntNetDB database showing the TP53 protein in humans and its interacting partners. IntNetDB presents results in a tabular format as the default. Users can visualize the graph constituted by the query protein and its interacting partners or can download it as a vectorial image.

5.10 Visualization in STRING: The network represents the BRCA1 query protein (represented as a red node) and its interacting partners. Nodes are colored because they are directly linked to the query protein. Edges, that is, predicted functional links, consist of up to eight lines. Each color represents different evidence for that interaction.

5.11 HAPPI database is created by extracting protein interaction data from HPRD, BIND, MINT, STRING, and OPHID. Once collected, data are integrated using database integration techniques, into a unified data model. Finally data are scored by applying a unified scoring model and annotations are also computationally derived.

5.12 Results visualization in HAPPI. Page contains the BRCA1 query protein and its interacting partners. For each interacting partner the source of interaction and the score of confidence are also reported. Other information can be obtained by browsing the hyperlinks.

5.13 Process of creation of the APID database. The key point of the integration is the unification of all the protein identifiers using the common accepted Uniprot codes. Finally, each interaction is annotated by the calculation of parameters that indicate the reliability of the interaction itself.

5.14 Visualization of results in APID. Figure represents the BRCA1 query protein and its interacting partners (stored in a table in the background). APID also enables the visualization of the corresponding network. Nodes are colored because they are directly linked to the query protein.

5.15 Visualization of results in MiMI. The table represents the BRCA1 query protein and its interacting partners.

5.16 Process of creation of UniHi. UniHi focuses on human protein interactions. Data are extracted from main databases of both predicted and experimental interactions.

5.17 Visualization of results in UniHi. Results are presented in a tabular way as default, and users can also visualize a graph. The graph (in the upper right corner) represents the BRCA1 query protein (represented as a red node) and its interacting partners. Nodes are colored because they are directly linked to the query protein.

7.1 Examples of network motifs. Linear paths are indicated in (a), (b), (c), and (d). Cliques are indicated in (f), (g), and (j). Stars and loops are indicated, respectively, as (e) and (h) and (i).

7.2 Workflow of extraction of network motifs.

7.3 Motifs considered in power graph analysis.

7.4 Fragment of human PPI network showing BRCA1 interacting partners extracted from MINT database.

7.5 Fragment of a PPI network showing the structure of protein complexes. Red and yellow nodes highlight two dense subregions that may represent protein complexes.

7.6 Simulation of the evolution of flow in a network as performed by MCL.

7.7 Workflow of the execution of the prediction of a protein complex through clustering of the input network.

7.8 Three possible ways to combine interactions. Let us consider four proteins and a single bait (Y), which is identified together with the previous ones. Figure depicts three ways to assign interactions to proteins.

7.9 Workflow of the execution of a prediction in ProCope.

7.10 GUI of the IMPRECO tool.

8.1 Process of alignment of two graphs. In this case pairs of correspondent nodes are (v1, u1), (v2, u2), (v3, u3), (v11, u11), and (v4, u9) (correspondences are evidenced by red dotted lines) so the alignment graph Al contains five nodes and the relative edges.

8.2 Home page of the PathBLAST web server.

8.3 Home page of the NetworkBLAST web server.

9.1 Workflow of enrichment analysis.

9.2 Example of GOA.

10.1 Graphical user interface of Cytoscape. The main window is used to visualize the network. The box on the bottom depicts the annotations of the nodes while the boxes on the left offer to the users a set of functionalities (e.g., node selection).

10.2 Graphical user interface of CytoMCL. The main window, fully integrated into Cytoscape, is used to select the algorithm parameters. The box on the left depicts an extracted subnetwork that is visualized through Cytoscape.

11.1 Workflow of analysis of a PIN reconstructed from a proteomic experiment.

11.2 Comparative analysis of two interaction networks.

11.3 Architecture of the annotated database.

11.4 Localization of interacting proteins.

12.1 Overall snapshot of PPI data management.

LIST OF TABLES

2.1 Description of Protein Microarrays
4.1 Current Partners of the IMEx Consortium
5.1 DIP Database Information
5.2 BIND Database Information
5.3 MINT Database Information
5.4 IntAct Database Information
5.5 BIOGRID Database Information
5.6 I2D Database Information
5.7 IntNetDB Database Information
5.8 STRING Database Information
5.9 HAPPI Database Information
5.10 APID Database Information
5.11 MiMI Database Information
5.12 UNIHI Database Information
6.1 Comparison of Random Graph and Scale-Free Models
11.1 Localization of Proteins

FOREWORD


The management and analysis of protein–protein interactions (PPI) is fundamental to the understanding of cellular organizations, processes, and functions. It has been observed that proteins seldom act as single isolated species in the performance of their functions; rather, proteins involved in the same cellular processes often interact with each other. Therefore, the functions of uncharacterized proteins can be predicted through comparison with the interactions of similar known proteins. A detailed examination of a protein–protein interaction network can thus yield significant new insights into protein functions. Traditionally, each laboratory experiment observes only a few protein interactions and yields a data set of very limited size. Recent large-scale investigations of protein–protein interactions using such techniques as two-hybrid systems, mass spectrometry, and protein microarrays have enriched the available protein interaction data and facilitated the construction of integrated protein–protein interaction networks. Many protein interaction databases are available. The resulting large volume of protein–protein interaction data has posed a challenge to experimental investigation. Consequently, computational analysis of the networks has become a necessary tool for the determination of functionally associated proteins.

In 2009, I published a book titled Protein Interaction Networks—Computational Analysis (Cambridge University Press), which gave an introduction to the cutting-edge computational approaches to