biological networks and analysis of experimental data in...

DDT • Volume 10, Number 9 • May 2005

Reviews • DRUG DISCOVERY TODAY: BIOSILICO

www.drugdiscoverytoday.com

REVIEWS

Biological networks and analysis of experimental data in drug discovery

Yuri Nikolsky*, Tatiana Nikolskaya and Andrej Bugrim

GeneGo, 500 Renaissance Drive, #106, St. Joseph, MI 49085, USA
*e-mail: [email protected]

Cellular life can be represented and studied as the ‘interactome’, a dynamic network of biochemical reactions and signaling interactions between active proteins. Systemic network analysis can be used for the integration and functional interpretation of high-throughput experimental data, which are abundant in drug discovery but currently poorly utilized. The composition and topology of complex networks are closely associated with vital cellular functions, which has important implications for life science research. Here we outline recent advances in the field, available tools and applications of network analysis in drug discovery.

1359-6446/05/$ – see front matter ©2005 Elsevier Ltd. All rights reserved. PII: S1359-6446(05)03420-3

Over the past several years there has been a paradigm shift in life science research as a result of unprecedented advances in several laboratory techniques, such as automated DNA sequencing, global gene expression measurements, and proteomics and metabonomics techniques. The high-throughput data, collectively referred to as ‘OMICs’ data, are ubiquitous throughout the drug discovery pipeline, from target identification and validation to the development and testing of drug candidates. However, OMICs data are poorly utilized because of the lack of adequate methods for interpretation in the context of disease and biological function. Although bioinformatics has developed robust statistical solutions for evaluating the significance and clustering of data points, statistics alone does not explain ‘the underlying biology’.

The complexity of our own biology requires a system-wide approach to data analysis, which can be defined as the integration of ‘OMICs’ data using computational methods [1]. It is clear from the field that identifying the ‘parts list’ of all the genes and proteins is insufficient to understand the whole. It is the assembly of these parts (the general schema, the modules and elements) and the dynamics of changes in response to stimuli that is truly the key to understanding life, form and function [2,3]. The assembly of ‘cellular machinery’ is most effectively presented as the ‘interactome’, the network of interconnected signaling, regulatory and biochemical networks with proteins as the nodes and physical protein–protein interactions as edges [4,5]. As in many fields of science, technology and social life, the topology and dynamics of complex networks can be studied by graph theory [5]. The information about protein interactions has been collated from the vast amount of published experimental data, annotated and assembled in interaction databases. Network data analysis tools are now commercially available and robust enough for simultaneous processing of multiple ‘whole genome’ data files, such as human expression microarrays. Recently, the interpretation of experimental ‘OMICs’ datasets in the context of accumulated knowledge on human functional networks was described as the first step in studying complex systems [2,6]. Now, we can consider the building of the basic framework, databases and logistics needed to accomplish this. Network-centered data analysis is well underway in the major pharmaceutical companies. In this review, we will describe the state of the field, present the available tools and suggest applications of network analysis throughout the drug discovery pipeline.

Protein interactions as building blocks for biological networks

There are multiple ways to elucidate protein–protein interactions. One approach is to screen the experimental literature, using text-mining algorithms, for co-occurrence (and therefore association) of gene/protein symbols and names in the same text. Typically, Natural Language Processing (NLP) and other text-mining algorithms are used for the automated ‘mining’ of abstracts and titles of PubMed articles [7–9]. It is generally believed (pers. commun. from pharmaceutical researchers), and supported by comparative studies, that up to a half of NLP associations do not correspond to experimentally verified protein interactions [9,10], although over 60% of shown interactions can be elicited by automated text mining [11]. The reliability of NLP-derived associations can be enhanced by compiling field-specific synonym dictionaries, using longer word strings for searching, querying against full-text articles, and applying statistical methods (reviewed in [9]). In a recent study, the NLP engine MedScan was used to extract 280,000 functional relations, including 20,000 protein interaction facts between human proteins, from full-text articles with a precision of 91% for 361 randomly extracted protein interactions [12].
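The co-occurrence idea can be illustrated with a minimal Python sketch. It is not the method of any tool cited here: the function name `cooccurrence_counts`, the toy abstracts and the whole-word matching rule are all assumptions made for illustration. The third toy abstract shows how two symbols can co-occur without any interaction being reported, which is the source of the false associations discussed above.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(texts, symbols):
    """Count how many texts mention each unordered pair of symbols."""
    # Whole-word, case-sensitive matching; a real system would also need
    # synonym dictionaries and full-text sources (assumption: none used here).
    patterns = {s: re.compile(r"\b" + re.escape(s) + r"\b") for s in symbols}
    counts = Counter()
    for text in texts:
        present = sorted(s for s, p in patterns.items() if p.search(text))
        counts.update(combinations(present, 2))
    return counts

# Hypothetical abstracts, invented for this example.
abstracts = [
    "MYC binds MAX to activate transcription.",
    "MAX and MYC form a heterodimer.",
    "TP53 is stabilized after DNA damage; MYC levels were unchanged.",
]
pairs = cooccurrence_counts(abstracts, ["MYC", "MAX", "TP53"])
print(pairs.most_common())
```

Pairs are stored in alphabetical order, so `pairs[("MAX", "MYC")]` gives the number of texts that mention both symbols.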

The interactions can also be derived from high-throughput experimentation. For example, the yeast 2-hybrid (Y2H) screen identifies protein interactions in yeast cells [13]. A widely used wet laboratory technique, Y2H was scaled up for global mapping of protein interactions in yeast [14], the fly D. melanogaster [15] and the worm C. elegans [16], and became the technology base for several tools and discovery companies such as Curagen (www.curagen.com) and Hybrigenics (www.hybrigenics.fr). However, Y2H-derived interactions are known for a high level (over 50%) of false-positive and false-negative interactions [17,18]. In a recent study, over 70,000 interactions for 6,231 human proteins were predicted from the interactions between these proteins’ orthologs in yeast, worm and fly [19]. The accuracy of the predicted interactions remains questionable. Although it was assessed computationally, based on the relative correspondence of interacting protein pairs to gene ontology processes, the interactions were not directly compared with any of the high-confidence experimental sets now available [20,21]. The interactions can also be deduced from condition-specific co-occurrence of gene expression, based on the assumption that interacting proteins must be expressed in sync [22], especially when encoded by homologous genes [23]. Because expression data are abundant and readily obtainable even from small cell populations, it is often stated that co-expression-based clustering will become the major approach for determining tissue-, disease- and treatment-specific interactions. However, the overall confidence in co-expression-derived interactions in yeast is about 50% (47% anti-correlation for novel interactions) [24,25]. Another method, co-immunoprecipitation (Co-IP), consists of the affinity precipitation of protein complexes under mild conditions using antibodies to one of the complex’s subunits, followed by mass spectrometry or western blot analysis. Unlike the methods discussed above, Co-IP enables direct and quantitative detection of interactions between active proteins, and so it is a true proteomics method. Co-IP was used in simultaneously published studies of the yeast interactome [26,27]. Other, less commonly used experimental and computational methods include protein arrays, fusion proteins, neighbor genes in operons (for prokaryotic proteins), the paralogous verification method (PVM), co-localization, synthetic lethality screens and phage display, each method with its merits and biases (reviewed in [28,29]). The overall confidence in interactions, defined as the intersection between interacting pairs obtained using different methods, remains dismal. For instance, over 80,000 protein–protein interactions were detected in S. cerevisiae by six high-throughput (HT) experimental methods, but only 2,400 of these interactions were supported by more than one method [30]. Such low overlap limits the applicability of a direct comparison between HT interaction datasets of different experimental origin. Recently, statistical methods were developed for enhancing the confidence of interactions derived from low-confidence data and for analyzing the general parameters of interaction datasets [31,32]. Y2H and Co-IP protein interaction data obtained in yeast were extensively compared for experimental biases and correlation [31]. Although only 6% of Y2H interactions were confirmed by the Co-IP method, the authors managed to develop a statistical regression model for predicting the biological relevance and confidence of HT interactions based on sub-network analysis [31]. In another study, graph-theoretical statistics were used for comparative analysis of the interaction datasets in yeast [32]. The parameters and algorithms were implemented in the publicly available tool TopNet for comparison of biological sub-networks of different origin (http://networks.gersteinlab.org/genome/interactions/networks/core.html).
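The notion of confidence as the intersection between methods can be made concrete with a short Python sketch. This is an illustration only, not the procedure used in the cited studies: the function `multi_method_support`, the method labels and the toy interaction lists are hypothetical.

```python
from collections import Counter

def multi_method_support(datasets, min_methods=2):
    """Return the undirected interactions reported by at least min_methods methods."""
    support = Counter()
    for pairs in datasets.values():
        # frozenset treats (A, B) and (B, A) as the same interaction, and the
        # set comprehension counts each interaction once per method.
        support.update({frozenset(p) for p in pairs})
    return {pair for pair, n in support.items() if n >= min_methods}

# Hypothetical interaction lists, one per experimental method.
datasets = {
    "Y2H": [("A", "B"), ("B", "C"), ("C", "D")],
    "CoIP": [("B", "A"), ("D", "E")],
    "coexpression": [("C", "B"), ("E", "F")],
}
supported = multi_method_support(datasets)
print(sorted(tuple(sorted(p)) for p in supported))
```

Here only two of the five distinct toy interactions are supported by more than one method, mirroring on a small scale the low overlap described above.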

In general, it is believed that only manually curated physical protein interactions extracted from original ‘small scale’ experimental literature can be used with sufficient confidence [28,29].

Dozens of original and compilation academic protein–protein and protein–DNA interaction databases are available, covering high-throughput and ‘small scale’ experimental interactions as well as computed interactions. We have outlined some key database projects, pathway databases and analytical tools in Table 1.

TABLE 1

Tools and databases for network analysis

Protein interaction databases

| Name | Description | URL address |
| --- | --- | --- |
| BIND | A curated database of interactions, derived both from the literature and experimental datasets. 8,500 interactions are deduced from high-confidence small scale experiments from multiple species. BIND can be used for querying and as a browser [65] | http://bind.ca |
| DIP | A database of experimentally determined protein–protein interactions, mostly from yeast. Around 10% of DIP interactions are derived from high-confidence small scale experiments [29,66] | http://dip.doe-mbi.ucla.edu/ |
| HPRD | Human Protein Reference Database provides curated human-specific protein interactions; currently >22,000 interactions for >10,000 human proteins. It also contains 7 signaling maps. HPRD is used as a browser for interactions, protein annotations, motifs and domains [20] | http://www.hprd.org/ |
| MetaCore database | A manually curated interactions database for >90% of human proteins with known function [53,56]. Content of MetaCore (see below) | http://www.genego.com |
| MINT and HomoMINT | A searchable interaction database with a total of 40,000 interactions, mostly from yeast and fly. 70% of interactions are from lower-confidence Y2H screens. Only 3,800 interactions include human proteins [67] | http://mint.bio.uniroma2.it/mint/ |
| MIPS | A well-known searchable database of protein–protein interactions from high-quality small scale experiments in yeast [68] and most recently mammals [21]. Several hundred human interactions | http://mips.gsf.de |
| PathArt database | A manually curated database of ~7,500 protein–protein and protein–compound interactions and pathways. Content of PathArt (see below) | http://jubilantbiosys.com |
| Pathway Analysis database | The mammalian interactions content of Pathway Analyst (see below). The number of interactions is not announced | http://www.ingenuity.com |
| ResNet database | Automatically extracted and manually validated database of human protein interactions (>30,000), transcriptional regulation (10,000), protein modifications (10,000) and functional regulations (350,000) [12]. Content of PathwayAssist (see below) | http://www.ariadnegenomics.com |
| STRING | A database of known and predicted protein interactions deduced from over 110 genomes, high-throughput experiments and gene co-expression [69] | http://string.embl.de |

Pathways maps and process ontologies

| Name | Description | URL address |
| --- | --- | --- |
| BIOCarta | A commercial collection of ~350 maps on human biology representing canonical pathways | http://www.biocarta.com/genes/index.asp |
| Gene Ontology | The most often referred to publicly available protein classification based on cellular processes, developed by the Gene Ontology Consortium [52] | http://www.geneontology.org |
| GenMAPP | Gene MicroArray Pathway Profiler is a database of GO-derived diagrams designed for viewing and analyzing gene expression data [17] | http://www.genmapp.org |
| KEGG | A well-known database of generic metabolic maps for bacteria and eukaryotes. Recently added some regulation maps. Software allows comparison of genome maps, graph comparison and path computation [70] | http://www.genome.jp/kegg/pathway.html |
| MetaCore, pathway module | A part of the commercial tool MetaCore™. The pathways module contains 350 interactive maps for >2,000 established pathways in human signaling, regulation and metabolism. HT data can be superimposed on the maps and networks built for any object | http://www.genego.com |
| Protein Lounge | A commercial package with ~300 human metabolic and signaling maps | http://www.proteinlounge.com |

Network data mining suites

| Name | Description | URL address |
| --- | --- | --- |
| MetaCore/MetaDrug, GeneGo, Inc. | An integrated analytical suite based on a manually curated database of human protein–protein and protein–DNA interactions. All types of HT data can be used for building networks. Medicinal chemistry module allows predicting human metabolism and toxicity for novel compounds. Networks are connected to functional processes and 350 proprietary metabolic and signaling maps. Web access or enterprise solution | http://www.genego.com |
| Pathway Analyst, Ingenuity, Inc. | An integrated analytical suite based on a manually curated database of literature-derived mammalian protein–protein interactions. Visualization on networks and analysis of HT data. Networks are connected to GO processes, 60 KEGG metabolic maps and Cell Signaling Inc.’s signaling maps. Web access, enterprise solution | http://www.ingenuity.com |
| PathArt, Jubilant Biosystems | A curated database of generic protein interactions, pathways and bioactive molecules supported by HT data parsers and visualization tools. Connectivity with ligand databases, GO categories. Web access | http://www.jubilantbiosys.com/pd.htm |
| PathwayAssist, Ariadne Genomics | A software tool for mapping HT data on networks, maps and pathways. The source of interaction data is NLP mining of PubMed abstracts. PathwayAssist is bundled with Jubilant and Integrated Genomics pathways content. A desktop product | http://ariadnegenomics.com |

Architecture and composition of biological networks

Biological networks are presented as nodes (proteins, genes and compounds) connected by edges (protein–protein, protein–gene and protein–compound interactions, and metabolic reactions). Depending on the type of underlying data and the interaction mechanism, the edges are either directed or undirected. For instance, protein binding interactions derived from Y2H assays are undirected, while most of the physical interactions extracted from full-text articles have one direction (e.g. protein A activates protein B, but not vice versa). Networks can be described and compared using several major parameters (reviewed in [32]; see Figure 1):

(1) Average degree ($K$) is the average number of edges per node. In directed networks one can distinguish incoming degree ($K_{in}$), outgoing degree ($K_{out}$) and total degree.

(2) Average clustering coefficient ($C$) is the average ratio of the actual number of links between a node’s neighbors to the maximum possible number of links between them. The clustering coefficient for node $i$ can be calculated as $C_i = 2n_i / (k(k-1))$, where $n_i$ is the actual number of links connecting the $k$ neighbors of the node to each other (Figure 1b).

(3) Shortest path $l_{AB}$ for a pair of nodes is the minimum number of network edges that must be traversed to travel from A to B. On a directed graph the shortest path from A to B may differ from the path from B to A, as shown in Figure 1a. Characteristic path length ($L$) is the average length of the shortest paths for all pairs of nodes on the graph.

(4) Diameter ($D$) is the longest distance between a pair of nodes on the graph.

The ‘default’ random network theory states that pairs of nodes are connected with equal probability and the degrees follow a Poisson distribution. This implies that it is very unlikely for any node to have significantly more edges than average [32]. The analysis of the yeast interactome (yeast being the best studied organism in terms of interactions) revealed that the networks are remarkably non-random and the distribution of edges is very heterogeneous, with a few highly connected nodes (hubs) and the majority of nodes with very few edges. Such topology is defined as scale-free, meaning that the node connectivity obeys a power law: $P(k) \sim k^{-\gamma}$, where $\gamma$ is the exponent of the power law and $P(k)$ is the fraction of nodes in the network with exactly $k$ links [33]. Interestingly, the hubs are predominantly connected to low-degree nodes, a feature that gives biological networks the property of robustness. Removal of even a substantial fraction of nodes still leaves the network connected [34]. At the level of global architecture, networks of different origin (e.g. metabolic, regulatory, protein interactions, networks for different organisms) share the same properties [15,33,35,36]. Taken together, the metabolic reactions and signaling interactions form a large cluster linked via molecular nodes shared among many cellular processes [18]. This runs contrary to a ‘traditional’ model of small and relatively independent linear pathways.
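The four parameters listed above can be computed on a toy directed network with a few lines of Python. This is a minimal sketch using only the standard library; the helper names (`build`, `degree`, `clustering`, `shortest_paths`, `path_stats`) and the example edges are assumptions made for illustration. The clustering coefficient is evaluated on the undirected version of the graph, and path statistics are taken over ordered pairs of nodes that are actually reachable.

```python
from collections import deque
from itertools import combinations

def build(edges):
    """Return successor sets (directed) and neighbor sets (undirected)."""
    succ, nbrs = {}, {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
        succ.setdefault(b, set())
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return succ, nbrs

def degree(succ, node):
    """Incoming, outgoing and total degree of a node."""
    k_out = len(succ[node])
    k_in = sum(node in targets for targets in succ.values())
    return k_in, k_out, k_in + k_out

def clustering(nbrs, node):
    """C_i = 2 n_i / (k (k - 1)) over the undirected neighborhood of node."""
    k = len(nbrs[node])
    if k < 2:
        return 0.0
    n_i = sum(1 for u, v in combinations(nbrs[node], 2) if v in nbrs[u])
    return 2 * n_i / (k * (k - 1))

def shortest_paths(succ, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(succ):
    """Characteristic path length L and diameter D over reachable ordered pairs."""
    lengths = [d for s in succ
               for t, d in shortest_paths(succ, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)

# Hypothetical directed network (needs at least one edge).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
succ, nbrs = build(edges)
print(degree(succ, "A"), clustering(nbrs, "A"), path_stats(succ))
```

In this toy network the shortest path from A to B has one edge while the path from B to A has three, which illustrates the direction dependence noted in item (3).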

Network modules and substructure search

The key property of biological networks is their modular nature [4]. According to modular theory, various kinds of cellular functionality are provided by relatively small, transient but tightly connected networks of molecules (5–25 nodes) that are engaged in performing specific functions. Identification of such modules is a non-trivial problem, as complex networks can be parsed into subsets in many different ways, potentially generating billions of combinations. For example, our analysis of a network built from a subset of 35,000 experimentally proven human signaling interactions in the MetaCore™ database revealed approximately 2 billion linear 5-step network paths that were all physically possible. It is clear that only a few of these paths are realized as active pathways in any cell at any particular time.
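How quickly the number of candidate paths grows can be seen in a small Python sketch. It assumes that a ‘linear’ path means a directed path that visits no node twice, which is an interpretation rather than a definition given in the text; the function `count_linear_paths` and the toy network are hypothetical and are not the algorithm used in MetaCore.

```python
def count_linear_paths(succ, steps):
    """Count directed paths with exactly `steps` edges that never revisit a node."""
    def extend(node, remaining, visited):
        if remaining == 0:
            return 1
        return sum(
            extend(nxt, remaining - 1, visited | {nxt})
            for nxt in succ.get(node, ())
            if nxt not in visited
        )
    return sum(extend(start, steps, {start}) for start in succ)

# Toy network: every one of 7 nodes is linked to every other node.
nodes = "ABCDEFG"
complete = {u: {v for v in nodes if v != u} for u in nodes}
print(count_linear_paths(complete, 1), count_linear_paths(complete, 5))
```

Even this seven-node example yields thousands of 5-step paths, so exhaustive enumeration on a network with tens of thousands of interactions is only practical with additional constraints.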

FIGURE 1

Network architecture and analysis. (a) A directed network has three types of node degree: $K_{in}$, $K_{out}$ and $K_{total}$ (shown for node A). The shortest paths between two nodes depend on direction, e.g. A→B, red; B→A, green. (b) Calculation of a clustering coefficient for node A (solid lines show actual interactions, dashed lines other possible connections). (c) Several types of network motifs (a clique is a pattern where every node is connected to every other node). (d) When analyzing expression data within networks, both similarity of expression and connectivity are taken into account: nodes A to E represent a potential condition-specific module, as they have similar expression patterns and are connected into a tight cluster. Nodes F, H, J, L and M also share an expression pattern, but are not well connected, hence are less likely to constitute a functional pathway.

[Figure 1 panel annotations: (a) Kin = 3; Kout = 2; Ktotal = 5; lAB = 1; lBA = 3. (b) nA = 6, k = 4, CA = 0.5. (c) Clique; feedback loop; feedforward loop.]

Different approaches have been offered for automated parsing of large networks into modules. One set of methods identifies the modules using various clustering algorithms. These include the Monte Carlo optimization method for finding tightly connected clusters of nodes [37], clustering based on shortest-path-length distribution [38] and unsupervised graph clustering [39]. It was shown that some clusters identified using approaches such as these correspond either to known protein complexes or to metabolic pathways [37]. Another approach consists of parsing networks into motifs, defined as fairly simple subgraphs that share certain structural and functional features, such as a feedback or feed-forward loop [40] (Figure 1c). The number of different motifs in the network is calculated and then compared with the number of the same motifs in a randomly connected network. Those motifs in which the network is enriched when compared with the random network may represent potential functional modules. Such motifs were identified in the regulatory networks of E. coli [41] and yeast [42]. It should be noted that the performance of these algorithms is usually judged by how well they can recall known functional units or processes. On this account, all these algorithms are prone to a high level of false positives (i.e. recalling modules that do not correspond to any known pathways).
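The motif comparison described above can be sketched in Python for one motif type, the feed-forward loop. The sketch makes two loudly stated assumptions: the random reference networks simply keep the same nodes and the same number of directed edges (published studies typically use more careful randomization schemes), and the function names and toy edges are invented for illustration.

```python
import random
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count node triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    succ = {}
    for a, b in edge_set:
        succ.setdefault(a, set()).add(b)
    return sum(
        1
        for a, b in edge_set
        for c in succ.get(b, ())
        if len({a, b, c}) == 3 and (a, c) in edge_set
    )

def random_edges(nodes, n_edges, rng):
    """Pick n_edges distinct directed edges uniformly, without self-loops."""
    return rng.sample(list(permutations(nodes, 2)), n_edges)

def motif_enrichment(edges, n_random=200, seed=0):
    """Return the observed motif count and its mean over random networks."""
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    n_edges = len(set(edges))
    observed = count_feed_forward_loops(edges)
    random_counts = [
        count_feed_forward_loops(random_edges(nodes, n_edges, rng))
        for _ in range(n_random)
    ]
    return observed, sum(random_counts) / n_random

# Hypothetical regulatory network.
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W"), ("X", "W")]
observed, random_mean = motif_enrichment(edges)
print(observed, random_mean)
```

A motif would be called enriched when the observed count clearly exceeds the random mean; on a network this small the comparison is only illustrative.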

Conditionally active functional modules can also be elucidated by the analysis of high-throughput molecular data (e.g. gene expression, protein abundance, metabolic profiles) in the context of networks. One straightforward approach relies on statistical clustering of gene expression data followed by mapping the resulting clusters onto the networks obtained from independent sources [43]. The advantage of this approach is the prioritization of gene clusters based on the number of links to the network. The drawback is that the statistics-derived clusters are inherently artificial and can be connected to multiple networks and cellular processes. In another method, network clustering algorithms, such as superparamagnetic clustering, are used to identify tightly connected sets of nodes. The expression data help to assign weights to the edges and nodes; the combined distance is then computed based on both expression profiles and the network distance between gene products [44]. Other methods include simulated annealing [45] and probabilistic graphical models [46]. Essentially, analysis of molecular data within the context of interaction networks reveals genes that share a similar pattern of expression and at the same time are closely connected on the network (Figure 1d). Another important way of finding putative functional pathways is the comparison of networks derived from different data sources. For example, a heuristic graph comparison algorithm was developed for finding functionally related enzyme clusters (FRECS) across bacterial species [47] and between protein and gene expression networks [48]. Another algorithm allows one to identify common interaction pathways by inter-species alignment of protein interaction networks, for example between the yeast S. cerevisiae and the bacterium H. pylori [49].

Biological implications of network topology

The non-random nature of biological networks is associated with the biological functions of nodes and edges. Recently, several studies in yeast revealed correlations between the topology and composition of a network and important biological properties of nodes [17,18,50,51]. The well-connected hubs (defined here as the top quartile of all nodes in terms of the number of edges) are largely represented by evolutionarily conserved proteins because the interactions impose certain structural constraints on sequence evolution [50]. In both S. cerevisiae and C. elegans, a significant negative correlation was shown between the number of interactions and the relative evolutionary rate [50]. Recently, it was revealed that the number of interactions a protein has correlates positively with its relative importance in yeast [18,51]. Essential and ‘marginally essential’ genes (marginal essentiality being the relative importance of a non-essential gene to a cell) tend to be hubs with a short characteristic path length to their neighbors [51]. Essential proteins tend to be more closely connected to each other. Furthermore, essential proteins tend to be either the more promiscuous transcription factors or target genes that are regulated by fewer transcription factors. Many of these targets are ‘housekeeping’ genes with high expression levels and less expression fluctuation [51]. It was also noted that soluble proteins possess more interactions than membrane proteins [32]. As mentioned above, the links between highly connected and low-connected pairs of proteins define the specific topology of a network. In yeast, the direct links between highly connected hubs are suppressed and the interactions between hubs and low-connected nodes are favored. Such topology probably prevents cross-talk between the functional modules (sub-networks) [17]. These findings may have substantial implications for the practice of drug discovery in terms of target prioritization and identification of multi-gene/multi-protein biomarkers.

Mapping and interpretation of experimental data on the networks

Biological networks are the most suitable tool for functional mining of large, inherently noisy experimental datasets such as microarray and SAGE expression patterns, and proteomic and metabonomic profiles. There is an important distinction between networks and the other methods available for HT data analysis (such as statistical clustering, linking to pathway databases, process ontologies, pathway maps, cross-species comparisons etc.). Unlike other methods, networks’ edges provide primary information about physical connectivity between proteins, their subunits, DNA sequences and compounds. The complete set of interactions, which assembles into networks, defines the potential of a cell to form multi-step pathways, signaling cascades and protein complexes representing the core machinery of cellular life in health and disease. Obviously, only a fraction of all possible interactions is activated under any given condition, as only some of the genes are expressed in a tissue at a time and only a fraction of the cellular protein pool is active. The subset of activated (or repressed) genes and proteins is captured by OMICs experiments, such as global gene expression profiles and proteomics or metabonomics profiles, the functional snapshots of cellular response. Analyzed separately, these datasets cannot explain the whole picture. There are many levels of information flow between a gene and the active protein it encodes, including gene expression, mRNA processing, protein trafficking, posttranslational modifications, folding and assembly into active complexes (Figure 2). Eventually, active proteins perform certain cellular functions (such as a metabolic transformation of malonyl into acetyl-CoA in this example), which can be presented as one-step interactions in the space of thousands of metabolic transformations regulated at multiple levels, from the cell membrane receptors to transcription factors. The ‘intersection’ of the experimental data with the interactions content on the networks (derived from experimental literature) provides the closest possible view of the activated molecular machinery in a cell. As all objects on the networks are annotated, they can be associated with one or more cellular functions, such as apoptosis, DNA repair, cell cycle checkpoints or fatty acid metabolism. The networks can be interpreted in terms of these higher-level processes, and the mechanism of an effect can be unraveled. This is achieved by linking the network objects to Gene Ontology (GO) [52] and other process ontologies, and to metabolic and signaling maps (Figure 2; Table 1). The networks can be scored and prioritized based on statistical ‘relevance’ to the functional processes and maps, or on relative saturation with the uploaded data [53]. The latter can be defined as the proportion of nodes with data (for example, the overexpressed genes) to the total number of nodes on the network, and measured with a z-score. Experimental adjustment can be done by choosing tissue-, disease- and experiment-specific interactions, removing and adding specific interaction mechanisms, linking orthologous genes from other species, etc. A network can also be connected to outside databases and HT data analysis software. The outcome of such systemic analysis can be new hypotheses on the critical bottlenecks in the disease pathways (potential drug targets) or conservative interaction modules supported by HT data (possible biomarkers) (Figure 2).
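One way to turn ‘relative saturation’ into a z-score is sketched below in Python. The text does not give the formula, so this is an assumption: the network is treated as a random draw of n nodes from a database of N nodes, of which R carry experimental data, and the observed number r of data-carrying nodes on the network is compared with its expectation under sampling without replacement. The function name and the numbers are hypothetical.

```python
from math import sqrt

def saturation_z_score(r, n, R, N):
    """z-score of observing r data-carrying nodes among n network nodes.

    Assumed null model: the n nodes are drawn without replacement from N
    database nodes, R of which carry data (hypergeometric mean and variance).
    """
    p = R / N
    mean = n * p
    variance = n * p * (1 - p) * (N - n) / (N - 1)
    return (r - mean) / sqrt(variance)

# Hypothetical numbers: 15 of 50 network nodes are overexpressed genes,
# against 100 overexpressed genes among 1,000 nodes in the whole database.
z = saturation_z_score(r=15, n=50, R=100, N=1000)
print(round(z, 2))
```

Networks can then be ranked by this score, so that a network far more saturated with data than expected by chance is examined first.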

FIGURE 2

General schema of network analysis of ‘omics’ data. Different types of high-throughput experimental data can be linked to the tables of human protein interactions and enzymatic reactions and visualized as dynamic networks. The networks are then scored and prioritized based on the relative relevance of the nodes to functional processes (Gene Ontology, or GO) or static maps of canonical pathways. The software tools can be used for elucidation of network modules and novel pathways as novel hypotheses for therapeutic targets and biomarkers.

[Figure 2 panel labels: Info flow in cell (DNA, mRNA, Protein); OMICs data; Networks analysis; Processes, pathways, maps, outside databases; Blocks and Biomarkers (gene therapy, DNA markers; siRNA, expression markers; small molecules, protein therapy, protein markers).]

Therefore, networks represent a flexible and powerful analytical tool for comparison and cross-validation of different types of datasets associated with a condition (such as a disease or a drug treatment). In fact, any experimental or literature-derived datasets with recognizable gene or protein identities (such as LocusLink, Unigene, SwissProt, RefSeq, OMIM) can be visualized, mapped and compared against each other on the same network. For example, one can directly compare the list of genes determined from genetic analyses with data from gene expression arrays from a patient in clinical trials or a knockout mouse. When the same data type and experimental platform are used, the conditional networks can be compared in great detail for common and different sub-networks and patterns. Such fine mapping can be performed to compare tissue- and cell-type-specific responses, different time points, drug dosages, different patients from the same cohort, etc. For instance, we have compared SAGE gene expression patterns from mammary gland duct epithelium of two breast cancer patients, one at the pre-invasive DCIS stage, the other with invasive cancer (original data from [54]). Both datasets were used for building the initial networks, and then visualized separately. One of the top-scoring networks included the major cell proliferation activator, the oncogene c-Myc [55] (Figure 3). One can see that the expression pattern for invasive cancer (Figure 3b) features many more upregulated genes in the immediate vicinity of c-Myc. Modern integrated network analytical suites are well equipped with a range of tools and algorithms for such analyses (Table 1).

Applications of network analysis indrug discoveryNetworks analysis is broadly applicablethroughout the drug discovery and devel-opment pipeline, both on the biology andthe chemistry side. Any type of data whichcan be linked to a gene, a protein or a com-pound, can be recognized by the inputparsers, visualized and analyzed on the net-works. Therefore, almost any pre-clinicalHT experiment, as well as patient DNA ormetabolic tests from clinical trials (Figure 4)can be included in network analyses. Mostimportantly, all these different datasets canbe processed on the same network back-bone [56]. Therefore, networks representthe universal platform for data integration

REVIEWS

FIGURE 3

The patterns of gene expression in mammary gland epithelium from a non-invasive (a) and invasive(b) breast cancer mapped onto the same network. Gene expression was measured by the SAGE methodThe network was built using the ‘shortest path’ algorithm in a standard version of MetaCore (GeneGo, Inc.).The objects marked with large red circles indicate gene overexpression.

FIGURE 4

Network analysis in the drug discovery pipeline. Biological high-throughput data from different stages of pre-clinical discovery and clinical trials can be uploaded, used for building networks, and analyzed. Human metabolites for novel small molecule compounds can be predicted in silico or uploaded directly. All data types can be visualized on the same networks.


[Figure 4 diagram labels: Pre-clinical discovery (animal/human biology; target identification, validation; toxicogenomics, assays data; lead optimization); Clinical trials (patients genotyping, metabolites from body fluids); HT data; statistical significance; IDs alignment; genes with expression data, protein abundance; compounds; similarity search; metabolite rules, QSAR models; metabolites from novel compounds; drug metabolizing enzymes.]




and analysis, which has always been the major objective of bioinformatics technology. Network analysis of complex human diseases is a very young area. In one recent study, networks automatically generated from literature interactions were applied to the elucidation of specific modules around the genes involved in Alzheimer disease, and a scoring procedure for disease-relevant protein nodes was developed [57]. Some applications of network analysis in drug discovery are listed in Box 1.
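The claim that different datasets can share one network backbone rests on aligning their identifiers to common nodes. A minimal Python sketch of that step follows; the lookup table, the identifiers (`SP_0001`, `RS_0001` and so on), the node names and the function names are all placeholders invented for illustration, not real database entries or any vendor's parser.

```python
from collections import defaultdict

# Hypothetical lookup table: (identifier type, identifier) -> network node.
ID_TO_NODE = {
    ("swissprot", "SP_0001"): "GENE_A",
    ("refseq", "RS_0001"): "GENE_A",
    ("locuslink", "LL_0002"): "GENE_B",
}


def align(records, id_to_node):
    """Map (id_type, identifier, value) records to nodes; collect unmapped identifiers.

    If two records of one dataset hit the same node, the later value wins.
    """
    mapped, unmapped = {}, []
    for id_type, identifier, value in records:
        node = id_to_node.get((id_type.lower(), identifier))
        if node is None:
            unmapped.append((id_type, identifier))
        else:
            mapped[node] = value
    return mapped, unmapped


def overlay(datasets, id_to_node):
    """Place several datasets on the same nodes: node -> {dataset name: value}."""
    network = defaultdict(dict)
    for name, records in datasets.items():
        mapped, _ = align(records, id_to_node)
        for node, value in mapped.items():
            network[node][name] = value
    return dict(network)


# Made-up measurements keyed by different identifier systems.
DATASETS = {
    "expression": [("RefSeq", "RS_0001", 2.4), ("LocusLink", "LL_0002", 0.5)],
    "proteomics": [("SwissProt", "SP_0001", 1.8), ("SwissProt", "SP_9999", 3.0)],
}
combined = overlay(DATASETS, ID_TO_NODE)
```

After alignment, the expression and proteomics values for `GENE_A` sit on the same node even though they arrived under different identifier systems, while the unrecognized `SP_9999` is reported rather than silently dropped.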

Now, we will consider three straightforward applications of network analysis in more detail. In the first case, PathwayAssist (Ariadne Genomics, Inc.) was used for generating hypotheses for novel therapeutic targets for ovarian cancer, the most lethal type of gynecologic cancer in the Western world. A list of 1191 differentially expressed genes, including ones involved in cell growth, differentiation, adhesion, apoptosis and migration, was identified by profiling 37 advanced-stage papillary serous primary carcinomas [58]. Microarray data were imported into PathwayAssist, and a signaling pathway associated with tumor cell migration, spread and invasion was identified as being activated in advanced ovarian cancer (Figure 5a). New pathway hypotheses generated in this study may be useful for elucidation of novel multi-component disease markers and therapeutic targets (Ariadne Genomics, privileged communication). In the second case, network analysis in MetaCore was used for generation of small ‘signature networks’ characteristic of the microarray gene expression response of the breast cancer cell line MCF-7 to treatment with estrogen and tamoxifen [59]. These small modules consisted of two major hubs, transcription factors uniquely linked to a dozen cell cycle genes. Signature networks were shown to be conserved between studies on different microarray platforms and were, therefore, offered as condition-specific multi-component biomarkers [59]. In the third example (Figure 5b), we used networks for evaluation of the toxicity and human metabolism of acetaminophen (APAP). The structure was processed in MetaDrug [60] using metabolic cleavage rules and models, and the resulting metabolites were displayed on the networks connected with the metabolizing enzymes. On the same network, we displayed microarray gene expression data from the livers of rats intoxicated with a high dose of APAP [61]. The resulting networks can be used as a tool for elucidation of the affected signaling and metabolic pathways.
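The notion of a signature network conserved between studies can be made concrete with a short Python sketch. It assumes, purely for illustration, that a hub-target edge counts as supported when both ends are differentially expressed in a study, and that the signature is the set of edges supported in every study; this criterion, the hub names `TF1` and `TF2`, the gene labels and the function names are hypothetical and are not the method of [59].

```python
# Hypothetical hub-to-target links (TF1 and TF2 stand in for two hub transcription factors).
HUB_TARGETS = {
    "TF1": {"G1", "G2", "G3"},
    "TF2": {"G3", "G4"},
}

# Made-up lists of differentially expressed genes from two microarray platforms.
STUDY_A = {"TF1", "TF2", "G1", "G2", "G3"}
STUDY_B = {"TF1", "TF2", "G1", "G3", "G4"}


def supported_edges(hub_targets, changed):
    """Hub-target edges whose two ends are both differentially expressed."""
    return {
        (hub, target)
        for hub, targets in hub_targets.items()
        if hub in changed
        for target in targets
        if target in changed
    }


def signature_network(hub_targets, studies):
    """Edges supported in every study: a crude proxy for a conserved signature."""
    edge_sets = [supported_edges(hub_targets, changed) for changed in studies]
    return set.intersection(*edge_sets) if edge_sets else set()


signature = signature_network(HUB_TARGETS, [STUDY_A, STUDY_B])
```

Edges seen on only one platform (`TF1`–`G2`, `TF2`–`G4`) fall out of the signature, leaving the part of the module that is reproducible across studies.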

Future development

Analysis of biological networks is a young field and most of its applications are still in their infancy, although we expect major developments in this area within the next several years. Qualitative network analysis will be enriched with quantitative modeling and simulation methods. Semi-quantitative techniques such as flux analysis [62] and extreme pathway analysis [63] are already being applied to the reconstruction of signaling and metabolic networks. There are also large computational frameworks such as ‘Virtual Cell’ [64] that attempt to run simulations at a simplified ‘whole cell’ level. In addition, network analysis will be substantially scaled up to accommodate large sets of disease-related molecular data, such as gene expression profiles of hundreds or thousands of patients, and the combination of different types of data within network analysis will become routine. At present, few research groups combine different types of HT techniques to study particular conditions. These methods query different levels of cell organization (e.g. gene expression, protein abundance, metabolite concentration), and networks will provide a framework for the efficient concurrent analysis of these data. Finally, repositories of higher-level knowledge for particular human diseases will be created. As large molecular datasets are processed with the help of network analysis, a growing set of reconstructed pathways and networks will emerge. These will be highly specific ‘signature networks’ for particular conditions, for example, a disease subtype, treatment response or toxin action.


BOX 1

Applications of network analysis in drug discovery

• Target identification. Experimental data from model organisms, cell lines and human tissues can be uploaded and mapped on networks. New hypotheses can be made on the pathways connecting the proteins of interest

• Target validation and prioritization. Data cross-referencing on the same networks, maps and pathways

• Disease biomarkers. The biomarkers can be identified as ‘signature networks’ – condition-specific conserved sets of nodes supported by differential gene expression and protein abundance data

• Toxicity biomarkers. Same as above, with signature networks derived from toxicogenomics data – typically rat or mouse liver arrays from drug-treated animals

• Pharmacogenomics/haplotyping. The network modules can be used as a means of haplotyping SNPs associated with a particular condition

• Lead optimization and selection of drug candidates. New compounds and their metabolites from pre-clinical studies can be mapped on tissue- and disease-specific metabolic and regulatory networks via structure similarity searches with metabolites and ligands included in the database

• Clinical studies. Patient data (specific DNA sequences, expression microarrays, metabolites from body fluids) can be mapped onto networks and compared with pre-clinical data and published experiments

• New indications for marketed drugs. Secondary indications are an important part of follow-up development for bioactive compounds. New therapeutic areas can be suggested by analysis of tissue-specific, disease-specific networks from animals and humans treated with the drug

• Post-market monitoring. Patient data (usually metabolites from body fluids) can be stored in a database and monitored using networks built during clinical and pre-clinical studies




1 Nicholson, J.K. and Wilson, I.D. (2003) Understanding ‘global’ systems biology: metabonomics and the continuum of metabolism. Nat. Rev. Drug Discov. 2, 668–676

2 Kitano, H. (2002) Computational systems biology. Nature 420, 206–210

3 Kitano, H. (2002) Systems biology: a brief overview. Science 295, 1662–1664

4 Hartwell, L.H. et al. (1999) From molecular to modular cell biology. Nature 402(Suppl.), C47–C50

5 Murray, A.W. et al. (2000) Protein function in the post-genomic era. Nature 405, 823–826

6 Hood, L. and Perlmutter, R.M. (2004) The impact of systems approaches on biological problems in drug discovery. Nat. Biotechnol. 22, 1218–1219

7 Chaussabel, D. and Sher, A. (2002) Mining microarray expression data by literature profiling. Genome Biol. 3, RESEARCH0055

8 Chen, H. and Sharp, B.M. (2004) Content-rich biological network constructed by mining. BMC Bioinformatics 5, 147

9 Chaussabel, D. (2004) Biomedical literature mining: challenges and solutions in the ‘omics’ era. Am. J. Pharmacogenomics 4, 383–393

10 Blaschke, K. and Valencia, A. (2001) Can bibliographical pointers for known biological data be found automatically? Protein interactions as a case study. Comp. Func. Genom. 2, 196–206

11 Santos, C. et al. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics (in press)

12 Daraselia, N. et al. (2004) Extracting protein function information from MEDLINE using a full-sentence parser. In Proc. 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 11–18, ECML/PKDD 2004 Committee (available online at www.informatik.hu-berlin.de/Forschung_Lehre/wm/ws04/#proceedings)

13 Chen, T.L. et al. (1991) The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc. Natl. Acad. Sci. U. S. A. 88, 9578–9582

14 Ito, T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. U. S. A. 98, 4569–4574

15 Giot, L. et al. (2003) A protein interaction map of Drosophila melanogaster. Science 302, 1727–1736

16 Li, S. et al. (2004) A map of the interactome network of the metazoan C. elegans. Science 303, 540–543

17 Doniger, S.W. et al. (2003) MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 4, R7

18 Jeong, H. et al. (2001) Lethality and centrality in protein networks. Nature 411, 41–42

19 Lehner, B. and Fraser, A.G. (2004) A first-draft human protein-interaction map. Genome Biol. 5, R63

20 Peri, S. et al. (2004) Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 32 (Database issue), D497–501

21 Pagel, P. et al. The MIPS mammalian protein–protein interaction database. Bioinformatics (in press)

22 Hughes, T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 109–126

23 van Noort, V. et al. (2003) Predicting gene function by conserved co-expression. Trends Genet. 19, 238–242

24 Kemmeren, P. et al. (2002) Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell 9, 1133–1143

25 Deane, C.M. et al. (2002) Protein interactions: two methods for assessment of the reliability of high-throughput observations. Mol. Cell. Proteomics 1, 349–356

26 Ho, Y. et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183

27 Gavin, A-C. et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147

28 Navarro, J.D. and Pandey, A. (2004) Unraveling the human interactome: lessons from the yeast. Drug Discov. Today 3, 79–84

29 Salwinski, L. and Eisenberg, D. (2003) Computational methods of analysis of protein–protein interactions. Curr. Opin. Struct. Biol. 13, 377–382

30 von Mering, C. et al. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399–403

31 Bader, J.S. et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol. 22, 78–85


FIGURE 5

Applications of network analysis in drug development. (a) Schematic representation of potential signaling pathways from the proteins involved in ovarian cancer as described in [58] and recreated in the latest version of PathwayAssist (Ariadne Genomics, Inc.). Red represents the genes that are upregulated in cancer compared with normal ovarian epithelium, green indicates the genes that are downregulated in cancer specimens, and gray shaded symbols represent genes that did not show a significant difference between cancer and normal specimens. Expression values are taken from [58]. (b) Prediction of potential human metabolism and toxicity for novel compounds. APAP metabolites (pink hexagons) and their interactions with enzymes and signaling proteins as predicted and visualized by MetaDrug™ (GeneGo, Inc.) [60]. The metabolites will interact with (be processed by) several cytochrome P450 enzymes and transferases. Metabolites also interact with PXR/RXR, transcriptional regulators of cytochromes. Microarray expression data from APAP-treated rat livers [61] are superimposed on the network. The results indicate downregulation of CYP3A4 (marked with blue circle, left) and upregulation of some genes linked to the HERG channel (red circles, right). The observed downregulation of CYP3A4 provides independent corroboration of interactions between APAP metabolites and PXR/RXR [60].



References




32 Yu, H. et al. (2004) TopNet: a tool for comparing biological sub-networks, correlating protein properties with topological statistics. Nucleic Acids Res. 32, 328–337

33 Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5, 101–113

34 Albert, R. et al. (2000) Error and attack tolerance of complex networks. Nature 406, 378–382

35 Ihmels, J. et al. (2004) Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat. Biotechnol. 22, 86–92

36 Li, S. et al. (2004) A map of the interactome network of the metazoan C. elegans. Science 303, 540–543

37 Spirin, V. and Mirny, L.A. (2003) Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. U. S. A. 100, 12123–12128

38 Rives, A.W. and Galitski, T. (2003) Modular organization of cellular networks. Proc. Natl. Acad. Sci. U. S. A. 100, 1128–1133

39 Pereira-Leal, J.B. et al. (2004) Detection of functional modules from protein interaction networks. Proteins 54, 49–57

40 Milo, R. et al. (2002) Network motifs: simple building blocks of complex networks. Science 298, 824–827

41 Shen-Orr, S.S. et al. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68

42 Wuchty, S. et al. (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat. Genet. 35, 176–179

43 Tornow, S. and Mewes, H.W. (2003) Functional modules by relating protein interaction networks and gene expression. Nucleic Acids Res. 31, 6283–6289

44 Hanisch, D. et al. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics 18(Suppl. 1), S145–S154

45 Ideker, T. et al. (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18(Suppl. 1), S233–S240

46 Segal, E. et al. (2003) Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19(Suppl. 1), I264–I272

47 Ogata, H. et al. (2000) Heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res. 28, 4021–4028

48 Nakaya, A. et al. (2001) Extraction of correlated gene clusters by multiple graph comparison. Genome Inform. Ser. Workshop Genome Inform. 12, 44–53

49 Kelley, B.P. et al. (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. U. S. A. 100, 11394–11399

50 Fraser, H.B. et al. (2002) Evolutionary rate in the protein interaction network. Science 296, 750–752

51 Yu, H. et al. (2004) Genomic analysis of essentiality within protein networks. Trends Genet. 20, 227–231

52 The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29

53 Nikolsky, Y. et al. (2005) Systems level network analysis of OMICs data. Genet. Eng. News 25, 28–29

54 Allinen, M. et al. (2004) Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell 6, 17–32

55 Nikolsky, Y. (2004) System-level analysis of SAGE and other high-throughput data in the context of functional networks. In Proceedings SAGE 2004 Conference, p. 36, SAGE

56 Ekins, S. et al. Systems biology: applications in drug discovery. In Drug Discovery Handbook (Gad, S., ed.), Wiley (in press)

57 Krauthammer, M. et al. (2004) Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer’s disease. Proc. Natl. Acad. Sci. U. S. A. 101, 15148–15153

58 Donninger, H. et al. (2004) Whole genome expression profiling of advance stage papillary serous ovarian cancer reveals activated pathways. Oncogene 23, 8065–8077

59 Nikolsky, Y. et al. A novel method for generation of signature networks as biomarkers from complex high-throughput data. Tox. Letters (in press)

60 Ekins, S. et al. A novel method for visualizing nuclear hormone receptor networks relevant to drug metabolism. Drug Metab. Dispos. (in press)

61 Huang, Q. et al. (2004) Gene expression profiling reveals multiple toxicity endpoints induced by hepatotoxicants. Mutat. Res. 549, 147–168

62 Vo, T.D. et al. (2004) Reconstruction and functional characterization of the human mitochondrial metabolic network based on proteomic and biochemical data. J. Biol. Chem. 279, 39532–39540

63 Papin, J.A. and Palsson, B.O. (2004) The JAK-STAT signaling network in the human B-cell: an extreme signaling pathway analysis. Biophys. J. 87, 37–46

64 Slepchenko, B.M. et al. (2003) Quantitative cell biology with the Virtual Cell. Trends Cell Biol. 13, 570–576

65 Bader, G.D. et al. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31, 248–250

66 Salwinski, L. et al. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res. 32, D449–451

67 Zanzoni, A. et al. (2002) MINT: a molecular interaction database. FEBS Lett. 513, 135–140

68 Mewes, H.W. et al. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 30, 31–34

69 von Mering, C. et al. (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31, 258–261

70 Kanehisa, M.A. et al. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res. 30, 42–46


Related articles in other Elsevier journals

Global properties of biological networks
Grigorov, M.G. (2005) Drug Discov. Today 10, 365–372

Comparison of network-based pathway analysis methods
Papin, J.A. et al. (2004) Trends Biotechnol. 22, 400–405

Reconstruction of microbial transcriptional regulatory networks
Herrgard, M.J. et al. (2004) Curr. Opin. Biotechnol. 15, 70–77

Building with a scaffold: emerging strategies for high- to low-level cellular modeling
Ideker, T. and Lauffenburger, D. (2003) Trends Biotechnol. 21, 255–262
