investigations into the evolution of biological networks189160/fulltext01.pdf · investigations...

49
Investigations into the evolution of biological networks SARA LIGHT Stockholm Bioinformatics Center Department of Biochemistry and Biophysics Stockholm University Sweden Stockholm 2006

Upload: others

Post on 25-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

Investigations into the evolution of

biological networks

SARA LIGHT

Stockholm Bioinformatics Center

Department of Biochemistry and BiophysicsStockholm University

Sweden

Stockholm 2006

Page 2: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

Sara LightStockholm Bioinformatics CenterStockholm UniversityAlbanova University CenterSE-106 91 StockholmSwedenE-mail: [email protected]: +46-(0)8-5537 8562Fax: +46-(0)8-5537 8214

Doctoral Thesisc©Sara Light, 2006ISBN 91-7155-273-1Printed by Universitetsservice US AB, Stockholm

2

Page 3: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

To Maria, Mom, Dad and Bill.

Page 4: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

Investigations into the evolution of biological networks

Sara Light, [email protected] Bioinformatics CenterStockholm University, SE-106 91 Stockholm, Sweden

Abstract

Individual proteins, and small collections of proteins, have been extensively studiedfor at least two hundred years. Today, more than 350 genomes have been completelysequenced and the proteomes of these genomes have been at least partially mapped.The inventory of protein coding genes is the first step toward understanding thecellular machinery. Recent studies have generated a comprehensive data set for thephysical interactions between the proteins of Saccharomyces cerevisiae, in additionto some less extensive proteome interaction maps of higher eukaryotes. Hence, itis now becoming feasible to investigate important questions regarding the evolutionof protein-protein networks. For instance, what is the evolutionary relationshipbetween proteins that interact, directly or indirectly? Do interacting proteins co-evolve? Are they often derived from each other? In order to perform such proteome-wide investigations, a top-down view is necessary. This is provided by network (orgraph) theory.

The proteins of the cell may be viewed as a community of individual moleculeswhich together form a society of proteins (nodes), a network, where the proteinshave various kinds of relationships (edges) to each other. There are several differ-ent types of protein networks, for instance the two networks studied here, namelymetabolic networks and protein-protein interaction networks. The metabolic net-work is a representation of metabolism, which is defined as the sum of the reactionsthat take place inside the cell. These reactions often occur through the catalyticactivity of enzymes, representing the nodes, connected to each other through sub-strate/product edges. The indirect interactions of metabolic enzymes are clearlydifferent in nature from the direct physical interactions, which are fundamental tomost biological processes, which constitute the edges in protein-protein interactionnetworks.

This thesis describes three investigations into the evolution of metabolic andprotein-protein interaction networks. We present a comparative study of the im-portance of retrograde evolution, the scenario that pathways assemble backwardcompared to the direction of the pathway, and patchwork evolution, where enzymesevolve from a broad to narrow substrate specificity. Shifting focus toward networktopology, a suggested mechanism for the evolution of biological networks, preferentialattachment, is investigated in the context of metabolism. Early in the investigationof biological networks it seemed clear that the networks often display a particu-lar, ’scale-free’, topology. This topology is characterized by many nodes with fewinteraction partners and a few nodes with a large number of interaction partners

Page 5: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

(hubs). While the second paper describes the evidence for preferential attachmentin metabolic networks, the final paper describes the characteristics of the hubs inthe physical interaction network of S. cerevisiae.

Page 6: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

CONTENTS

Contents

1 Publications 21.1 Publications included in this thesis . . . . . . . . . . . . . . . . . . . 21.2 Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Introduction 3

3 The evolution of biological networks 53.1 Metabolic networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1.1 Defining metabolic networks . . . . . . . . . . . . . . . . . . . 73.1.2 The evolution of metabolism . . . . . . . . . . . . . . . . . . . 8

3.2 Protein-protein interaction networks . . . . . . . . . . . . . . . . . . . 113.2.1 Context-based protein-protein interaction prediction . . . . . . 113.2.2 Experimental determination of individual protein-protein in-

teractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2.3 Experimental determination of protein complexes . . . . . . . 133.2.4 The need for validation of protein-protein interactions . . . . . 133.2.5 Toward the human protein-protein interaction network . . . . 143.2.6 The evolution of protein-protein interactions . . . . . . . . . . 15

3.3 On the structure of biological networks . . . . . . . . . . . . . . . . . 163.3.1 What is the advantage of scale-freeness? . . . . . . . . . . . . 163.3.2 How do biological scale-free networks evolve? . . . . . . . . . . 183.3.3 The relationship between protein-protein interaction network

topology and evolution . . . . . . . . . . . . . . . . . . . . . . 183.3.4 Is scale-freeness of biological importance? . . . . . . . . . . . . 19

4 Techniques and databases 214.1 Representation of the metabolic network . . . . . . . . . . . . . . . . 214.2 Representation of the protein-protein interaction network . . . . . . . 224.3 Domain assignment and protein age . . . . . . . . . . . . . . . . . . . 234.4 Statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 The present investigations 255.1 Retrograde vs Patchwork evolution (Paper I) . . . . . . . . . . . . . . 265.2 Preferential attachment in metabolic networks (Paper II) . . . . . . . 275.3 Domain distance as an alternative homology measure (Paper III) . . . 285.4 The characteristics of hubs in the protein-protein interaction network

of Saccharomyces cerevisiae (Paper IV) . . . . . . . . . . . . . . . . . 29

6 Conclusions and reflections 31

7 Acknowledgements 33

1

Page 7: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

1 PUBLICATIONS

1 Publications

1.1 Publications included in this thesis

Paper I

Sara Light and Per KraulisNetwork analysis of metabolic enzyme evolution in Escherichia coli.BMC Bioinformatics, 5:15, 2004.

Paper II

Sara Light, Per Kraulis and Arne ElofssonPreferential attachment in the evolution of metabolic networks.BMC Genomics, 6:159, 2005e1.

Paper III

Asa K. Bjorklund†1, Diana Ekman†1, Sara Light, Johannes Frey-Skott and ArneElofsson† These authors contributed equally.Domain rearrangements in protein evolution.Journal of Molecular Biology, 353, 911-923, 2005.

Paper IV

Diana Ekman†1, Sara Light†1, Asa K. Bjorklund and Arne Elofsson† These authors contributed equally.What properties characterize the hub proteins of the protein-protein interactionnetwork of Saccharomyces cerevisiae?Submitted, 2006

1.2 Other publications

Christian Roth, Shruti Rastogi, Lars Arvestad, Katharina Dittmar, Sara Light, Di-ana Ekman and David A. LiberlesEvolution after gene duplication: Models, mechanisms, sequences, systems and or-ganismsSubmitted, 2006

Paper I-III reproduced with permission.

2

Page 8: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

2 INTRODUCTION

2 Introduction

Since the first bacterial genome, Haemophilus influenzae, was sequenced in 1995,over 350 genomes have been completely sequenced. Currently, most of the sequencedgenomes belong to the eubacterial domain of life, but the fraction of sequenced eu-karyotic genomes is expected to grow in the next few years. Knowing the sequence inwhich the nucleotides appear in genomes, and finding the genes encoded therein, arenecessary stepping stones towards a greater understanding of the cellular machineryand biological diversity. For instance, until the human genome was sequenced, thenumber of human genes was expected to be roughly 100,000, but was shown to beless than 30,000 (Consortium., 2001; Venter & et al, 2001). Hence, the number ofhuman genes is around 5 times that of the pathogenic proteobacterium Pseudomonas

aeruginosa, which is perhaps surprising considering the difference in complexity, andsize alone, between these organisms. However, a relatively small difference in thenumber of genes can result in a great number of variations of the co-expression pat-terns for each gene combination. Consequently, it is now generally believed that thecomplexity of multicellular organisms probably originates in part from the combi-natorial interactions between proteins (Claverie, 2001). In addition, multi-domainproteins are more common in eukaryotes (Apic et al., 2001b; Ekman et al., 2005;Gerstein & Levitt, 1998; Liu & Rost, 2004), which may, through a wider range offunctional combinations of domains, result in greater complexity. Furthermore, al-ternative splicing of eukaryotic transcripts is likely to increase the protein repertoryof eukaryotes threefold and, finally, the transcription machinery is more complex ineukaryotes compared to prokaryotes.

Life on Earth originated around 3.8 billion years ago. It is likely that the firstorganisms were prokaryotes, i.e. unicellular organisms without well defined inter-nal compartments (organelles). The diversity of life is derived through evolution;the process of amplification and mutation of genetic material, followed by varyingproportions of chance and natural selection. Alterations of the genome sequenceoccur, for example, as a result of exposure to detrimental agents (toxins, radiation,viruses) and consist of, for instance, point mutations as well as insertions or dele-tions of nucleotides, in addition to larger chromosomal events. These mechanismsalter the genetic material, sometimes increasing the likelihood of survival for its hostorganism (positive selection), but more often decreasing it (negative selection) orhaving negligible impact (neutral evolution). The changes on the genomic sequencecan lead to changes on the cellular networks, which may have consequences on thetissue or organism level.

An important source of new genetic material is gene duplication (Ohno, 1970;Lynch & Conery, 2000). Indeed, estimates show that 30-60% of the genes in eu-karyotic genomes are duplicates (Ball & Cherry, 2001). Genes that are derived fromgene duplications are called paralogs, see Figure 1. When a gene is duplicated,one gene still performs its original function, while the duplicate may be somewhataltered and eventually lost from the genome, through deactivating mutations. In

3

Page 9: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

2 INTRODUCTION

Z X

XA XB XC

A B C A B C

LCA

Z1B

Z1C

Z2C

Z3CZ2BZA

Figure 1: Paralogs and orthologs. Paralogs, pairs of proteins which have diverged through dupli-cation, are often contrasted to orthologs, which have diverged through speciation. Here, A, B andC represent three species, while X and Z are two genes, which have diverged after duplication.Hence, the two genes are paralogs. Further, two branches, representing the evolution of each oneof these genes, respectively, are shown. The X-branch shows an example of genes which are allorthologous to each other. In the Z-branch, a similar scenario has occurred, with additional lineagespecific duplication events in species B and C. Z1B and Z2B are therefore inparalogs with respectto the radiation of A from B. Furthermore, Z1B and Z2B are both orthologous to ZA.

some instances, however, such as when the increased copy number conveys a selec-tive advantage to the organism, the duplicate may be retained. Furthermore, byfixing beneficial mutations, one of the genes may acquire a new functionality (ne-ofunctionalization) (Ohno, 1970), or, alternatively, the genes may evolve throughcomplementary degenerative mutations towards two genes whose functions, takentogether, perform the function of the original gene (subfunctionalization) (Forceet al., 1999).

A rare form of duplication is whole genome duplication, which is believed tohave taken place at least once during the evolution of Saccharomyces cerevisiae

(Kellis et al., 2004), and could be an important source of new functional subnetworks(modules) in the eukaryotic cell (Conant & Wolfe, 2006). In prokaryotes, transferof genetic material between unrelated organisms is an important source of novelgenetic material. Although the extent of the evolutionary impact of horizontal

4

Page 10: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

gene transfer (HGT) is still under debate (Kurland et al., 2003; Snel et al., 2002;Lawrence & Ochman, 1997), it is probably an important evolutionary process inmicrobial species (Boucher et al., 2003).

Proteins, the gene products, are the building blocks of the cell and performcertain functions one by one, as in the instance of some enzymes, or together withother proteins. In yeast there are around 6,000 proteins which often act together,either in small groups or in large complexes. Much like the individual human beingsin this world are, for some purposes, viewed collectively as a society, a network ofpeople, the proteins of the cell, often referred to as the proteome, can be seen as acommunity of individual molecules which together form a network. Certainly, somehuman behaviours cannot be properly understood outside the context of humansociety, and similarly, some properties of proteins may not make sense unless viewedfrom the top down. Therefore, a network perspective of the proteome is an importanttool, which may enable another level of understanding of living organisms. Thisthesis describes investigations into the evolution of metabolic and protein-proteininteraction networks.

3 The evolution of biological networks

A network, or graph, consists of objects called nodes (vertices) connected by edges(links). Frequently, a graph is depicted in diagrammatic form as a set of nodes,joined by edges, see Figure 2. In Papers I and II, the nodes of the metabolic networkare enzymes connected to each other through compounds, while the nodes in theprotein-protein interaction network (PPIN) in Paper IV represent the proteins inthe network. In the latter network, the edges are the physical interactions betweenthe proteins.

3.1 Metabolic networks

Metabolism, in its broadest sense, is the sum of the reactions which take place insidethe cell. It includes the processes by which cells take up and convert compoundsfrom precursors to complex compounds, anabolism, and the processes by whichcompounds are degraded, catabolism. The most common nutrients are probablycarbohydrates, e.g. sucrose. The conversion of sugars to complex compounds is fa-cilitated by a set of proteins, enzymes, which accelerate the rates of most reactionsin the cell. Almost all living organisms contain the machinery to expediently de-grade sugars, produce high energy compounds, such as adenosinetriphosphase (ATP)and reduced nicotinamide adenine dinucleotide (NADH) and produce precursors forcomplex molecules, such as proteins.

Enzymes, which are almost always proteins, catalyze many reactions in the cell,without being altered themselves through the process. Generally, the enzyme formsan enzyme-substrate complex with the compound, often referred to as the substrate,

5

Page 11: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

Connectivity Clustering coefficient

c=1

c=0

c=1

c=0

c=0

k=2

k=2

k=2

k=3

k=3

k=2

a) b)

k=6

c=0.67

Hub Hubk=6 c=0.2

Figure 2: Network parameters. a) The connectivity (k) of a node is defined as the number of nodesit is connected to through an edge. Here, the edges are undirected, although directed graphs arealso common. The highest connected node in the network is the hub. b) The clustering coefficient(c) is a measure of the interconnectivity of the neighbors of a node. For instance, the hub in thisnetwork is connected to five nodes, which could form 10 pairs of nodes among each other, but thereare only 2 such pairs. The clustering coefficient is defined as the number of actual neighbor pairsdivided by the maximum number of pairs (here c=0.2).

and optimizes the environment surrounding the substrate, thus drastically increasingthe probability that a reaction will take place and the substrate be transformed toproduct. In one of the most cited databases for chemical reactions, LIGAND (Gotoet al., 2002), there are currently more than 7,000 biological reactions, involving over15,000 different compounds, most of which are catalyzed by enzymes. The enzymesmay be classified into six primary classes according to the type of reaction whichthey catalyze;

1. Oxidoreductases; Transfer of electrons; e.g. alcohol dehydrogenase.

2. Transferases; Transfer of functional groups; e.g.citrate synthase.

3. Hydrolases; using or producing water; e.g. glucose-6-phosphatase.

4. Lyases; cleavage of C-C, C-O, C-N bonds; e.g. fructose-bisphosphate aldolase.

5. Isomerases; transfer of groups within a molecule; e.g. triose-phosphate iso-merase.

6. Ligases; NTP-coupled bond formation; e.g acetate-CoA ligase.

6

Page 12: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

The enzymes may be classified further into the Enzyme Commission (EC) num-bers. For instance, the enzyme alcohol dehydrogenase, which catalyzes the reactionwhere an alcohol is oxidized to an aldehyde or ketone using NADH as an electronacceptor, is classified as an oxidoreductase because its reaction is an oxidation, andfurther, it is classified as EC 1.1 since it acts on the CH-OH bond, and subsequentlyas EC 1.1.1 because NADH (or NADPH) acts as the electron acceptor. Finally,the enzyme is classified as EC 1.1.1.1 because it is acting on primary or secondaryalcohols and hemiacetals. There is considerable variation in the specificity of en-zymes. Some enzymes are quite specific and may act on one compound only. Inaddition, the specificity of a certain enzyme often varies between species. Again,alcohol dehydrogenase serves as an example since animal alcohol dehydrogenasesalso use cyclic alcohols as substrates, while the yeast variant does not.

Naturally, not all enzymes are present in all organisms. For instance, animalslack most of the enzymes necessary for alkaloid synthesis which are present in someplants, and many intracellular prokaryotes lack the enzymes necessary for amino acidsynthesis. In recent years, it has become clear that there is substantial variation inthe architecture of metabolism in different organisms both between and within thedomains of life. For example, the citric-acid cycle, thought to be one of the oldestparts of metabolism, seems to be absent or incomplete in most of the organismsstudied (Huynen et al., 1999). In addition, although orthologs, see Figure 1, ofhalf of the enzymes in Escherichia coli are present in all three domains of life,there is considerable variation in the groupings of enzymes into metabolic pathways(Peregrin-Alvarez et al., 2003).

3.1.1 Defining metabolic networks

The whole metabolic network of E. coli consists of all the metabolic reactions in E.

coli, such as the reactions in the biotin metabolism, the glycolysis and the aminoacid metabolism. Traditionally, metabolic networks have been drawn with enzymesas edges, see Figure 3, but with a network view, the reverse representation is oftenmore intuitive. In the latter representation, the connections between the enzymenodes in metabolic networks are the compounds involved in the reactions. If anenzyme E1 catalyzes a reaction where a product P is produced, and P is used as asubstrate by an enzyme E2, the two enzymes are neighbors in the network. Althoughpathways are biologically meaningful partitions of the full metabolic network, thepartitioning of the metabolic network into pathways is not always straightforward(Schuster et al., 2000). Therefore, there could be correlations that are not visiblefrom a pathway perspective which will emerge from a whole-network oriented view.

There is also an element of arbitrariness involved in which compounds are con-sidered promiscuous (compounds involved in many reactions, e.g. ATP) and whichare not. A similar element of arbitrariness is involved in the determination of whichcompounds are cofactors (e.g. metal ions or NAD/NADH) and which compoundsare the main substrates or products of the reactions. The promiscuous compounds

7

Page 13: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

8−amino−7−oxononanoatePimeloyl−CoA

S−adenosyl−L−methionine

S−adenosyl−4−methylthio−2−oxobutanoate

7,8−diaminononanoateDethiobiotin 6.3.3.3

2.6.1.62

Sulfur donor

Biotin

2.8.1.6

PiADP

CO2ATP

2.3.1.47

Coenzyme ACO2

L−alanine

Figure 3: Common representation of the biotin metabolism. Here, the main reactants represent thenodes of the network and the enzymes represent the edges. This drawing of the biotin metabolismwas redrawn from EcoCyc (Karp et al., 2002).

pose a potential problem to metabolic network studies since these compounds partic-ipate in a large number of reactions, and their exclusion and inclusion could changethe properties, and relevance, of the network representation considerably. There arenumerous ways to approach the problem of compound promiscuity, one of which is tosimply remove compounds which are generally considered non-essential in a certainreaction, and another is to count the number of reactions a compound participatesin within the complete metabolic network of E. coli and exclude those which aremost promiscuous from the network.

3.1.2 The evolution of metabolism

In 1945 one of the first theories regarding the evolution of metabolic pathways, oftenreferred to as retrograde evolution, was proposed by Horowitz (Horowitz, 1945). Thetheory was primarily addressing the problem of how one enzyme could provide aselective advantage to an organism before the whole pathway had been completed. Itstates that during evolution pathways assembled backward compared to the directionof the pathway in response to depletion of substrates from the environment. Asan example consider the following scenario: Enzyme E catalyzes reaction A → B,where B is essential to the organism. A is depleted from the environment, andsubsequently, an organism harboring an enzyme E’ with the ability to catalyze areaction producing A from some other substrate from the environment would be atan advantage. Since E can already bind A there is a greater chance that E rather

8

Page 14: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

than an enzyme without an affinity for A will be mutated into E’, if duplicated, seeFigure 4.

B

BAC

A

A

E

EE’

Figure 4: Retrograde evolution. Retrograde evolution describes how pathways evolve backwardscompared to the direction of the pathway. A is the substrate that is depleted from the environment.E and E’ are the enzymes of the reactions, where E’ is a duplicate of E.

In 1976 Jensen described recruitment evolution (Jensen, 1976), more often re-ferred to as the patchwork evolution model (Lazcano & Miller, 1996), as an alter-native to retrograde evolution. The retrograde evolution model demands an accu-mulation of substrates in the environment. Since many intermediate substrates arechemically unstable, the retrograde evolution model cannot explain the evolution ofreactions involving such substrates. The patchwork evolution model states that thenumber of enzymes was initially small, and broad substrate specificities enabled asmany diverse functions as possible given the small enzymatic repertory. As evolutionproceeded specialization took place by gene duplication. For example consider thefollowing: An enzyme E catalyzes a reaction where either one of the substrates S1and S2 is accepted. Through gene duplication and random mutation two differentversions of E evolve; E’, which only accepts S1 as a substrate, and E” which onlyaccepts S2 as a substrate. Jensen could support his theory by the observation thatseveral contemporary enzymes have quite broad substrate specificities. As a sec-ondary theory to patchwork evolution, Jensen suggested that pathways may evolveen bloc, i.e. chains of reactions within pathways evolve together as a result of broadsubstrate specificities, see Figure 5. There is a certain amount of support for thistheory since several reaction chains in different pathways are highly analogous (forinstance reaction chains in the TCA cycle and lysine biosynthesis).

The retrograde evolution model was at first supported by the discovery of oper-ons, where the functionally related genes in operons were thought to have evolvedthrough tandem duplications (Horowitz, 1965). It was later established that thegenes present in the same operon often do not show significant sequence similarity.

9

Page 15: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

S11S12S13S14S15S16S17

S21S22S23S24S25S26S27

S31S32S33S34S35S36S37

S41S42S43S44S45S46

E1 E2 E3

S47

a)

b) S11 S21 S31

S17

E2’ E3’

S37S27 S47

S41E1’

E1’’ E2’’ E3’’

Figure 5: Pathway evolution en bloc. a) E1, E2 and E3 are ancient enzymes with very broadsubstrate specificities. S11-17, S21-27 and S31-37 are substrates accepted by E1, E2 and E3,respectively. The initial function of selective advantage is the product S41 from the substrate chainS11-S21-S31-S41 but mutations of the broad specificity enzymes leads to the substrate chain S17-S27-S37-S47, where S47 gives a significant selective advantage. b) This may favour the duplicationof E1, E2 and E3, which allows specialization of the enzymes to efficiently produce both S41 andS47.

Instead, some gene clusters can be explained by the efficient transfer of functionalmodules between unrelated organisms through horizontal gene transfer (Lawrence& Roth, 1996).

There are a few known homologous genes coding for enzymes which catalyzeconsecutive reactions and which therefore represent possible cases of retrograde evo-lution: trpC, trpA and trpB, which catalyze consecutive steps of the tryptophanbiosynthesis (Wilmanns et al., 1991), hisF and hisA in the histidine biosyntheticpathway (Fani et al., 1995) and metB and metC in the methionine biosynthesis(Belfaiza et al., 1986). Interestingly, a mechanism analogous to retrograde evolu-tion can be successfully applied to the evolution of vertebrate steroid receptors.According to Thornton’s ligand exploitation process, the terminal ligand in thebiosynthetic pathway of steroids in vertebrates is the first one for which a receptorevolved, namely an estrogen receptor (Thornton, 2001). Furthermore, there are afew studies which give some support for the retrograde evolution model. A super-family has a general tendency to appear in one or two particular pathway(s) (Saqi &Sternberg, 2001) and, additionally, homologous enzymes are found at close distanceswithin the (extended) pathways of E. coli (Rison et al., 2002). Finally, homologousenzymes are also found close to each other in the whole metabolic network (Alveset al., 2002).

The patchwork evolution model holds that there should be many pairs of ho-mologous enzymes catalyzing analogous reactions, where one or more substrates arenon-identical but similar. Support for this theory is more substantial than for theretrograde evolution model (Copley & Bork, 2000; Teichmann et al., 2001; Tsoka

10

Page 16: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

& Ouzounis, 2001; Petsko et al., 1993). The TIM-barrel containing enzymes havebeen found in many different pathways (Copley & Bork, 2000) and the homologouspairs of small molecule metabolism enzymes of E. coli have been shown to be evenlydistributed within and across pathways (Teichmann et al., 2001; Tsoka & Ouzounis,2001).

In conclusion, recent studies have shown that there is little support for the retro-grade evolution model while Jensen’s patchwork evolution model has substantiallymore support (Rison et al., 2002; Alves et al., 2002). However, the importance ofthese mechanisms, and others depending on gene duplication, in the evolution ofthe metabolic network of E. coli may be smaller than the impact of horizontal genetransfer (Pal et al., 2005). A recent study shows that the metabolic network of E.

coli evolves by incorporating horizontally transferred genes which are predominantlyinvolved in transportation (Pal et al., 2005).

3.2 Protein-protein interaction networks

Although enzymes sometimes physically interact, the metabolic network describedabove consists of enzyme nodes connected to each other merely by compound edges.The indirect interactions of metabolic enzymes are clearly different to the directphysical interactions between proteins. Such interactions are important to mostbiological processes, since many proteins need to interact with other proteins toperform their functions properly. Hence, knowledge about the interactions betweenproteins is crucial for understanding biological functions. Furthermore, the functionsof many proteins are unknown and identification of the physical interactions in whichthese proteins participate may give a first indication of their function.

3.2.1 Context-based protein-protein interaction prediction

The budding yeast S. cerevisiae was the first eukaryotic genome that was sequenced(Goffeau et al., 1996) and, therefore, the first systematic genome-wide studies of pro-tein interactions have been conducted on S. cerevisiae. After the publication of the S.

cerevisiae genome sequence, several computational methods based on genomic con-text were developed for protein-protein interaction (PPI) prediction. For instance,if two proteins are fused in at least one genome, but appear as separate genes inanother genome, there is an increased chance that the proteins interact, or are other-wise functionally coupled (Marcotte et al., 1999; Enright et al., 1999). Furthermore,although gene order is generally poorly conserved in prokaryotic genomes, the geneswhich do display a conserved gene order across prokaryotic species are likely to phys-ically interact (Dandekar et al., 1998). Finally, phylogenetic profiles have been usedto identify which proteins often occur together, which is another strong indicationof proteins with a greater likelihood of interacting (Pellegrini et al., 1999).

11

Page 17: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

3.2.2 Experimental determination of individual protein-protein interac-tions

The known metabolic reactions have, for the most part, been identified throughlaborious studies of individual enzymes. Furthermore, the function of a newly se-quenced gene may be inferred from its homology to a protein of identified function.Similarly, the physical interactions between proteins had to be detected by char-acterization of individual interactions. Recently, however, high-throughput (HT)studies have been developed, by which protein-protein interactions may be identi-fied through genome wide studies. Consequently, the past few years have seen agreat increase in the number of known protein-protein interactions. The two mostimportant methods are affinity purification followed by mass spectrometry, whichis a common technique for identifying protein complexes (Gavin et al., 2002; Hoet al., 2002), while the yeast two-hybrid method is used for identifying individualprotein-protein interactions (Uetz et al., 2000; Ito et al., 2001; Fields & Song, 1989).

The yeast two-hybrid method (Y2H) (Fields & Song, 1989) can be used to deter-mine if two particular proteins interact. In this method, the two proteins of interest,A and B, are fused to two different kinds of proteins. A (often referred to as bait)is fused to a DNA-binding protein, which binds to a specific stretch of DNA slightlyupstream of the reporter gene, a gene encoding a protein that reports the presenceof an interaction between A and B. B (often referred to as the prey) is fused to anactivating domain (AD), which activates the transcription of the reporter gene. Thereporter gene will not be transcribed unless AD is present, and AD is only presentif A interacts with B. Therefore, there is a signal, such as for instance growth onhistidine-free medium, only when A and B interact.

The two-hybrid method is tractable for systematic large-scale studies and hasbeen used to study the entire proteome of S. cerevisiae (Ito et al., 2001; Uetzet al., 2000), Caenorhabditis elegans (Li et al., 2004) and Drosophila melanogaster

(Giot et al., 2003). Furthermore, Y2H is a sensitive method capable of detectingtransient as well as stable interactions. However, the method suffers from a fewserious problems and many of the interactions detected by Y2H, possibly as manyas 50-90% (Mrowka et al., 2001; Sprinzak et al., 2003), are probably erroneous (falsepositives). Further, the detection of the interactions takes place in the nucleus,which is not the natural environment for a large portion of the interactions. Whilethe latter could lead to that many interactions are missed, it may also lead tothat other interactions are misconstrued as interactions when, in fact, these maybe secondary interactions mediated through a third protein, or an RNA-molecule,connecting the two proteins under investigation. Finally, there is a large number ofknown interactions between proteins which are missed with Y2H (false negatives),such as for instance the known interaction between the α and β electron transferflavoproteins (Aloy & Russell, 2002). The large number of false negatives is likely tobe caused by that proteins only interact when certain activation signals have inducedconformational changes in one or both of the interacting proteins (Ito et al., 2001).

12

Page 18: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

In addition, the unnatural mechanism of fused proteins within a compartment, thenucleus, where most of the proteins do not naturally interact, is a likely cause forthe absence of known protein-protein interactions in the two-hybrid screens. Finally,membrane protein interactions are unlikely to be detectable by Y2H.

3.2.3 Experimental determination of protein complexes

The affinity purification methods, on the other hand, do not identify individual in-teractions between proteins, but are used to determine which proteins appear incomplexes together. In tandem affinity purification (TAP) (Rigaut et al., 1999),some proteins are selected as baits which are used to fish for the proteins that forma complex with the bait. The baits are fused with two affinity tags, typically Staphy-

lococcus aureus protein A (ProtA) and calmodin-binding peptide (CBP) separatedby a TEV protease cleavage site. The tags are used to attach the bait to an affin-ity chromatography column in two tandem steps. First, the mixture of proteins iseluted on the IgG affinity column and the bait proteins’ ProtA tags bind to the IgGmatrix together with its complex of proteins. Second, TEV protease is added to theelute to extract the complex. Third, the bait and complex are added to the secondcolumn, where the CBP tag on the bait binds to the calmodin of the matrix. Thematrix is thoroughly washed, and finally the purified complex is eluted with EGTA.

Clearly, the rigorous purification steps in TAP should disfavour the detection oftransient interactions within complexes, and, consequently, most complexes foundare likely to be stable. Furthermore, the exact interactions between the proteins inthe complexes detected by TAP have not been determined. Some of the proteins inthe complexes interact directly with each other, but others are at the outskirts ofthe protein complex and are not in each others proximity, although they are likelyto be functionally related.

Affinity purification methods have been successfully applied to large scale studieson the proteome of S. cerevisiae (Gavin et al., 2002; Ho et al., 2002). The overlapbetween the interaction pairs found with affinity purification methods and Y2H issmall, partly because the methods complement each other (Aloy & Russell, 2002).However, the lack of overlap between similar methods shows that both Y2H andaffinity purification methods suffer from shortcomings.

3.2.4 The need for validation of protein-protein interactions

Although high-throughput methods are comprehensive and less likely to be subjectto the biased expectations of individual researchers, unlike low-throughput methods,there is a small overlap between the interaction pairs produced by different methods(Uetz & Finley, 2005). For instance, the two independent Y2H screens of S. cere-

visiae merely showed a 17-20% overlap (Ito et al., 2001). It is not yet clear if thislack of overlap is caused by a limited coverage of the vast yeast interactome, or if itis caused by false positives. However, among the high-throughput methods, TAP is

13

Page 19: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

the most reliable HT method according to a recent analysis (Cornell et al., 2004).Furthermore, it is clear that the protein interactions detected by HT Y2H are dif-ferent in nature from those detected by low-throughput Y2H experiments, collectedfrom the Munich Information Center for Protein Sequences (MIPS) (Mewes et al.,2002). In fact, either the fraction of false positives for HT Y2H is at least 40%,or the low-throughput experiments are clearly biased (Mrowka et al., 2001). Sincethere is a small overlap between the interaction pairs produced by these methods,there is a need for reliable validation measures.

Generally, a protein-protein interaction that has been observed more than onceis more reliable, although false positives may be highly reproducible (Fields, 2005).Furthermore, computational methods are frequently utilized to indicate the reli-ability of protein-protein interactions. For instance, functional annotations andsubcellular localizations may provide an indication of the reliability of a particularprotein-protein interaction since proteins which are predicted to be active in the samesubcellular location and have related functions are more likely to interact (Sprinzaket al., 2003). In addition, expression patterns for proteins in the same complex areexpected to be correlated, otherwise their expression would be unnecessarily ener-getically costly (Jansen et al., 2002), although less so for interactions of a transientnature. Consistently, interacting proteins are often co-expressed (Grigoriev, 2001;Jansen et al., 2002) and, in combination, expression analysis and protein-proteininteraction data can be used to initially characterize proteins of unknown function(Kemmeren et al., 2002). Furthermore, structural information (Edwards et al., 2002;Aloy & Russell, 2002) and functional annotation (Marcotte et al., 1999) can be usedto validate protein-protein interactions.

Interactions which have been identified in low-throughput experiments are, as arule, considered more reliable. Nevertheless, despite the problems associated withHT protein-protein datasets, their expedience warrants a continued effort to developimprovements. Furthermore, in the unlikely event that there will be no significantadvances of HF PPI studies in the coming years, the methods available today can stillpresent candidates for protein interactions (Fields, 2005). However, recent technicalimprovements in pooling strategies indicate that the accuracy of HT Y2H could besignificantly increased, while the number of screens is simultaneously decreased (Jinet al., 2006).

3.2.5 Toward the human protein-protein interaction network

Hopefully, many of the efforts concerning the yeast protein-protein interaction net-work may be transferred to Metazoa, and ultimately to the human protein-proteininteraction network (PPIN). The computational methods above can be applied tonewly sequenced genomes, and preliminary interaction networks for these organismscan thereby be inferred (Date & Marcotte, 2003; von Mering et al., 2003). For in-stance, there are small conserved subnetworks which are shared between the PPINsof organisms as diverse as Helicobacter pylori and S. cerevisiae (Kelley et al., 2003),

14

Page 20: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

and some S. cerevisiae interactions were successfully modeled to the C. elegans in-teraction network (Li et al., 2004). Although such computational methods, basedon genomic context or homology, or both, are likely to contribute to the understand-ing of the PPINs of Metazoa, there is clearly a need to perform high-throughputexperiments on the higher organisms.

Current estimates of the number of protein-protein interactions in yeast indicatethat there are 10,000-30,000 interacting pairs while the corresponding estimate forthe human interactome is 40,000-200,000, excluding considerations for splice variantsor post-translational modification. Although the interactomes of the nematode C.

elegans (Li et al., 2004) and the fruit fly D. melanogaster (Giot et al., 2003) havebeen studied by HT techniques, there are distinct networks in each organism that arenot expected to be transferable to vertebrates, and there are probably vertebratespecific subnetworks. Presently, the human interactome is being studied throughrevised yeast two-hybrid methods, where the expression levels of the fusion proteinsare low and auto-activators are eliminated, through usage of three different Gal4-inducible promoters (Rual et al., 2005) and further results are expected soon fromhigh-throughput Y2H screens and mass spectrometry experiments (Warner et al.,2006).

3.2.6 The evolution of protein-protein interactions

The repertory of proteins is partially conserved between relatively closely relatedorganisms. Can the same be said of protein-protein interactions? Indeed, it seemsthat some groups of proteins, which belong to the same pathway, are all present,or all absent, from a genome (Pellegrini et al., 1999). Furthermore, many of theinteractions present in yeast appear to also be present in C. elegans, although thePPIN of the eukaryotic intracellular parasite Plasmodium falciparum shows littlesimilarity with the other eukaryotes (Suthram et al., 2005). Clearly, an understand-ing of how the protein interactions evolve is likely to elucidate the evolution of newfunctions.

Proteins which interact with each other in stable complexes evolve at similarrates (Fraser et al., 2002). Indeed, residues in interfaces of stable complexes evolveat a slow rate, and appear to co-evolve, i.e. a substitution in one protein leads toa complementary alteration in its neighbor, in order to preserve the functionalityof the interaction (Mintseris & Weng, 2005). For instance, a number of ligand-receptor pairs, where the interactions between the ligands and their receptors areoften obligate, have been shown to co-evolve (Goh & Cohen, 2002). In contrast,proteins which interact transiently show little evidence of co-evolution (Mintseris& Weng, 2005). However, although there is little evidence for co-evolution at theamino acid level between transiently interacting proteins, there is co-evolution ofexpression levels of interacting proteins (Fraser et al., 2004).

We may surmise that the level of co-evolution of two interacting proteins variesaccording to the stability of the interaction. Similarly, the evolutionary history of

15

Page 21: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

proteins in complexes may be somewhat different from that of proteins involvedin transient interactions. Interestingly, a recent study (Pereira-Leal & Teichmann,2005) indicates that complete or partial module duplication, i.e. duplication ofprotein complexes, has occurred on several occasions during the evolution of S.

cerevisiae. For transient interaction pairs, it does not seem that the proteins evolvein a similar manner. In fact, after gene duplication events, interactions betweenproteins are often rapidly lost in an asymmetric manner, where one of the genesloses most of the original interactions (Wagner, 2001). The asymmetry is prominentin hubs, where divergence results in loss of sub-functions in the duplicated gene,whereas duplicates of proteins with few interaction partners are more likely to gainnew functions (Zhang et al., 2005).

3.3 On the structure of biological networks

Metabolic networks and other complex networks such as the film actor collabora-tion network, the world wide web, protein domain networks and protein-proteininteraction networks are small-world networks with some properties which are con-sistent with scale-free networks (Jeong et al., 2000; Wagner & Fell, 2001; Wuchty,2001; Apic et al., 2001a). The small-worldliness of the metabolic network of E. coli

has recently been contested for an alternative network representation where carbonatomic traces in metabolic reactions were used (Arita, 2004). Furthermore, sincethe coverage of the real PPIN is low, it has been questioned whether the topologyof the PPIN can currently be correctly identified (Han et al., 2005).

A small-world network is characterized by 1) short path lengths between anytwo nodes in the network and 2) a high clustering coefficient, which means thatthe neighbors of a certain node of the network are often connected to each otherthereby forming clusters, see Figure 2. A scale-free network, in this context, hasa power-law connectivity (degree, k) distribution, i.e. there are many nodes thathave very low connectivities and a handful of nodes with much higher connectivities(hubs), see Figure 6. Scale-free networks are robust networks in the sense that theyoften remain intact when a large fraction of randomly chosen nodes is eliminatedfrom the network (Albert et al., 2000). However, if a small fraction of the hubs ofthe network is eliminated the network is likely to become fragmented into severalcomponents.

3.3.1 What is the advantage of scale-freeness?

There are at least two possible reasons for why scale-free topologies are observedin many biological networks. First, the scale-free topology may give the organisma selective advantage over organisms with other network topologies. It has beensuggested that the scale-free character of biological networks has evolved throughnatural selection for the advantage of robustness and error-tolerance that the scale-free network topology confers to the organism (Jeong et al., 2001). Second, the

16

Page 22: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

Scale−free network Random network

k=6k=6

b)a)

�������������������������������������������������

�������������������������������������������������

�������������������������������������������������

�������������������������������������������������

�������������������������������������������������

�������������������������������������������������

Figure 6: Scale-free networks. a) A schematic scale-free network. The hub of the network (striped)has the highest connectivity (k=8) while most of the nodes have quite low connectivities. Thelongest path (high-lighted) between any two nodes in the network is 4 steps. b) A schematicrandom network. This network has the same number of nodes and edges as the network in figurea, but the connectivity distribution is different. The highest connectivity in the network (k=3) issimilar to the average connectivity (k=1.9), and the longest path is rater long (9 steps).

scale-free topology could be a side effect of an evolutionary process.Two possible selective advantages that the scale-free network topology may con-

fer to the organism have been brought forward. First, the scale-free topology of thenetworks makes them robust against random mutations since most nodes in the net-work only are connected to one or two other nodes and the deletion of such a nodewill not significantly affect the structure of the network. However, the integrity ofthe scale-free network is susceptible to attacks on the hubs. In accordance with thisnotion, proteins with high connectivities in the protein-protein interaction networkof S. cerevisiae are more likely than other proteins to be essential proteins (Jeonget al., 2001) and, accordingly, the average connectivity of essential proteins is twiceas high as that of non-essential proteins (Bader & Hogue, 2002). Recently, however,a substantial correlation between connectivity and essentiality could not be foundin the PPIN of neither yeast nor human (Gandhi et al., 2006; Coulomb et al., 2005).It is possible that the discrepancy is due to disparate choices of protein-protein in-teraction databases. Second, scale-free networks have short average path lengths,i.e. there usually exists a short path between any two nodes in the network, which

17

Page 23: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

could enable the network to shift quickly from one state to another in response toenvironmental changes (Jeong et al., 2000).

3.3.2 How do biological scale-free networks evolve?

Networks with scale-free properties have been shown to evolve when two simple rulesare applied: 1) The network grows by the addition of new nodes. 2) Preferentialattachment: New nodes are more likely to become connected to well connectednodes in the network than to low connectivity nodes (Barabasi & Albert, 1999).While preferential attachment is often at the root of scale-freeness, a network with apower-law degree distribution might be produced through other mechanisms as well.Preferential attachment in the context of genetic networks may take place partlythrough gene duplication (Qian et al., 2001; Barabasi & Oltvai, 2004). In agreementwith preferential attachment, proteins which have homologs in all 3 domains of life,which are likely to be of ancient origin, have high connectivities in the protein-protein interaction network of S. cerevisiae (Eisenberg & Levanon, 2003; Saeed &Deane, 2006). In contrast, Kunin et al (Kunin et al., 2004) showed that the mosthighly connected proteins date to after the evolution of primordial eukaryotes butbefore the radiation of eukaryotes to Plants, Metazoa and Protista.

3.3.3 The relationship between protein-protein interaction network topol-ogy and evolution

Even if the exact nature of the degree-distribution of the PPIN has not yet beenfirmly determined, and it is likely to vary depending on the definition of an interac-tion, it is clear that there are some highly connected proteins that are characterizedby certain properties. Hub proteins could be particularly interesting drug targets, forinstance in cancer therapy (Apic et al., 2005), where highly expressed hub proteinsin diseased tissues may be targeted.

The hubs of the PPIN of S. cerevisiae have been shown to evolve slowly, whichmay be because larger portions of the lengths of these proteins are directly involvedin their interactions (Hirsh & Fraser, 2001; Fraser et al., 2002). Other studiesindicate that the proposed negative correlation between evolutionary rate and con-nectivity is only due to a small fraction of proteins with high numbers of interactionswhich evolve slower than most proteins in the yeast network (Jordan et al., 2003).The difference between some of these studies seems to be due to the nature of thedata sets. When complexes identified with mass spectrometry based methods areincluded in the analysis the correlation between connectivity and evolutionary rateis clear (Fraser et al., 2003).

Based on expression profiles, two different hub types in the PPIN of S. cere-

visiae may be identified; static hubs (party hubs) and dynamic hubs (date hubs)(Han et al., 2004). The party hubs reside in static complexes where they interactwith most of their partners at the same time, while date hubs bind their interac-

18

Page 24: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

tion partners at different times and/or locations. Party hubs are probably centralparts of functional modules/complexes while date hubs act as the mediators be-tween these semi-autonomous modules, see Figure 7. Thus, date hubs appear to bemore important than party hubs for the topology of the network (Han et al., 2004).Interestingly, they are often younger than party hubs (Fraser, 2005), indicating thatthere is a greater variability between species with regard to these connectors. Fur-ther, while there is no substantial difference between the proportion of essentialproteins among the party and date hubs, perturbation of the date hubs leads tosusceptibility of the genome to further perturbations (Han et al., 2004).

Estimates show that a quarter of the gene deletions in S. cerevisiae which are notassociated with a changed phenotype, are compensated by gene duplicates (Gu et al.,2003). On the other hand, a large number of genes without duplicates have beenshown to have no detectable effect on the phenotype upon deletion (Wagner, 2000),thus indicating that the lack of redundancy does not seem to result in susceptibility.However, if these singletons are young proteins at the peripheries of the network,perhaps on average less important than other proteins, it may not be surprising thattheir deletion does not lead to detrimental effects. Since hub proteins are pivotal forthe robustness of the PPIN, it is conceivable that duplicates, or partial duplicates,of the hubs, rather than other proteins, may be particularly beneficial. However,such hub duplicates may be rare since gene duplications often cause an imbalancein the concentration of the components of protein-protein complexes which mightbe deleterious (Veitia, 2002; Papp et al., 2003).

3.3.4 Is scale-freeness of biological importance?

There are several reasons for believing that scale-freeness is of little biological rele-vance. First, the observed topology of the protein-protein interaction network canevolve, without selective pressure network topology, through gene duplication anddeletion of links between proteins (Wagner, 2003; Amoutzias et al., 2004; van Noortet al., 2004). Second, since the metabolic networks studied in Jeong et al’s analysis(Jeong et al., 2000) include ubiquitous compounds such as water, NADH and ATP,it is dubious whether the similar average path lengths of various organisms meansanything else than that water, ATP and NADH are ubiquitous and therefore dom-inate the structure of the compound network in all these organisms. Indeed whenthe most ubiquitous compounds are removed from the network different organismshave quite different average path lenghts (Ma & Zeng, 2003).

Finally, and perhaps most importantly, it is possible that all large chemical net-works have scale-free topologies. To investigate this possibility Wagner (Wagner,2004) used planetary reaction chemistry data compiled in another study (Gleisset al., 2001). This data includes the chemical reactions found in the atmospheresof Earth, Venus and Jupiter. The resulting networks also have scale-free topolo-gies, indicating that natural selection is not necessary to explain the developmentof scale-free topologies. It is therefore highly arguable whether natural selection

19

Page 25: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

3 THE EVOLUTION OF BIOLOGICAL NETWORKS

(a)

(b) (c)

Figure 7: Party hubs and date hubs. a) The nodes represent the subset of eukaryotic specificproteins of S. cerevisiae and the edges are physical interactions as determined by the core datasetfrom DIP (Salwinski et al., 2004). Party hubs (green nodes) are highly connected proteins (hubs)which are co-expressed with their neighbors, while date hubs (yellow nodes) are hubs whose neigh-bors are expressed at different times. The grey nodes represent the proteins that are neither partyhubs nor date hubs. b) The date hub subnetwork from a is characterized by a high degree ofinterconnectivity. c) The party hub subnetwork shows compartmentalization. These images weredrawn with BioLayout (Goldovsky et al., 2005)

20

Page 26: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

4 TECHNIQUES AND DATABASES

molds the global structure of cellular networks because of selective benefits of thescale-free topology. Instead, the scale-free distribution of compounds in the plane-tary atmosphere may lead to a scale-free distribution of connectivity in metabolicnetworks, where the compounds are used in the biological system in proportion totheir prevalence.

4 Techniques and databases

Here, some of the techniques and databases from Papers I-IV are described. First,the representation frameworks for the metabolic and protein-protein interaction net-works are discussed. Second, the methods used for estimating protein age and do-main assignments are explained.

4.1 Representation of the metabolic network

Metabolic pathway information is available in several different databases such asEcoCyc (Karp et al., 2002), WIT (Overbeek et al., 2000), BRENDA (Schomburget al., 2002) and KEGG (Kanehisa et al., 2002). EcoCyc is an E. coli K-12 MG1655specific database and contains the metabolic complement of E. coli. EcoCyc hasan advantage over the two other databases in that it contains information aboutthe directionality of the enzymatic reactions. Furthermore, some enzymes havenot yet been classified according to the Enzyme Commission (EC) system (NC-IUBMB, 1992). Whereas, for instance, KEGG does not contain clear documentationof unclassified enzymatic reactions, EcoCyc contains both EC-classified enzymes andenzymes that fall outside of the EC classification.

The classic representation of metabolic reactions depict enzymes as edges, whilecompounds represent the nodes, see Figure 3. In recent years, however, the reverserepresentation, often referred to as ’protein-centric’ graphs (Gerrard et al., 2001) or’reaction graphs’ (Wagner & Fell, 2001), of the network has become more commonsince it is particularly well suited for comparative studies of the enzymes relativeto their connections and positions in the network. The work presented herein relieson the protein-centric network representation. An edge represents one or morecompounds. There will be an edge leading from an enzyme E1 to an enzyme E2if E1 catalyzes a reaction where compound A is produced and E2 takes A as asubstrate. Reversible reactions, such as A � B, are separated into two reactions, A→ B and B → A. There can be at most one edge in each direction between a pairof enzymes, see Figure 8. In addition, some reactions are physiologically irreversibleand should be represented accordingly. Therefore, we used a directed graph torepresent the metabolic network of E. coli. As a result, although the network is afully connected graph, there are not necessarily paths from every enzyme node inthe network to every other enzyme.

21

Page 27: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

4 TECHNIQUES AND DATABASES

8−amino−7−oxononanoate

7,8−diaminononanoate

2.6.1.622.3.1.47

2.8.1.6 6.3.3.3

CO2

Dethiobiotin

Figure 8: Protein-centric representation of the biotin metabolism. In the protein-centric repre-sentation enzymes are nodes and reactants are edges. EC 2.3.1.47 is a neighbor of EC 2.6.1.62because there is an edge leading from EC 2.6.1.62 to EC 2.3.1.47. The edges representing CO2,between EC 2.3.1.47 and EC 6.3.3.3, are eliminated when the 20 most promiscuous compoundsare removed from the network.

One of the most problematic aspects with metabolic network analysis is howthe promiscuous compounds should be dealt with. It could be argued that sincepromiscuous compounds are usually not the limiting factors of reactions, the networkwould become more biochemically meaningful if these compounds were removed (Ma& Zeng, 2003). However, to our knowledge there is no generally accepted criterionto determine the biological relevance of a compound in a reaction. Therefore, inthe studies presented here, a simple network-based criterion was applied. We countthe number of times a compound occurs as part of an edge in the network. Themost common compounds are then considered as promiscuous compounds and areremoved from the analysis (Alves et al., 2002).

4.2 Representation of the protein-protein interaction net-work

Protein-protein interaction networks are often represented with proteins as nodesand the physical interactions between the proteins as edges. While most studies ofPPINs often utilize this representation, the choice of database is less obvious. Itis intuitively appealing to build a PPIN representation from various resources, i.e.low-throughput experiments, affinity purification methods and two-hybrid methods.However, therein lies a risk of defusing the meaning of ”physical interaction”. Forinstance, it is often unclear which proteins interact with each other within a largeprotein complex. Most investigations have made use of either the ’spoke’ model,where the bait protein interacts with all the other proteins in the complex but there

22

Page 28: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

4 TECHNIQUES AND DATABASES

are no interactions between the preys, or the ’matrix’ model, where all proteins in acomplex are assumed to interact with each other (Bader & Hogue, 2002). Of course,these two models are diametrically opposed, and neither one of them is likely tobe representative of the real relationships between proteins in complexes, althoughestimates show that the spoke model is threefold more accurate than the matrixmodel (Bader & Hogue, 2002).

In the study presented here the representation of the PPIN was built usingthe ’core’ data set from the database of interacting proteins (DIP) (Deane et al.,2002; Salwinski et al., 2004) downloaded in March 2005. DIP contains low-and-high-throughput experimentally verified interactions, including two-hybrid and TAPstudies. The core data set consists of the interactions in DIP, which, by computa-tional methods, are deemed reliable. First, the expression profile reliability indexis derived from the expression correlations of interactions in DIP to those of knowninteracting and non-interacting pairs. Second, the DIP data set is verified throughparalogous verification, whereby an interaction pair is considered more reliable if ithas mutually interacting paralogs within the yeast interactome.

The connectivity of a protein node is defined as the number of proteins it is con-nected to, see Figure 2. The proteins were grouped according to their connectivitiesin the core interaction network. Hubs are defined as proteins with eight or moreinteractions while proteins with less than four interactions are labeled non-hubs andthe rest are intermediately connected. As previously mentioned, the physical in-teractions of proteins vary a great deal, which is especially obvious when the hubproteins are considered, since some of the hubs are at the cores of stable proteincomplexes, whereas others are involved in quite varied interactions. In fact, hubsmay be classified into party hubs (PH), believed to be at the cores of modules, anddate hubs (DH), believed to be the connectors between these modules (Han et al.,2004). Co-expression profiles from five different conditions (stress response (Gasch& Werner-Washburne, 2002), cell cycle (Spellman et al., 1998), pheromone treat-ment (Roberts et al., 2000), sporulation (Chu et al., 1998) and unfolded proteinresponses (Travers et al., 2000)) can be used to determine the average Pearson cor-relation coefficient (PCC) between each hub and its interaction partners. Party hubsare defined as proteins that show high average PCC with their interaction partners.

4.3 Domain assignment and protein age

Protein domains are the structural, functional and evolutionary units which to-gether, or one by one, form proteins. For the purposes of Papers III and IV thedomains from Pfam-A (Sonnhammer et al., 1997) were assigned by HMMER-2.0to various genomes. We determined the number of domains in proteins by addingthe number of Pfam-A, Pfam-B domains and unassigned regions longer than 100residues (orphan domains). The proteins that only consist of orphan domains arereferred to as orphans (Rost, 2002). Orphan proteins, have no homologs, or veryfew homologs which are only present in closely related species, and do not contain

23

Page 29: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

4 TECHNIQUES AND DATABASES

any known domains.In Paper IV, protein age was estimated based on the domain contents. The

domains were then grouped into those present in i) eukaryotes and prokaryotes, ii)eukaryotes only, iii) S. cerevisiae and/or Schizosaccharomyces pombe (yeast) andiv) no other species (orphans). In order to avoid bias toward repeating domains,which are common in eukaryotes, each domain family represented in the proteincontributed equally to the age class of the protein, i.e. repeating domains were onlycounted once. For example, a six domain protein consisting of four ancient domainsA, one orphan domain B and one eukaryotic domain C is 1/3 ancient, 1/3 orphanand 1/3 eukaryotic. As an alternative estimate of protein age, the S. cerevisiae

proteins were assigned to orthologous groups using clusters of orthologous groups(COG) (Tatusov et al., 2000) for prokaryotic orthologs and KOG (Tatusov et al.,2003) for eukaryotic orthologs.

Furthermore, in Paper IV we sought to classify paralogous pairs according to thethe approximate time of the duplication. For this purpose, KOG, complemented bytwo-species groups (TWOGs) and lineage-specific extensions (LSEs), were used tofind inparalogs (Remm et al., 2001), see Figure 1. Inparalogs are defined hereas paralogous proteins that appear to have originated since the split between S.

cerevisiae and S. pombe, believed to have occurred roughly 1,100 Myr ago (Heckmanet al., 2001). Among the inparalogs, there is a subset of paralogous pairs thatoriginate from the whole genome duplication (WGD), which is believed to have takenplace in S. cerevisiae about 100 Myr ago. These paralogs, ohnologs, were identifiedby the Yeast Gene Order Browser (Byrne & Wolfe, 2005). Although the inparalogdataset probably includes paralogs which arose quite recently, it is reasonable toassume, that the ohnolog subset represents some of the youngest inparalogs. As analternative measure, paralogs were defined as sequences with an e-value below 10−5

and an alignment covering more than 40% of the longest of the compared sequences.In addition, proteins of at least two domains and identical Pfam-A or Pfam-A+Bdomain architectures were also considered paralogous if their lengths did not differby more than 30% of the longer sequence.

4.4 Statistical tests

For Papers I, II and IV randomized networks were generated in order to estimate thestatistical significance of the results. This was done through shuffling the distributionof the property under investigation while preserving the network topology, see Figure9. Subsequently, Z-scores were calculated. For example, consider Paper IV, where wesought to determine the significance of the distribution of inparalogs between partyhubs and other proteins. The connectivities of the proteins were randomized, butthe topology preserved, and the number of inparalogs in the different connectivityclasses (party hubs, date hubs, intermediately connected, non-hubs) were calculated.In this case, the randomization process was iterated 10,000 times and the averageand standard deviation from the randomizations were used to calculate the Z-score

24

Page 30: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

5 THE PRESENT INVESTIGATIONS

Z = (n− r)/stdev(r), where n is the real number of inparalogs in a connectivity classk and r is the result from the randomized network. The Z-score then expresses howfar the average number of inparalogs belonging to a certain connectivity group differsfrom the average number of inparalogs of randomly sampled proteins, measured inunits of the random sampling distribution’s standard deviation. The larger the Z-score, the less likely that the difference between the real and random networks is bychance.

a)Original network

b)

E

A

D

C

G

F

B

H

Shuffled network

C

D

A

E

H

B

F

G

Figure 9: Preservation of network topology. a) The figure shows the nodes and edges in the originalexample graph. b) The figure shows the nodes and edges for the randomized graph where the nodeidentities (A, B, C, D, E, F, G, H) have been shuffled. The topological properties of the originalgraph are preserved in the randomized graph.

5 The present investigations

The investigations in this thesis revolve around the evolution of biological networks.The initial two papers concern the well studied metabolic networks, first with regardto two long standing competing explanations for the evolution of metabolism (PaperI) and, second, from a network topology perspective (Paper II). Furthermore, PaperIII describes a study estimating the proportion of different domain rearrangementsin proteins. Finally, Paper IV concerns the characteristics of the proteins with manyinteraction partners in the protein-protein interaction network of S. cerevisiae, withregard to their domain compositions, interaction partners and evolutionary history.

25

Page 31: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

5 THE PRESENT INVESTIGATIONS

5.1 Retrograde vs Patchwork evolution (Paper I)

The two most common models for the evolution of metabolism are patchwork evo-lution (Jensen, 1976), where enzymes are thought to diverge from broad to narrowsubstrate specificity (see Figure 5), and retrograde evolution (Horowitz, 1945), ac-cording to which enzymes evolve backwards, compared to the direction of the path-way, in response to substrate depletion, see Figure 4. Analysis of the distribution ofhomologous enzyme pairs in the metabolic network can shed light on the respectiveimportance of the two models. In this study, we investigate the evolution of themetabolism in E. coli K-12 MG1655 viewed as a single network using EcoCyc.

The first step in this investigation was to determine if there is an overrepresen-tation of homologous enzymes at short distances in the network. To accomplish thistask, sequence comparison between all enzyme pairs was performed and the minimalpath length (MPL) between all enzyme pairs was determined. In the real network58 % of all homologous pairs are found at MPL=1, compared to 18 % in the ran-domized networks. However, in this network all compounds, including ATP, NADHand NADPH (promiscuous compounds), were included. Although this network iscompletely devoid of biochemical bias (or knowledge) about the ’importance’ of thevarious reactants, one could certainly argue that the short distances in the networkbetween all enzymes that catalyze reactions involving ATP are not of substantialbiological relevance. Furthermore, from a purely methodological perspective, theinclusion of promiscuous compounds confounds the study in that the network be-comes very dense if the promiscuous compounds are included. To remedy suchcomplications we removed the 20 most promiscuous compounds and all cofactors,as defined by EcoCyc. However, the overrepresentation of homologous enzymes atMPL=1 remained after these revisions, and is therefore unlikely to be the result ofcommon cofactor-binding domains alone.

The retrograde evolution model predicts that homologous enzyme pairs will befound at close distances in the metabolic network. Like previous studies (Alveset al., 2002; Rison et al., 2002) our study seems to lend at least some support to theretrograde evolution model. To investigate this further we analyzed the homologouspairs of enzymes in order to distinguish between on one hand cases of patchwork evo-lution (broad-to-narrow evolution of enzyme substrate specificity) and on the otherhand cases of retrograde evolution (evolution of a different reaction mechanism).We found that the correlation between homology and network distance at MPL 1is mostly due to homologous enzyme pairs of similar functions (same primary ECnumber). These homologous enzyme pairs with analogous functions have probablyevolved through patchwork evolution. However, there is a statistically significantover-representation of functionally dissimilar (different primary EC numbers) homol-ogous enzyme pairs at MPL 1 which cannot easily be explained by the patchworkevolution model. In conclusion, our study indicates that while retrograde evolutionmay have played a small part, patchwork evolution is the predominant process ofmetabolic enzyme evolution.

26

Page 32: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

5 THE PRESENT INVESTIGATIONS

5.2 Preferential attachment in metabolic networks (PaperII)

While the investigation described in Paper I sought to determine the influence oftwo well established theories regarding the local evolution of metabolic networks,this study brings the investigation to the network level. It has been suggested thatthe scale-free properties, that some biological networks may display, could arise asa result of preferential attachment of new nodes to highly connected nodes throughgene duplication (Qian et al., 2001; Barabasi & Oltvai, 2004), see Figure 10. Pref-erential attachment by gene duplication may take place according to the followingscenario; Initially, the product of the duplicated gene has exactly the same functionand position in the network as its template. Since many proteins are connected tothe hub of the network, the duplicated protein is by chance likely to be connected tothe hub. Subsequently, the duplicate may evolve towards another functionality butit could retain some of its original function. Indeed, it has been shown that geneduplication and gene loss are sufficient to explain the scale-freeness of biologicalnetworks (Wagner, 2003; Amoutzias et al., 2004; van Noort et al., 2004). Nonethe-less, investigations of the predictions generated from preferential attachment on realbiological networks are of interest, since it supports the notion that preferential at-tachment, most likely through gene duplication, is part of the explanation for thepower-law connectivity distribution seen in several biological networks. Such an in-vestigation was conducted previously on the protein-protein interaction network ofS. cerevisiae (Eisenberg & Levanon, 2003). Similarly, we have investigated two pre-dictions generated from the mechanism of preferential attachment in the evolutionof the metabolic network of E. coli K-12.

First, if preferential attachment is of any significance in the evolution of themetabolic network of E. coli, the older enzymes in the network should have a higheraverage connectivity. We have found that E. coli enzymes which are representedin three domains of life, and in eukaryotes but not Archaea, have a higher averageconnectivity than expected by chance. Second, another prediction generated fromthe hypothesis is that highly connected nodes should gain more new edges thannodes with low connectivities. To investigate this prediction we extracted the en-zymes with representatives in 3 domains of life and constructed a representation ofLUCA’s metabolic network. Indeed, we found a positive linear correlation betweenconnectivity in the ancient network and number of connections gained through evo-lution.

Further, we found that the E. coli enzymes which are believed to have undergonehorizontal gene transfer have a higher average connectivity than other enzymes.This result suggests that the highly connected enzymes are often old in the sensethat they are likely to have originated in LUCA and been part of the bacterialmetabolic repertory for a long time, but are sometimes relatively recent additionsto the metabolic network of E. coli K-12. It is possible that bacteria such as E. coli

are adjusting their metabolic networks to a changing environment by replacing the

27

Page 33: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

5 THE PRESENT INVESTIGATIONS

��������������������

��������������������

���������������������

���������������������

�������������������������

�������������������������

�������������������������

�������������������������

��������

��������

�������������������������

�������������������������

� � � � � � ��������������

����������������������������������������������������������������������������������������������

Original sequence

After duplication of one gene�����������������������������������

�������������������������

���������������

���������������

�������������������������

�������������������������

�������������������������

�������������������������

�����������������������������������

�������������������������

���������������

���������������

�����������������������������������

�������������������������

k=4

k=5

Figure 10: Preferential attachment. If one gene is duplicated by chance among the five genes shownin the figure (by patterns), it is likely to be connected to the hub. Therefore, highly connectedproteins tend to gain new connections faster than other proteins.

relatively central enzymes, with high connectivities, for better adapted orthologsfrom other prokaryotic species.

5.3 Domain distance as an alternative homology measure

(Paper III)

Here, the evolution of multi-domain proteins was studied in terms of domain fusionsand repetitions. First, we developed the novel measure domain distance, defined asthe number of domains which differ between two domain architectures (DAs). Sec-ond, we determined the proportion of indels, repetitions and exchanges of domainsin Pfam (Sonnhammer et al., 1997) and SCOP (Bairoch & Apweiler, 1996).

In this study evolutionary events are categorized into three different classes;repetitions, indels (insertions/deletions) and domain exchanges. While domains canbe deleted from a protein, it has been shown that more proteins are created fromfusion (insertion) than fission (deletion) (Snel et al., 2000) and that fusions are fourtimes more common (Kummerfeld & Teichmann, 2005). Therefore, although it isunclear which of the two proteins appeared first, we assume that each multi–domainarchitecture stems from another architecture which is shorter or of the same length.

We found that indels are the most common events, and in contrast, exchangesof domains are quite rare. Repetitions are less frequent than expected, as 55%

28

Page 34: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

5 THE PRESENT INVESTIGATIONS

(eukaryotic dataset) of the multi–domain architectures contain repeating domains.Therefore it is possible that our method may overestimate the fraction of indels atthe expense of repetitions.

Next, we sought to determine the position of the domain rearrangements. Ourfindings show that indels and repetitions seldom occur in the middle of a protein,and arise equally at the N-and C-terminals. Two–domain architectures are oftencreated from two single–domain architectures and since 43-49% of the DAs in ourdataset consist of two–domain architectures, this could largely explain why we getsimilar fractions at both ends, and few events in the middle of the protein. Further, aneighbour with domain distance one could be found for 78-88% of the multi–domainarchitectures. Hence, most events can be explained by additions of a single domain.Finally, insertion/deletion of several domains at once is uncommon and most eventsinvolving several domains are repetitions.

In conclusion, using domain distance we quantified the different evolutionaryevents leading to complex domain architectures. From our investigations, it is clearthat indels are the most common rearrangement events, followed by repetitions.Further, the majority of the events involves the addition of a single domain, exceptin the case of long tandem repeats, where cassettes of a number of domains aresometimes duplicated simultaneously.

5.4 The characteristics of hubs in the protein-protein in-teraction network of Saccharomyces cerevisiae (PaperIV)

In the final investigation presented in this thesis, we have determined some character-istics of highly connected (hub) proteins in the protein-protein interaction networkof S. cerevisiae. The connectivities of the proteins in the network were determinedusing the ’core’ dataset of the Database of Interacting Proteins (see Methods). It ispossible to distinguish two different hub types in the PPIN of S. cerevisiae; statichubs (party hubs) and dynamic hubs (date hubs) (Han et al., 2004). The party hubsare found in static complexes, while the date hubs bind their interaction partnersat different times and/or locations.

First, we investigated the impact of domain composition on connectivity. Onereason for the higher complexity of eukaryotes compared to prokaryotes is the in-creased number of domain combinations found in eukaryotes, where for examplebinding domains have been added to existing catalytic proteins (Ekman et al., 2005).The idea that multi-domain proteins can bind many different proteins is intuitivelyappealing. Indeed, our results show that the proportion of multi-domain proteinsin hubs is larger than the corresponding fraction in non-hubs, which are on averageshorter than hubs. Furthermore, hubs frequently contain repeated domains and, inaddition, date hubs contain more disordered regions than party hubs which suggeststhat disordered regions are particularly important for the flexible binding of date

29

Page 35: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

5 THE PRESENT INVESTIGATIONS

hubs.Second, we wanted to investigate how specialized the hubs are in their binding;

Are these highly connected proteins hubs because they interact with many similarproteins, or because they are able to interact with many different partners withdiverse domain compositions? If hubs gained interactions through duplication oftheir neighbors, many neighbors would be paralogous to each other. Some complexeshave been formed by duplications (Pereira-Leal & Teichmann, 2005) but in general,interactions are often lost by one of the paralogs soon after duplication (Wagner,2001). Consistently, we have found that only a small fraction of the hub interactionscan be explained by interactions with paralogs.

A less stringent definition of homology is when two proteins share a domainfamily. A domain which is recurring in all neighbor proteins could also provide anecessary binding site, and might suggest at least a partial evolutionary relation-ship between the neighbors. Our findings indicate that the Pkinase and WD40domains are frequent among the domains contained in the hub interaction partners.Furthermore, in as many as 23% of the hubs, the most frequently shared domainin the interactors is shared with the hub itself, a feature almost twice as frequentin party hubs as in date hubs. These same domain interactions (SDI) are particu-larly frequent between proteins containing e.g. Pkinase, LSM, proteasome and AAAdomains. However, overall, our findings show that domain recurrence among hubinteraction partners can only explain part the hub interactions.

Finally, we investigated the paralogs of the party and date hubs. The protein-protein interaction network is susceptible to targeted attacks on the hubs of thenetwork (Jeong et al., 2001; Albert et al., 2000). Since hub proteins are pivotalfor the robustness of the protein-protein network, it is conceivable that there is anoverrepresentation of hub duplicates. In other words, hub duplicates could providegenetic redundancy where it is most crucially needed in the network, see Figure 11b.On the other hand, gene duplications may cause an imbalance in the concentrationof the components of protein-protein complexes which might be deleterious (Veitia,2002; Papp et al., 2003), see Figure 11a. The first mechanism predicts that thehubs should have a higher fraction of paralogs than other proteins. In contrast,the latter mechanism, which is sometimes referred to as dosage sensitivity, predictsthe opposite. We found that the duplicability of hub proteins is similar to that ofother proteins. However, there are very few static hub (party hub) paralogs whichoriginate from relatively recent duplications. We hypothesize that the number ofretained party hub duplicates has decreased relative to the duplicates of non-hubsduring the evolution of S. cerevisiae. Although other explanations are possible, wespeculate that the dosage sensitivity of party hubs has become more pronouncedcompared to other proteins during the evolution of S. cereivisae.

30

Page 36: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

6 CONCLUSIONS AND REFLECTIONS

a) b)Dosage sensitivity

100%

Hub redundancy

Duplicate hub

Hub deletion

Hub deletion

Single hub

Preserved topology

Disintegrated networkDuplicationA B +

10%90%

A A’ A B

Figure 11: Dosage sensitivity and hub redundancy. a) Dosage sensitivity for a simple complexconsisting of the proteins A and B. When A is duplicated, its duplicate A’ forms a homodimerwith A. As a result, there is a decrease in the number of AB complexes, which, if this is an importantinteraction in the cell, could be detrimental. b) Hub redundancy. The black node represents thehub in this simple network. If the hub is deleted, the network will be disintegrated. However, ifthere is a duplicate copy for this hub protein, the structure of the network will persist even if oneof the hubs is deleted.

6 Conclusions and reflections

Of the biological network studied herein, the wealth of knowledge of metabolismis likely to be more reliable than the information regarding protein-protein inter-action networks. However, knowledge of metabolic networks originates from themodel organisms E. coli and S. cerevisiae to a great extent, and is therefore subjectto substantial bias. Therefore, it can be expected that much will be learnt aboutmetabolism of higher eukaryotes in the years to come. Clearly, an important step willbe to relate the more than 1,500 known enzymatic activities whose correspondingenzymes have not yet been identified (Lespinet & Labedan, 2006). However, whilemuch is to be learnt about the metabolism, and its evolution, among the higher eu-karyotes, it is clear that far more remains to be discovered regarding protein-proteininteraction networks. Certainly, if many of the orphans correspond to functionswhich, like their sequences, are unique to their specific organisms, a steady streamof novel biological functions will be revealed for years to come.

Here, we have investigated the evolution of the metabolic network of E. coli,where we found that enzymes primarily evolve through duplication of enzymes withbroad substrate specificities, by gene duplication, to enzymes with more specializedsubstrate specificities. Further, we found some evidence supporting preferential

31

Page 37: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

6 CONCLUSIONS AND REFLECTIONS

attachment in metabolic networks, in that enzymes which are older have a higherconnectivity and enzymes which have a higher connectivity in a representation of theancient network, consisting of the enzymes which occur in all domains of life, tend togain more new edges than other enzymes. Studying protein evolution with respect todomain combinations, we found that the domain rearrangements which are the mostcommon in proteins are single indels taking place at the N-and C-terminals. Finally,we investigated the characteristics of the hub proteins of the yeast protein-proteininteraction network and found that repeating domains are quite common amongthe hubs, while disorder is particularly common in the hubs involved in transientinteractions (date hubs). Furthermore, the hubs which are at the cores of complexes(party hubs) have few recent paralogs compared to other proteins. This could, forexample, indicate that their paralogs diverge fast, which is contradicted by the resultthat party hubs have the slowest evolutionary rate of all proteins (Fraser, 2005), orby an increase in dosage sensitivity for party hubs compared to other proteins duringevolution.

There appears to be a possible contradiction in the most recent papers regard-ing the evolution of protein-protein networks. On one hand, dosage sensitivity hasbecome firmly established as a factor which restricts duplications of proteins whichare part of protein complexes (Veitia, 2002; Papp et al., 2003). On the other hand,many protein complexes show similarities to each other (Pereira-Leal & Teichmann,2005), which is hard to explain other than through gene, or genome, duplication.Genome duplication could be the best explanation for the evolution of complexes,since mutually paralogous complexes can evolve in this manner while dosage im-balance is avoided. However, our investigation in Paper IV shows that party hubduplicates are underrepresented among the paralogous pairs originating from thewhole genome duplication of yeast, suggesting that for these party hubs, the wholegenome duplication did not lead to a particularly relaxed dosage sensitivity. Furtherstudies are needed to explain the disparate observations, but it is possible that non-hub duplicates are frequently retained since they are generally more recent proteinswhere substitutions may lead to improvements.

32

Page 38: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

7 ACKNOWLEDGEMENTS

7 Acknowledgements

I would like to thank Stiftelsen for strategisk forskning for financial support. Fur-thermore, I’d sincerely like to thank the following people:

Arne Elofsson for being an encouraging, inspiring and just, simply, wonderful advi-sor.

Per Kraulis for recruiting me as a PhD candidate, teaching me a lot about biologyand for invaluable lessons in scientific argumentation.

Karin Melen for being a great friend; for all the fun talks, always helping out and,most of all, for your contagiously positive personality.

Diana Ekman and Asa K. Bjorklund for being amazing co-workers in every way.

Olivia Eriksson and Ann-Charlotte Berglund Sonnhammer for all the dinner partiesand for many inspiring scientific discussions.

To everyone in Arne’s group and all the people at SBC, past and present, for makingSBC a great place to be.

Anna-Karin for your friendship - and all the fun pop culture talks, not to mentionthose Buffy episodes.

My aunt Birgit for helping me when I most needed it.

My sister Maria for always being up for a heart-to-heart fashion photography chat.You are my twin spirit.

Mamma Ulla och Pappa Krille. Thank you for all your support during these years.I can safely say that this thesis would never have been completed without either oneof you - at least three times over. You are the best.

Bill, for always sharing your independent thoughts with me, teaching me so much Ididn’t know and for putting up with my long work days during the last year. I loveyou.

33

Page 39: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

References

Albert, R., Jeong, H. and Barabasi, A. L. 2000. Error and attack tolerance ofcomplex networks. Nature 406:378–382.

Aloy, P. and Russell, R. B. 2002. The third dimension for protein interactions andcomplexes. Trends Biochem Sci 27:633–638.

Alves, R., Chaleil, R. A. and Sternberg, M. J. 2002. Evolution of enzymes inmetabolism: a network perspective. J Mol Biol 320:751–770.

Amoutzias, G. D., Robertson, D. L., Oliver, S. G. and Bornberg-Bauer, E. 2004.Convergent evolution of gene networks by single-gene duplications in highereukaryotes. EMBO Rep 5:274–279.

Apic, G., Gough, J. and Teichmann, S. A. 2001a. An insight into domain combina-tions. Bioinformatics 17:S83–S89.

Apic, G., Gough, J. and Teichmann, S. A. 2001b. Domain combinations in archaeal,eubacterial and eukaryotic proteomes. J Mol Biol 310:311–325.

Apic, G., Ignjatovic, T., Boyer, S. and Russell, R. B. 2005. Illuminating drugdiscovery with biological pathways. FEBS Lett 579:1872–1877.

Arita, M. 2004. The metabolic world of Escherichia coli is not small. Proc Natl

Acad Sci USA 101:1543–1547.

Bader, G. D. and Hogue, C. W. 2002. Analyzing yeast protein-protein interactiondata obtained from different sources. Nat Biotechnol 20:991–997.

Bairoch, A. and Apweiler, R. 1996. The swiss-prot protein sequence data bank andits new supplement trembl. Nucleic Acids Res 24:21–25.

Ball, C. A. and Cherry, J. M. 2001. Genome comparisons highlight similarity anddiversity within eukaryotic genomes. Curr Opin Chem Biol 5:86–89.

Barabasi, A. L. and Albert, R. 1999. Emergence of scaling in random networks.Science 286:509–512.

Barabasi, A-L. and Oltvai, Z. N. 2004. Network biology: understanding the cell’sfunctional organization. Nat Rev Genet 5:101–113.

Belfaiza, J., Parsot, C., Martel, A., de la Tour, C. B, Margarita, D., Cohen, G. N.and Saint-Girons, I. 1986. Evolution in biosynthetic pathways: two enzymescatalyzing consecutive steps in methionine biosynthesis originate from a com-mon ancestor and possess a similar regulatory region. Proc Natl Acad Sci USA

83:867–871.

34

Page 40: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Boucher, Y., Douady, C. J., Papke, R. T., Walsh, D. A., Boudreau, M. E., Nesbo,C. L., Case, R. J. and Doolittle, W. F. 2003. Lateral gene transfer and theorigins of prokaryotic groups. Annu Rev Genet 37:283–328.

Byrne, K. P. and Wolfe, K. H. 2005. The yeast gene order browser: combiningcurated homology and syntenic context reveals gene fate in polyploid species.Genome Res 15:1456–1461.

Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. andHerskowitz, I. 1998. The transcriptional program of sporulation in buddingyeast. Science 282:699–705.

Claverie, J. M. 2001. Gene number. what if there are only 30,000 human genes?Science 291:1255–1257.

Conant, G. C. and Wolfe, K. H. 2006. Functional partitioning of yeast co-expressionnetworks after genome duplication. PLoS Biol 4:epub.

Consortium., International Human Genome Sequencing 2001. Initial sequencing andanalysis of the human genome. Nature 409:860–921.

Copley, R. R. and Bork, P. 2000. Homology among (β/α)8 barrels: implications forthe evolution of metabolic pathways. J Mol Biol 303:627–641.

Cornell, M., Paton, N. W. and Oliver, S. G. 2004. A critical and integrated view ofthe yeast interactome. Comp Func Genom 5:382–402.

Coulomb, S., Bauer, M., Bernard, D. and Marsolier-Kergoat, M. C. 2005. Geneessentiality and the topology of protein interaction networks. Proc Biol Sci

272:1721–1725.

Dandekar, T., Snel, B., Huynen, M. and Bork, P. 1998. Conservation of gene order: afingerprint of proteins that physically interact. Trends Biochem Sci 23:324–328.

Date, S. V. and Marcotte, E. M. 2003. Discovery of uncharacterized cellular systemsby genome-wide analysis of functional linkages. Nat Biotechnol 21:1055–1062.

Deane, C. M., Salwinski, L., Xenarios, I. and Eisenberg, D. 2002. Protein interac-tions: two methods for assessment of the reliability of high throughput obser-vations. Mol Cell Proteomics 1:349–356.

Edwards, A. M., Kus, B., Jansen, R., Greenbaum, D., Greenblatt, J. and Gerstein,M. 2002. Bridging structural biology and genomics: assessing protein interac-tion data with known complexes. Trends Genet 18:529–536.

Eisenberg, E. and Levanon, E. Y. 2003. Preferential attachment in the proteinnetwork evolution. Phys Rev Lett 91:128701.

35

Page 41: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Ekman, D., Bjorklund, A. K., Frey-Skott, J. and Elofsson, A. 2005. Multi–domainproteins in the three kingdoms of life - orphan domains and other unassignedregions. J Mol Biol 348:231–243.

Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. 1999. Proteininteraction maps for complete genomes based on gene fusion events. Nature

402:86–90.

Fani, R., Lio, P. and Lazcano, A. 1995. Molecular evolution of the histidine biosyn-thetic pathway. J Mol Biol 41:760–774.

Fields, S. 2005. High-throughput two-hybrid analysis. the promise and the peril.FEBS J 272:5391–5399.

Fields, S. and Song, O. 1989. A novel genetic system to detect protein-proteininteractions. Nature 340:245–246.

Force, A., Lynch, M. B., Pickett, B., Amores, A. and Yan, Y.-L. 1999. Preservation ofdupicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.

Fraser, H. B. 2005. Modularity and evolutionary constraint on proteins. Nat Genet

37:351–352.

Fraser, H. B., Hirsh, A. E., Steinmetz, L. M., Scharfe, C. and Feldman, M. W. 2002.Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., Hirsh, A. E., Wall, D. P. and Eisen, M. B. 2004. Coevolution of geneexpression among interacting proteins. Proc Natl Acad Sci USA 101:9033–903.

Fraser, H. B., Wall, D. P. and Hirsh, A. E. 2003. A simple dependence betweenprotein evolution rate and the number of protein-protein interactions. BMC

Evol Biol 3.

Gandhi, T. K., Zhong, J., Mathivanan, S., Karthick, L., Chandrika, K. N., Mohan,S. S., Sharma, S., Pinkert, S., Nagaraju, S., Periaswamy, B., Mishra, G., Nan-dakumar, K., Shen, B., Deshpande, N., Nayak, R., Sarker, M., Boeke, J. D.,Parmigiani, G., Schultz, J., Bader, J. S. and Pandey, A. 2006. Analysis of thehuman protein interactome and comparison with yeast, worm and fly interac-tion datasets. Nat Genet 38:285–293.

Gasch, A. P. and Werner-Washburne, M. 2002. The genomics of yeast responses toenvironmental stress and starvation. Funct Integr Genomics 2:181–182.

Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz,J., Rick, J. M., Michon, A. M., Cruciat, C. M., Remor, M., Hofert, C., Schelder,M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D.,

36

Page 42: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier,M. A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G.,Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G.and Superti-Furga, G. 2002. Functional organization of the yeast proteome bysystematic analysis of protein complexes. Nature 415:141–147.

Gerrard, J. A., Sparrow, A. D. and Wells, J. A. 2001. Metabolic databases - whatnext? Trends Biochem Sci 26:137–140.

Gerstein, M. and Levitt, M. 1998. Comprehensive assessment of automatic structuralalignment against a manual standard, the scop classification of proteins. Protein

Sci 7:445–456.

Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y. L.,Ooi, C. E., Godwin, B., Vitols, E., Vijayadamodar, G., Pochart, P., Machi-neni, H., Welsh, M., Kong, Y., Zerhusen, B., Malcolm, R., Varrone, Z., Collis,A., Minto, M., Burgess, S., McDaniel, L., Stimpson, E., Spriggs, F., Williams,J., Neurath, K., Ioime, N., Agee, M., Voss, E., Furtak, K., Renzulli, R., Aa-nensen, N., Carrolla, S., Bickelhaupt, E., Lazovatsky, Y., DaSilva, A., Zhong, J.,Stanyon, C. A., Finley, R. L., White, K. P., Braverman, M., Jarvie, T., Gold,S., Leach, M., Knight, J., Shimkets, R. A., McKenna, M. P., Chant, J. andRothberg, J. M. 2003. A protein interaction map of Drosophila melanogaster.Science 302:1727–1736.

Gleiss, P. M., Stadler, P. F., Wagner, A. and Fell, D. A. 2001. Relevant cycles inchemical reaction networks. Adv Complex Syst 4:207–226.

Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H.,Galibert, F., Hoheisel, J. D., Jacq, C, Johnston, M., Louis, E. J., Mewes,H. W., Murakami, Y., Philippsen, P., Tettelin, H. and Oliver, S. G. 1996. Lifewith 6000 genes. Science 274:563–567.

Goh, C. S. and Cohen, F. E. 2002. Co-evolutionary analysis reveals insights intoprotein-protein interactions. J Mol Biol 324:177–192.

Goldovsky, L., Cases, I., Enright, A. J. and Ouzounis, C. A. 2005. Biolayout(java):versatile network visualisation of structural and functional relationships. Ap-

plied Bioinformatics 4:71–74.

Goto, S., Okuno, Y., Hattori, M., Nishioka, T. and Kanehisa, M. 2002. LIGAND :database of chemical compounds and reactions in biological pathways. Nucleic

Acids Res 30:402–404.

Grigoriev, A. 2001. A relationship between gene expression and protein interac-tions on the proteome scale: analysis of the bacteriophage t7 and the yeastSaccharomyces cerevisiae. Nucleic Acids Res 29:3513–3519.

37

Page 43: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Gu, Z., Steinmetz, L. M., Gu, Z., Scharfe, C., Davis, R. W. and Li, W. H. 2003.Role of duplicate genes in genetic robustness against null mutations. Nature

421:63–66.

Han, J. D., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy,D., Walhout, A. J., Cusick, M. E., Roth, F. P. and Vidal, M. 2004. Evidencefor dynamically organized modularity in the yeast protein-protein interactionnetwork. Nature 430:88–93.

Han, J. D., Dupuy, D., Bertin, N., Cusick, M. E. and Vidal, M. 2005. Effect ofsampling on topology predictions of protein-protein interaction networks. Nat

Biotechnol 23:839–844.

Heckman, D. S., Geiser, D. M., Eidell, B. R., Stauffer, R. L., Kardos, N. L. andHedges, S. B. 2001. Molecular evidence for the early colonization of land byfungi and plants. Science 293:1129–1133.

Hirsh, A. E. and Fraser, H. B. 2001. Protein dispensability and rate of evolution.Nature 411:1046–1049.

Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar,A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I.,Schandorff, S., Shewnarane, J., Vo, M., Taggart, J., Goudreault, M., Muskat,B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems, A. R., Sassi,H., Nielsen, P. A., Rasmussen, K. J., Andersen, J. R., Johansen, L. E., Hansen,L. H., Jespersen, H., Podtelejnikov, A., Nielsen, P. A., Crawford, J., Poulsen,V., Sorensen, B. D., Matthiesen, J., Hendrickson, R. C., Gleeson, F., Pawson,T., Moran, M. F., Durocher, D., Mann, M., Hogue, C. W., Figeys, D. andTyers, M. 2002. Systematic identification of protein complexes in Saccharomyces

cerevisiae by mass spectrometry. Nature 415:180–183.

Horowitz, N. H. 1945. On the evolution of biochemical syntheses. Proc Natl Acad

Sci USA 31:153–157.

Horowitz, N. H. 1965. The evolution of biochemical syntheses - retrospect andprospect. In Evolving genes and proteins, (Bryson, V. and Vogel, H. J., eds),pp. 15–23, Academic Press, New York.

Huynen, M. A., Dandekar, T. and Bork, P. 1999. Variation and evolution of thecitric-acid cycle: a genomic perspective. Trends Microbiol 7:281–291.

Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. 2001. Acomprehensive two-hybrid analysis to explore the yeast protein interactome.Proc Natl Acad Sci USA 98:4277–4278.

38

Page 44: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Jansen, R., Greenbaum, D. and Gerstein, M. 2002. Relating whole-genome expres-sion data with protein-protein interactions. Genome Res 12:37–46.

Jensen, R. A. 1976. Enzyme recruitment in evolution of new function. Annu Rev

Microbiol 30:409–425.

Jeong, H., Mazon, S. P., Barabasi, A. L. and Oltvai, Z. N. 2001. Lethality andcentrality in protein networks. Nature 411:41–42.

Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. and Barabasi, A. L. 2000. Thelarge-scale organization of metabolic networks. Nature 407:651–654.

Jin, F., Hazbun, T., Michaud, G. A., Salcius, M., Predki, P. F., Fields, S. and Huang,J. 2006. A pooling-deconvolution strategy for biological network elucidation.Nat Methods 3:183–189.

Jordan, I. K., Wolf, Y. I. and Koonin, E. V. 2003. No simple dependence betweenprotein evolution rate and the number of protein-protein interactions: only themost prolific interactors tend to evolve slowly. BMC Evol Biol 3.

Kanehisa, M., Goto, S., Kawashima, S. and Nakaya, A. 2002. The KEGG databasesat GenomeNet. Nucleic Acids Res 30:42–46.

Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Collado-Vides, J., Paley, S. M.,Pellegrini-Toole, A., Bonavides, C. and Gama-Castro, S. 2002. The ecocycdatabase. Nucleic Acids Res 30:56–58.

Kelley, B. P., Sharan, R., Karp, R. M., Sittler, T., Root, D. E., Stockwell, B. R. andIdeker, T. 2003. Conserved pathways within bacteria and yeast as revealed byglobal protein network alignment. Proc Natl Acad Sci USA 100:11394–11399.

Kellis, M., Birren, B. W. and Lander, E. S. 2004. Proof and evolutionary analysisof ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature

428:617–624.

Kemmeren, P., van, Berkum, Vilo, J., Bijma, T., Donders, R., Brazma, A. andHolstege, F. C. 2002. Protein interaction verification and functional annotationby integrated analysis of genome-scale data. Mol Cell 9:1133–1143.

Kummerfeld, S. K. and Teichmann, S. A. 2005. Relative rates of gene fusion andfission in multi-domain proteins. Trends Genet 21:25–30.

Kunin, V., Pereira-Leal, J. B. and Ouzounis, C. A. 2004. Functional evolution ofthe yeast protein interaction network. Mol Biol Evol 21:1171–1176.

Kurland, C. G., Canback, B. and Berg, O. G. 2003. Horizontal gene transfer: acritical view. Proc Natl Acad Sci USA 100:9658–9662.

39

Page 45: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Lawrence, J. G. and Ochman, H. 1997. Amelioration of bacterial genomes: rates ofchange and exchange. J Mol Evol 44:383–397.

Lawrence, J. G. and Roth, J. R 1996. Selfish operons: horizontal transfer may drivethe evolution of gene clusters. Genetics 143:1843–1860.

Lazcano, A. and Miller, S. L. 1996. The origin and early evolution of life: prebioticchemistry, the pre-rna world, and time. Cell 85:793–798.

Lespinet, O. and Labedan, B. 2006. Puzzling over orphan enzymes. Cell Mol Life

Sci 63:517–523.

Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain,P. O., Han, J. D., Chesneau, A., Hao, T., Goldberg, D. S., Li, S., Martinez,M., Rual, J. F., Lamesch, P., Xu, L., Tewari, M., Wong, S. L., Zhang, L. V.,Berriz, G. F., Jacotot, L., Vaglio, P., Reboul, J., Hirozane-Kishikawa, T., Li,S., Gabel, H. W., Elewa, A., Baumgartner, B., Rose, D. J., Yu, H., Bosak,S., Sequerra, R., Fraser, A., Mango, S. E., Saxton, W. M., Strome, S., Van,D., Piano, F., Vandenhaute, J., Sardet, C., Gerstein, M., Doucette-Stamm, L.,Gunsalus, K. C., Harper, J. W., Cusick, M. E., Roth, F. P., Hill, D. E. andVidal, M. 2004. A map of the interactome network of the metazoan C. elegans.Science 303:540–543.

Liu, J. and Rost, B. 2004. Chop proteins into structural domain-like fragments.PROTEINS: Structure, Function and Bioinformatics 55:678–688.

Lynch, M. and Conery, J. S. 2000. The evolutionary fate and consequences ofduplicate genes. Science 290:1151–1155.

Ma, H. and Zeng, A. P. 2003. Reconstruction of metabolic networks from genomedata and analysis of their global structure for various organisms. Bioinformatics

19:270–277.

Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. and Eisen-berg, D. 1999. Detecting protein function and protein-protein interactions fromgenome sequences. Science 285:751–753.

Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs,M., Morgenstern, B., Munsterkotter, M., Rudd, S. and Weil, B. 2002. Mips: adatabase for genomes and protein sequences. Nucleic Acids Res 30:31–34.

Mintseris, J. and Weng, Z. 2005. Structure, function, and evolution of transientand obligate protein-protein interactions. Proc Natl Acad Sci USA 102:10930–10935.

Mrowka, R., Patzak, A. and Herzel, H. 2001. Is there a bias in proteome research?Genome Res 11:1971–1973.

40

Page 46: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

NC-IUBMB 1992. Enzyme Nomenclature. Academic Press, San Diego, California.

Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin.

Overbeek, R., Larsen, N., Pusch, G. D., D’Souza, M., Jr. Selkov, E., Kyrpides,N., Fonstein, M., Maltsev, N. and Selkov, E. 2000. WIT : integrated systemfor high-throughput genome sequence analysis and metabolic reconstruction.Nucleic Acids Res 28:123–125.

Pal, C., Papp, B. and Lercher, M. J. 2005. Adaptive evolution of bacterial metabolicnetworks by horizontal gene transfer. Nature Genet 37:1372–1375.

Papp, B., Pal, C. and Hurst, L. D. 2003. Dosage sensitivity and the evolution ofgene families in yeast. Nature 424:194–197.

Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. and Yeates, T. O.1999. Assigning protein functions by comparative genome analysis: proteinphylogenetic profiles. Proc Natl Acad Sci USA 96:4285–4288.

Peregrin-Alvarez, J. M., Tsoka, S. and Ouzounis, C. A. 2003. The phylogeneticextent of metabolic enzymes and pathways. Genome Res 13:422–427.

Pereira-Leal, J B and Teichmann, S A 2005. Novel specificities emerge by stepwiseduplication of functional modules. Genome Res 15:552–559.

Petsko, G. A., Kenyon, G. L., Gerlt, J. A., Ringe, D. and Kozarich, J. W. 1993. Onthe origin of enzymatic species. Trends Biochem Sci 18:372–376.

Qian, J., Luscombe, N. M. and Gerstein, M. 2001. Protein family and fold occurrencein genomes: power-law behaviour and evolutionary model. J Mol Biol 313:673–681.

Remm, M., Storm, C. E. and Sonnhammer, E. L. 2001. Automatic clusteringof orthologs and in-paralogs from pairwise species comparisons. J Mol Biol

314:1041–1052.

Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M., Mann, M. and Seraphin, B. 1999.A generic protein purification method for protein complex characterization andproteome exploration. Nat Biotechnol 17:1030–1032.

Rison, S. C., Teichmann, S. A. and Thornton, J. M. 2002. Homology, pathway dis-tance and chromosomal localisation of the small molecule metabolism enzymesin Escherichia coli. J Mol Biol 318:911–932.

Roberts, C. J., Nelson, B., Marton, M. J., Stoughton, R., Meyer, M. R., Bennett,H. A., He, Y. D., Dai, H., Walker, W. L., Hughes, T. R., Tyers, M., Boone,C. and Friend, S. H. 2000. Signaling and circuitry of multiple mapk pathwaysrevealed by a matrix of global gene expression profiles. Science 287:873–880.

41

Page 47: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Rost, B. 2002. Did evolution leap to create the protein universe? Curr Opin Struct

Biol 12:409–416.

Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N.,Berriz, G. F., Gibbons, F. D., Dreze, M., Ayivi-Guedehoussou, N., Klitgord,N., Simon, C., Boxem, M., Milstein, S., Rosenberg, J., Goldberg, D. S., Zhang,L. V., Wong, S. L., Franklin, G., Li, N., Albala, J. S., Lim, J., Fraughton, C.,Llamosas, E., Cevik, S., Bex, C., Lamesch, P., Sikorski, R. S., Vandenhaute,J., Zoghbi, H. Y., Smolyar, A., Bosak, S., Sequerra, R., Doucette-Stamm, L.,Cusick, M. E., Hill, D. E., Roth, F. P. and Vidal, M. 2005. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437:1173–1178.

Saeed, R. and Deane, C. M. 2006. Protein protein interactions, evolutionary rate,abundance and age. BMC Bioinformatics 7:128.

Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U. and Eisenberg,D. 2004. The database of interacting proteins: 2004 update. Nucleic Acids Res

32:D449–D451.

Saqi, M. A. and Sternberg, M. J. 2001. A structural census of metabolic networksfor E. coli. J Mol Biol 313:1195–1206.

Schomburg, I., Chang, A. and Schomburg, D. 2002. Brenda, enzyme data andmetabolic information. Nucleic Acids Res 30:47–49.

Schuster, S., Fell, D. A. and Dandekar, T. 2000. A general definition of metabolicpathways useful for systematic organization and analysis of complex metabolicnetworks. Nat Biotechnol 18:326–332.

Snel, B., Bork, P. and Huynen, M. 2000. Genome evolution. gene fusion versus genefission. Trends Genet 16:9–11.

Snel, B., Bork, P. and Huynen, M. A. 2002. Genomes in flux: the evolution ofarchaeal and proteobacterial gene content. Genome Res 12:17–25.

Sonnhammer, E. L., Eddy, S. R. and Durbin, R. 1997. Pfam: a comprehensivedatabase of protein families based on seed alignments. Proteins 28:405–420.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B.,Brown, P. O., Botstein, D. and Futcher, B. 1998. Comprehensive identificationof cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarrayhybridization. Mol Biol Cell 9:3273–3297.

Sprinzak, E., Sattath, S. and Margalit, H. 2003. How reliable are experimentalprotein-protein interaction data? J Mol Biol 327:919–923.

42

Page 48: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

Suthram, S., Sittler, T. and Ideker, T. 2005. The plasmodium protein networkdiverges from those of other eukaryotes. Nature 438:108–112.

Tatusov, R., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin,E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N., Rao,B. S., Smirnov, S., Sverdlov, A. V., Vasudevan, S., Wolf, Y. I., Yin, J. J. andNatale, D. A. 2003. The cog database: an updated version includes eukaryotes.BMC Bioinformatics 4:41.

Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. 2000. The cogdatabase: a tool for genome-scale analysis of protein functions and evolution.Nucleic Acids Res 28:33–36.

Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. andChothia, C. 2001. The evolution and structural anatomy of the small moleculemetabolic pathways in Escherichia coli. J Mol Biol 311:693–708.

Thornton, J. W. 2001. Evolution of vertebrate steroid receptors from an ancestralestrogen receptor by ligand exploitation and serial genome expansions. Proc

Natl Acad Sci USA 98:5671–5676.

Travers, K. J., Patil, C. K., Wodicka, L., Lockhart, D. J., Weissman, J. S. andWalter, P. 2000. Functional and genomic analyses reveal an essential coordina-tion between the unfolded protein response and er-associated degradation. Cell

101:249–258.

Tsoka, S. and Ouzounis, C. A. 2001. Functional versatility and molecular diversityof the metabolic map of Escherichia coli. Genome Res 11:1503–1510.

Uetz, P. and Finley, R. L. Jr. 2005. From protein networks to biological systems.FEBS Lett 579:1821–1827.

Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R.,Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li,Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M.,Johnston, M., Fields, S. and Rothberg, J. M. 2000. A comprehensive analysis ofprotein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–627.

van Noort, V., Snel, B. and Huynen, M. A. 2004. The yeast coexpression network hasa small-world, scale-free architecture and can be explained by a simple model.EMBO Rep 5:280–284.

Veitia, R. A. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–184.

Venter, J. C. and et al 2001. The sequence of the human genome. Science 291:1304–1351.

43

Page 49: Investigations into the evolution of biological networks189160/FULLTEXT01.pdf · Investigations into the evolution of biological networks Sara Light, sara@sbc.su.se Stockholm Bioinformatics

REFERENCES

von Mering, C., Zdobnov, E. M., Tsoka, S., Ciccarelli, F. D., Pereira-Leal, J. B.,Ouzounis, C. A. and Bork, P. 2003. Genome evolution reveals biochemicalnetworks and functional modules. Proc Natl Acad Sci USA 100:15428–15433.

Wagner, A. 2000. Mutational robustness in genetic networks of yeast. Nature Gen

24:355–361.

Wagner, A. 2001. The yeast protein interaction network evolves rapidly and containsfew redundant duplicate genes. Mol Biol Evol 18:1283–1292.

Wagner, A. 2003. How large protein interaction networks evolve. Proc R Soc Lond

B 270:457–466.

Wagner, A. 2004. The connectivity of large genetic networks: design, history, ormere chemistry? To appear in ’Power laws, scale-free networks and genomebiology.’ by Karev, Wolf and Koonin.

Wagner, A. and Fell, D. A. 2001. The small world inside large metabolic networks.Proc R Soc Lond B Biol Sci 268:1803–1810.

Warner, G. J., Adeleye, Y. A. and Ideker, T. 2006. Interactome networks: the stateof the science. Genome Biol 7:301.

Wilmanns, M., Hyde, C. C., Davies, D. R., Kirschner, K. and Jansonius, J. N.1991. Structural conservation in parallel beta/alpha-barrel enzymes that cat-alyze three sequential reactions in the pathway of tryptophan biosynthesis. Bio-

chemistry 30:9161–9169.

Wuchty, S. 2001. Scale-free behavior in protein domain networks. Mol Biol Evol

18:1694–1702.

Zhang, Z., Luo, Z. W., Kishino, H. and J., Kearsey M. 2005. Divergence pattern ofduplicate genes in protein-protein interactions follows the power law. Mol Biol

Evol 22:501–505.

44