

Conservation and Coevolution in the Scale-Free Human Gene Coexpression Network

I. King Jordan, Leonardo Mariño-Ramírez, Yuri I. Wolf, and Eugene V. Koonin
National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland

The role of natural selection in biology is well appreciated. Recently, however, a critical role for physical principles of network self-organization in biological systems has been revealed. Here, we employ a systems-level view of genome-scale sequence and expression data to examine the interplay between these two sources of order, natural selection and physical self-organization, in the evolution of human gene regulation. The topology of a human gene coexpression network, derived from tissue-specific expression profiles, shows scale-free properties that imply evolutionary self-organization via preferential node attachment. Genes with numerous coexpressed partners (the hubs of the coexpression network) evolve more slowly on average than genes with fewer coexpressed partners, and genes that are coexpressed show similar rates of evolution. Thus, the strength of selective constraints on gene sequences is affected by the topology of the gene coexpression network. This connection is strong for the coding regions and 3′ untranslated regions (UTRs), but the 5′ UTRs appear to evolve under a different regime. Surprisingly, we found no connection between the rate of gene sequence divergence and the extent of gene expression profile divergence between human and mouse. This suggests that distinct modes of natural selection might govern sequence versus expression divergence, and we propose a model, based on rapid, adaptation-driven divergence and convergent evolution of gene expression patterns, for how natural selection could influence gene expression divergence.

Introduction

The recent genomic sequencing efforts yielded detailed lists of genes and the proteins that they encode (Lander et al. 2001; Waterston et al. 2002), and functional genomics studies have gone a step further by elucidating many of the processes and interactions that individual proteins are involved in (Ho et al. 2002; Giot et al. 2003; Kamath et al. 2003; Li et al. 2004). Building on the success of these high-throughput experimental approaches, a synthetic view of how individual genes and proteins emerge and act collectively to carry out the business of the cell is needed to facilitate a deeper understanding of biological function and evolution. Systems-based approaches to biology seek to meet this challenge by emphasizing the patterns and processes that govern how collections of biological molecules are assembled and ordered (Pennisi 2003).

The agent that probably has been most often invoked to explain the ordering of biological systems over time is natural selection (Darwin 1859; Li 1997). Genome-scale studies on natural selection have detailed many of the factors that mitigate the effects of selection on the evolution of gene sequences. Such surveys rely on comparisons between evolutionary rates, which yield information about the action of natural selection, and various quantifiable functional genomic parameters. For instance, several studies have demonstrated a relationship between gene evolutionary rates and the fitness effects associated with gene knockouts. Genes with greater fitness effects (e.g., essential genes) seem to evolve more slowly, on average, than genes with smaller fitness effects (Hirsh and Fraser 2001; Jordan et al. 2002). This is taken to suggest that essential genes evolve under stronger functional constraints and, thus, a more severe purifying selection regime, than nonessential genes. Similarly, genes that encode proteins involved in numerous protein-protein interactions have been reported to be more evolutionarily conserved than genes encoding less-prolific interactors (Fraser et al. 2002; Fraser, Wall, and Hirsh 2003). A recent study that dealt with several such relationships simultaneously demonstrated correlations between different measures of evolutionary conservation and various functional genomic parameters (Krylov et al. 2003).

However, the findings of some of these evolutionary genomics studies have been challenged. The possibility that the observed effects of any one genomic parameter on evolutionary rates can be confounded by the correlations between different genomic parameters has been raised repeatedly. For example, some of the strongest correlations seen are between evolutionary rates and gene expression levels. Genes that are expressed at high levels and in numerous tissues tend to be more conserved than genes with lower and narrower expression patterns (Duret and Mouchiroud 2000; Pal, Papp, and Hurst 2001; Krylov et al. 2003; Zhang and Li 2004). When the effects of expression level are controlled for, the correlations between evolutionary rate and fitness effects, as well as between evolutionary rate and the number of protein-protein interactions, are mitigated (Bloom and Adami 2003; Pal, Papp, and Hurst 2003). Furthermore, when duplicate genes were removed from consideration, the relationship between fitness effects and evolutionary rate disappeared (Yang, Gu, and Li 2003). These controversies remain unsettled, and the general question of how various functional genomic parameters interact to affect evolutionary rate is open.

In addition to natural selection, an emphasis has recently been placed on the role of fundamental physical principles in imposing order on biological systems (Barabasi and Oltvai 2004). Various complex biological systems have been abstracted as networks where the nodes in the network represent the individual parts, such as proteins or metabolites, and the links in the network represent the interactions between the parts (Jeong et al. 2000, 2001; Luscombe et al. 2002; Ravasz et al. 2002). Studies of the statistical properties of the topologies of such networks suggest specific mechanisms that govern their evolution. In particular, many biological networks show scale-free topological properties that can be explained by a model of network growth via preferential attachment of new nodes to existing nodes that are already highly connected (Barabasi and Albert 1999). At the genomic level, gene duplication is thought to underlie the phenomenon of preferential attachment (Rzhetsky and Gomez 2001; Bhan, Galas, and Dewey 2002; Barabasi and Oltvai 2004). Existing highly connected nodes (i.e., genes or proteins) are more likely, simply by virtue of their large number of connections, to be linked to nodes that are duplicated. Because the duplicated nodes are expected to maintain the same links as the ancestral singleton, the connectivity of a highly connected node will increase with duplication (Barabasi and Oltvai 2004). This process alone can lead to network growth by preferential attachment. The ubiquity of scale-free network topological patterns suggests that network growth by preferential attachment, via mechanisms such as gene duplication, is a fundamental and conserved evolutionary process.

Key words: Gene expression, Human evolution, Natural selection, Network, Self-organization, Substitution rate.

E-mail: koonin@ncbi.nlm.nih.gov.

Mol. Biol. Evol. 21(11):2058–2070. 2004
doi:10.1093/molbev/msh222
Advance Access publication July 28, 2004

Molecular Biology and Evolution vol. 21 no. 11 © Society for Molecular Biology and Evolution 2004; all rights reserved.

Downloaded from https://academic.oup.com/mbe/article/21/11/2058/1148025 by guest on 06 December 2021

In this work, we attempted to integrate the two perspectives of natural selection and physical self-organization to analyze the evolution of gene sequence and expression patterns. Comparison of human and mouse genome sequence data was combined with the analysis of gene expression profiles that were derived from microarray experiments on a number of tissues in both species. Human gene expression profiles were used to reconstruct a network of coexpressed genes, and we demonstrate the effect of the network topology on the strength of natural selection as well as an unexpected relationship between human-mouse gene sequence and gene expression divergence. These results underscore the influence of expression network self-organization on gene evolution and suggest that distinct mechanisms are responsible for the evolution of expression patterns and gene sequences.

Materials and Methods

Human and mouse gene expression levels were taken from a recently published series of Affymetrix microarray experiments (Su et al. 2002) and were retrieved from the Gene Expression Omnibus database at the National Center for Biotechnology Information (NCBI). The two data set flat files, GDS181.soft (human) and GDS182.soft (mouse), were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft. Affymetrix probe identifiers were mapped to individual loci in the human and mouse genomes using the LocusLink database (NCBI, NIH, Bethesda). A total of 7,383 human and 6,724 mouse loci corresponded to the probe identifiers. The results of microarray experiments on cancerous tissues were removed to yield profiles of normal mammalian transcriptomes. This left data from a total of 63 human and 89 mouse microarray experiments that were subsequently analyzed. Levels of expression, recorded as average difference (AD) values, for redundant experiments (i.e., studies of the same tissue samples) were averaged before analysis. Because negative AD values represent more noise than signal, AD values were clipped at a value of 20. For the purpose of determining the breadth of expression, an AD value of 200 was taken as a threshold to consider a gene to be expressed in a given tissue (Su et al. 2002), and the number of tissues where a gene was expressed was counted. For the purpose of determining the level of expression, the sum of the log2 normalized AD values over all tissues was taken. The similarity between gene expression patterns was determined using the Pearson correlation coefficient (r) (Eisen et al. 1998). Before this correlation analysis, unexpressed genes (AD < 200 in all tissues) and nondifferentially expressed genes (maximum AD/minimum AD < 10) were removed from the data matrices. In addition, to control for the effects of the overall level of gene expression, AD values were normalized for each gene by taking the log2 value of the ratio of the tissue-specific AD value to the median AD value for all tissues.
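The preprocessing steps in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' code. All function and constant names are hypothetical, and three readings of the text are assumptions: that "clipped at 20" means raising lower values to 20, that the expression threshold is inclusive (AD ≥ 200), and that expression level is the sum of log2 of the clipped AD values.

```python
import math
from statistics import median

CLIP_FLOOR = 20.0      # assumption: "clipped at 20" means values below 20 are raised to 20
EXPRESSED_AD = 200.0   # AD threshold for calling a gene expressed in a tissue
MIN_FOLD_RANGE = 10.0  # maximum AD / minimum AD needed to count as differentially expressed


def clip(profile):
    """Raise every AD value below the floor to the floor."""
    return [max(ad, CLIP_FLOOR) for ad in profile]


def breadth(profile):
    """Number of tissues in which the gene reaches the expression threshold."""
    return sum(1 for ad in profile if ad >= EXPRESSED_AD)


def level(profile):
    """Sum of log2 AD values over all tissues (expects a clipped profile)."""
    return sum(math.log2(ad) for ad in profile)


def keep_for_correlation(profile):
    """True if the gene is expressed somewhere and varies at least 10-fold."""
    return (max(profile) >= EXPRESSED_AD
            and max(profile) / min(profile) >= MIN_FOLD_RANGE)


def normalize(profile):
    """log2 ratio of each tissue value to the gene's median over all tissues."""
    mid = median(profile)
    return [math.log2(ad / mid) for ad in profile]


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy profiles over four tissues (made-up numbers, for illustration only).
gene_a = clip([-5.0, 100.0, 400.0, 3200.0])
gene_b = clip([10.0, 150.0, 500.0, 2500.0])
if keep_for_correlation(gene_a) and keep_for_correlation(gene_b):
    print(breadth(gene_a), pearson(normalize(gene_a), normalize(gene_b)))
```

Dividing by the per-gene median before correlating removes the overall expression level, so the correlation reflects only the shape of the tissue profile.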

Pairwise correlations between gene expression patterns were used to derive the gene coexpression network. The fit of the network node degree distribution (that is, the frequency distribution of the number of genes f(n) that have n coexpressed genes) to a theoretical distribution (the generalized Pareto distribution here) was done as previously described (Karev et al. 2002; Koonin, Wolf, and Karev 2002). It should be noted that, because the available node degree data span only approximately 2.5 orders of magnitude, a formal possibility remains that the coexpression data are also compatible with models other than the scale-free model adopted here. The clustering coefficient (C) was calculated for each node as the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004).

Human and mouse gene sequences were extracted from the RefSeq database (NCBI, NIH, Bethesda). Human-mouse orthologs were identified as reciprocal best Blast hits between protein sequences as previously described (Jordan, Wolf, and Koonin 2003). Human-mouse orthologous protein sequences were aligned using ClustalW (Higgins, Thompson, and Gibson 1996), and the protein alignments were used to guide alignments of the corresponding nucleotide coding sequences to ensure that they were aligned in frame. The 5′ and 3′ UTR sequences were aligned using ClustalW, and only alignments where the shortest sequence had more than 20 residues and had no more than 50% differences in the number of residues (i.e., length) between aligned sequences were used for further analysis. Synonymous (dS) and nonsynonymous (dN) substitution rates were calculated for alignments of protein-coding sequences using the Nei-Gojobori method (Nei and Gojobori 1986) implemented in the PAML package (Yang 1997). The 5′ and 3′ UTR substitution rates (d) were calculated using the Jukes-Cantor correction for multiple substitutions (Jukes and Cantor 1969). Paralogous genes were identified using an all-against-all BlastP search of the proteins in the coexpression (r ≥ 0.7) network with an e-value threshold of 10⁻⁵ and a coverage cutoff such that more than 50% of the shorter protein sequence had to be included in the high scoring segment pair.
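As an illustration of the Jukes-Cantor correction used for the UTR rates, the sketch below computes d = −(3/4) ln(1 − 4p/3), where p is the observed proportion of differing sites. The function name is hypothetical, and skipping gapped columns and returning infinity at saturation (p ≥ 0.75) are assumptions; the paper does not describe how those cases were handled.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
               if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    # p: observed proportion of sites that differ
    p = sum(1 for a, b in columns if a != b) / len(columns)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# One mismatch in ten aligned sites: the corrected distance slightly exceeds p = 0.1.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT"))
```

The correction matters most for fast-evolving regions, where multiple substitutions at one site make the raw proportion of differences underestimate the true distance.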


Comparisons between substitution rates and various gene expression parameters (expression breadth, expression level, and correlations) were done by sorting the rates or rate differences in the ascending order and then binning the sorted values into 10 equal-sized bins. Average and fractional values of gene expression parameters for each substitution rate bin were compared with respect to the order of the bins, and the Spearman rank correlation (R), along with its statistical significance (P-value at n = 10 for all comparisons), was computed for each comparison (table 1). Regular correlation coefficients (r_XY) and partial correlation coefficients (r_XY.Z) were Fisher Z-transformed, and the significance of the differences between them was calculated using a z-test.
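The binning procedure can be sketched as follows. This is an illustrative reading of the paragraph, not the authors' implementation. The function names are hypothetical, and spreading any remainder genes over the first bins and using average ranks for ties are assumptions.

```python
import math


def bin_means(rates, params, n_bins=10):
    """Sort genes by rate, split them into n_bins nearly equal bins,
    and return the mean of the expression parameter in each bin."""
    if len(rates) != len(params) or len(rates) < n_bins:
        raise ValueError("need paired values and at least one gene per bin")
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    size, extra = divmod(len(order), n_bins)
    means, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)  # first bins absorb the remainder
        members = order[start:end]
        means.append(sum(params[i] for i in members) / len(members))
        start = end
    return means


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Toy data: the expression parameter falls as the substitution rate rises.
rates = [i / 100 for i in range(100)]
params = [100 - i for i in range(100)]
means = bin_means(rates, params)
print(spearman(list(range(1, 11)), means))
```

Because the correlation is computed over the 10 bin averages rather than over individual genes, it measures how consistently the trend follows the bin order, with n = 10 for every comparison.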

Phyletic patterns for human genes were taken from the eukaryotic Clusters of Orthologous Groups of proteins (KOGs) database (Tatusov et al. 2003; Koonin et al. 2004). For each KOG, its phyletic pattern corresponds to the presence/absence state of the KOG member proteins among the seven eukaryotic species included in the database. For each phyletic pattern, a specific evolutionary scenario was determined by mapping the states (presence or absence) and events (gain or loss) to the species tree of the seven organisms in KOGs using the Dollo parsimony method (Farris 1977). Pairs of scenarios were then compared to derive values of the phyletic similarity measure. This was done by scoring the combinations of states and events on each branch of the species tree and summing across all branches. For each branch, the specific combination of states and events for a pair of scenarios was considered with respect to the branch length and the branch-specific propensities for gene gain and loss. The branch lengths (MYR) on the species tree were taken as described previously (Hedges et al. 2001; Krylov et al. 2003). The branch-specific relative propensities for gene gain ($P^G_i$) and gene loss ($P^L_i$) were calculated using the total number of gene gains and losses mapped to each branch (Koonin et al. 2004) according to the following formula:

$$P^G_i = \frac{G_i \sum_j N_j}{N_i \sum_j G_j}, \qquad P^L_i = \frac{L_i \sum_j N_j}{N_i \sum_j L_j}$$

where $N_i$, $G_i$, and $L_i$ are the numbers of present genes, gains, and losses mapped to the ith branch. The scoring schemes for all possible combinations of states and events are shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable4.doc. The sum of scores across all branch lengths was divided by the sum of the respective normalization scores (see table 4 in Supplementary Material online) to provide the pattern similarity scores (range from −1 to 1).
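The relative propensity for a branch is that branch's per-gene event rate divided by the event rate pooled over all branches. A small sketch of this calculation follows. The function name and the toy counts are hypothetical, and the sketch assumes every branch has at least one gene present and that at least one event occurs on the tree.

```python
def relative_propensities(events, present):
    """Relative propensity per branch.

    events[i]  : number of gains (or losses) mapped to branch i
    present[i] : number of genes present on branch i
    Returns (events[i] * sum(present)) / (present[i] * sum(events)) for each i.
    """
    if len(events) != len(present):
        raise ValueError("events and present must have one entry per branch")
    total_events = sum(events)
    total_present = sum(present)
    return [(e * total_present) / (n * total_events)
            for e, n in zip(events, present)]


# Toy tree with two branches carrying the same number of genes:
# the second branch has three times as many gains as the first.
gains = [10, 30]
genes = [100, 100]
print(relative_propensities(gains, genes))
```

A value above 1 marks a branch where the event is more frequent, per gene present, than on the tree as a whole; a value below 1 marks a branch where it is rarer.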

Results and Discussion

Expression Level and Breadth Versus Sequence Divergence

A recent large-scale microarray study that includes quantitative analysis of gene expression patterns for 31 human and 46 mouse tissues yielded detailed profiles of two mammalian transcriptomes (Su et al. 2002). We used these expression data, along with comparative human-mouse gene sequence analysis, to evaluate the relationship between gene sequence evolution and gene expression divergence on a genomic scale. For each human-mouse orthologous gene pair, the number of tissues where it is expressed (expression breadth) and total level of expression were determined (see Materials and Methods). These values were compared to several measures of human-mouse orthologous gene sequence divergence: the synonymous (dS) and nonsynonymous (dN) protein-coding sequence substitution rates, as well as the substitution rates (d) for the 5′ and 3′ untranslated regions (UTRs). As reported previously (Duret and Mouchiroud 2000; Pal, Papp, and Hurst 2001; Zhang and Li 2003; Zhang and Li 2004), genes that are more widely expressed and genes that are more highly expressed are more evolutionarily conserved (i.e., evolve more slowly) than genes with narrower and lower overall levels of expression (fig. 1 and table 1, and see supplementary figure 1). The negative correlation between expression levels and substitution rates was more pronounced for dN than for dS (fig. 1 and table 1, and see supplementary figure 1). This suggests that relatively slow evolution of highly expressed genes depends more on the strict functional constraints on the protein structure than on adaptation of codon usage for expression level. The connection between gene expression and evolutionary rate was far stronger for the 3′ UTRs than for the 5′ UTRs (fig. 1 and table 1, and see supplementary figure 1). This seems to indicate that, on average, 3′ UTRs contain more cis-regulatory sites that are functionally constrained in highly expressed genes than do 5′ UTRs. Overall, the 5′ and 3′ UTRs have similar rates of evolution; the median of the ratio d(5′ UTR)/d(3′ UTR) is approximately 1.12 (see supplementary figure 2a). Furthermore, both 3′ UTRs and 5′ UTRs were found to evolve somewhat slower, on average, than the synonymous positions (see supplementary figures 2b and c). Thus, 5′ UTRs appear to evolve under functional constraints that are nearly as strong as those that affect the 3′ UTRs and even stronger than those for the synonymous positions of the coding region. However, for the 5′ UTRs, these constraints appear to be unrelated to expression breadth and level, as measured here.

Human Gene Coexpression Network

FIG. 1. The dependence between expression breadth, expression level, and substitution rates of human genes. (a–d) Average (± standard error) human gene expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) human gene expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates. Panels are plotted against dS, dN, 3′ UTR d, and 5′ UTR d.

Tissue-specific expression patterns of human genes were further compared to identify coexpressed genes. For each differentially expressed human gene, the normalized expression levels in each of the 31 tissues were used to construct a vector, which was compared with similarly derived vectors of other human genes using the Pearson correlation coefficient (r) (Eisen et al. 1998). Pairs of genes with high r-values are considered to be coexpressed. Values of r were determined for all pairs of human genes. These data were used to infer a network of coexpressed genes where the genes are nodes that are connected by an edge if they share an r-value greater than or equal to a specified threshold. This was done using a series of r-value thresholds (0.9 to 0.4), and the topological properties of the resulting series of networks were investigated by analyzing their node degree distributions, that is, the frequency distributions of the number of genes f(n) that have n coexpressed genes. As the r-value threshold increases, the number of edges in the network decreases, and the node degree distribution seems to tend to a power law distribution (see supplementary figure 3). Node degree distributions and graphic representations of the corresponding network topologies for r ≥ 0.9, 0.8, and 0.7 are shown in figure 2. Because the distribution for r ≥ 0.7 shows a good fit to a distribution with a power law tail while still retaining enough data for meaningful statistical analysis (fig. 2 and see supplementary figure 3), the threshold of 0.7 was chosen for further analysis. A list of coexpressed gene pairs at r ≥ 0.7 is shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable1.tab.
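The network construction by thresholding pairwise correlations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the pairwise r-values are assumed to be precomputed and passed in as a dictionary, and the gene labels and r-values in the example are made up.

```python
from collections import Counter


def build_network(pair_r, threshold):
    """Adjacency sets for genes linked by an edge when r >= threshold.

    pair_r maps a (gene1, gene2) tuple to the Pearson r of their profiles.
    Genes with no qualifying partner do not appear in the network.
    """
    adjacency = {}
    for (gene1, gene2), r in pair_r.items():
        if r >= threshold:
            adjacency.setdefault(gene1, set()).add(gene2)
            adjacency.setdefault(gene2, set()).add(gene1)
    return adjacency


def degree_distribution(adjacency):
    """f(n): the number of genes that have exactly n coexpressed partners."""
    return Counter(len(partners) for partners in adjacency.values())


# Made-up r-values for four hypothetical genes.
pair_r = {("A", "B"): 0.95, ("A", "C"): 0.75, ("B", "C"): 0.50, ("C", "D"): 0.72}
for threshold in (0.9, 0.8, 0.7):
    network = build_network(pair_r, threshold)
    print(threshold, dict(degree_distribution(network)))
```

Raising the threshold can only remove edges, which is why the networks for higher r-values are sparser subsets of the network at r ≥ 0.7.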

The node degree distribution for r ≥ 0.7 displays a good fit to a generalized Pareto distribution, where f(n) ~ (n + 13.4)^−1.17 (fig. 3a). This distribution has a power law tail, which implies asymptotically scale-free properties (Barabasi and Albert 1999). Such scale-free networks have no single characteristic node degree and are dominated by a small number of highly connected hubs (Barabasi and Albert 1999), in this case, genes that are coexpressed with many other genes. Similar scale-free properties have been previously detected for node degree distributions of several other gene expression–related parameters. For instance, gene expression levels (Hoyle et al. 2002; Kuznetsov, Knott, and Bonner 2002; Luscombe et al. 2002), as well as changes in gene expression level (Ueda et al. 2004), have been shown to follow power law distributions. Coexpression networks derived with different techniques for several cancers (Agrawal 2002) and for yeast, plants, and animals (Bhan, Galas, and Dewey 2002; Bergmann, Ihmels, and Barkai 2004) also show scale-free properties. Finally, a number of other biological and other evolving networks, including protein-protein interactions, metabolic networks, social contacts networks, and the Internet (Barabasi and Albert 1999; Jeong et al. 2000, 2001; Ravasz et al. 2002; Barabasi and Oltvai 2004), have node degree distributions that follow a power law and, thus, imply scale-free properties. The principal mode of evolution of scale-free networks is thought to be preferential attachment of new nodes to those that are already highly connected (i.e., "the rich get richer" or "the fit get fitter" model [Barabasi and Albert 1999]).
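The statement that a generalized Pareto form has a power law tail can be made explicit with a short derivation. This is a sketch in generic notation: the offset a, the exponent γ, and the constant c are placeholder symbols introduced here for illustration, not the fitted values.

```latex
\begin{align*}
f(n) &\propto (n + a)^{-\gamma}, \qquad a > 0,\ \gamma > 0 \\
\log f(n) &= c - \gamma \log (n + a) \\
          &= c - \gamma \log n - \gamma \log\!\left(1 + \frac{a}{n}\right) \\
n \gg a:\quad \log f(n) &\approx c - \gamma \log n
  \quad \text{(a straight line of slope } -\gamma \text{ on a log-log plot)} \\
n \ll a:\quad \log f(n) &\approx c - \gamma \log a
  \quad \text{(approximately constant)}
\end{align*}
```

On a log-log plot, such a distribution therefore flattens at low node degree and approaches a straight line only for degrees well above the offset.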

Obviously, there is a transitive property to the correlation coefficients that are used to measure coexpression and connect nodes in the coexpression network. If the tissue-specific expression pattern of gene A is correlated with that of gene B, and the pattern of gene B is correlated with that of gene C, then one should expect expression of gene A to be correlated with that of gene C. However, the level of correlation between A and C is an open question because some groups of genes can be tightly coregulated, whereas others are only loosely coexpressed.

FIG. 2. The human gene coexpression network. Node degree distributions and network topologies are shown for networks where coexpressed genes (nodes) are linked if (a) r ≥ 0.9, (b) r ≥ 0.8, and (c) r ≥ 0.7.


To further evaluate the topological properties of the human gene coexpression network, the clustering of network nodes was analyzed. The clustering coefficient (C) of a given node is the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004). The average C for the gene coexpression network is 0.452. This indicates a high degree of clustering that is typical of many scale-free networks (Barabasi and Oltvai 2004); however, the fact that C < 1 indicates only limited transitivity in the gene coexpression network analyzed here. The shape of the dependence of C on the node degree is thought to be indicative of the mode of network growth (Barabasi and Oltvai 2004). Furthermore, a plot of the clustering coefficient, C(n), versus node degree (n) shows the absence of any clear trend between the two (fig. 3b). C(n) ~ constant, as seen here, is consistent with network growth by simple preferential attachment of nodes rather than growth by the hierarchical addition of modules, which is thought to be characterized by C(n) ~ n⁻¹ (Barabasi and Oltvai 2004).
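The clustering coefficient defined above, and its average by node degree, C(n), can be computed as in the following sketch. The function names and the toy network are hypothetical, and assigning C = 0 to nodes with fewer than two neighbors is an assumption, since the paper does not state how such nodes were treated.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Actual links among a node's neighbors divided by the possible links."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined cases are scored as 0
    actual = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * actual / (k * (k - 1))


def average_clustering(adjacency):
    """Mean clustering coefficient over all nodes in the network."""
    return sum(clustering_coefficient(adjacency, n) for n in adjacency) / len(adjacency)


def clustering_by_degree(adjacency):
    """C(n): mean clustering coefficient of the nodes that have degree n."""
    groups = {}
    for node, neighbors in adjacency.items():
        groups.setdefault(len(neighbors), []).append(
            clustering_coefficient(adjacency, node))
    return {degree: sum(vals) / len(vals) for degree, vals in groups.items()}


# Toy network: a triangle A-B-C with one extra gene D attached to A.
toy = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
print(average_clustering(toy))
print(clustering_by_degree(toy))
```

Plotting the values returned by `clustering_by_degree` against degree on log-log axes is one way to distinguish a roughly flat C(n) from the inverse dependence expected for hierarchical modular growth.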

The human gene coexpression network was examined to assess the functional relationships between coexpressed genes. Pairs of coexpressed genes were functionally classified using the eukaryotic KOG database (Tatusov et al. 2003) and the Gene Ontology database (Ashburner et al. 2000). Using both of these approaches, approximately 17.5% of coexpressed gene pairs were found to encode proteins with the same functional classification, a significant (P < 0.0001) but relatively small excess over the random expectation (13%). However, far more striking nonrandomness was observed when the coexpressed genes were examined for co-occurrence of different functions: certain combinations of seemingly related functions were strongly preferred (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable2.tab). For example, signal transduction genes are coexpressed with those involved in secretion, transcription, and posttranslational protein modification much more often than expected by chance. This is likely to reflect coexpression of genes coding for proteins that function together in biochemical and signaling pathways.

A number of highly connected individual clusters from the human gene coexpression network were further examined with respect to the expression patterns and functions of their member genes. The majority of these clusters are made up of genes with narrow, if not exclusive, tissue-specific expression patterns (see supplementary figure 4). Examples of the topologies of two of these clusters, along with their tissue-specific expression patterns, are shown in figure 4. The experimentally determined and predicted functions for the member genes of these two clusters are generally consistent with their expression patterns and indicate their involvement in pancreatic and testis-related physiological functions, respectively (table 2).

Conservation and Coevolution of Coexpressed Human Genes

For each human gene with an identifiable mouse ortholog, the number of other human genes that it is coexpressed with at r ≥ 0.7 (ftpftpncbinihgovpubkooninJordanMBE-04-0138R1SupplementaryTable3tab) was compared with its human-mouse substitution rate (fig. 5 and table 1). Genes that have a higher number of coexpressed genes (the node degree in the transcription coexpression network) are substantially more conserved than genes with fewer coexpressed partners. Qualitatively identical results are seen when the mouse expression data are used to compare the number of coexpressed mouse genes with human-mouse evolutionary rates (see supplementary figure 5a). As with the relationship between expression level and evolutionary rate, the negative correlation between the number of coexpressed genes and evolutionary rates was strongest for dN (fig. 5). Thus, the protein products of those genes that are hubs of the transcription coexpression network are subject to comparatively high levels of functional constraint, which could reflect their essential roles in cellular processes (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003), multifunctionality, and/or involvement in a large number of protein-protein interactions (Fraser et al. 2002; Jordan, Wolf, and Koonin 2003). The negative correlation between dS and d(3' UTR) and the number of coexpressed genes (fig. 5) is likely to result from purifying selection maintaining tight translational regulation of genes that are highly connected in the coexpression network. In contrast, the correlation between d(5' UTR) and the number of coexpressed genes was not significant (fig. 5 and table 1).
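The comparison described above, in which genes are grouped into 10 ascending bins of substitution rate and the average node degree of each bin is correlated with the bin rate by Spearman rank correlation (table 1), can be sketched in Python as follows. This is only an illustration: equal-sized bins are an assumption, the function names are hypothetical, and the input is assumed to contain at least as many genes as bins.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bin_means(rates, degrees, n_bins=10):
    """Sort genes by rate, split into n_bins near-equal bins, average each."""
    pairs = sorted(zip(rates, degrees))
    size = len(pairs) / n_bins
    mean_rate, mean_degree = [], []
    for b in range(n_bins):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        mean_rate.append(sum(r for r, _ in chunk) / len(chunk))
        mean_degree.append(sum(d for _, d in chunk) / len(chunk))
    return mean_rate, mean_degree
```

Calling `spearman(*bin_means(rates, degrees))` on per-gene substitution rates and node degrees yields a coefficient over n = 10 bin averages, which is the form of the R values reported in table 1.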

[Figure 3 plot. Panel a: frequency (y-axis, logarithmic, 1 to 1000) versus node degree (x-axis, logarithmic, 1 to 1000), with the in-plot annotations "f(n) = 622 (n + 134)-117" and "P(χ2) ~ 021" reproduced as extracted. Panel b: clustering coefficient (y-axis, 0 to 0.8) versus node degree (x-axis, logarithmic, 1 to 1000).]

FIG. 3. Statistical properties of the human gene coexpression network. (a) Node degree distribution. Number of genes f(n) that have exactly n coexpressed genes. Equation and line for the best fitting distribution are shown. Probability associated with the χ² test for the goodness of fit of the observed data to the theoretical generalized Pareto distribution is shown. (b) Clustering coefficient (y-axis) plotted against the node degree (x-axis). Average clustering coefficients and standard deviations of the averages for node degree bins are shown.

Downloaded from https://academic.oup.com/mbe/article/21/11/2058/1148025 by guest on 06 December 2021

Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS the latter correlation was substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation r_XY.Z, where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes (dN, dS, and d(3' UTR); table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, r_XY.Z is not significantly smaller than r_XY (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not based on any correlation between the number of coexpressed genes and the expression level.
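For reference, the standard first-order partial correlation can be computed from the three pairwise coefficients as sketched below. The text does not spell out the exact formula that was applied, so the use of this textbook form is an assumption, and the numerical inputs are hypothetical values chosen only to show the call.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z.

    r_XY.Z = (r_XY - r_XZ * r_YZ) / sqrt((1 - r_XZ^2) * (1 - r_YZ^2))
    """
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical coefficients: X = number of coexpressed genes,
# Y = substitution rate, Z = expression level.
r_xy, r_xz, r_yz = -0.6, 0.3, -0.2
print(partial_correlation(r_xy, r_xz, r_yz))
```

When r_XY.Z stays close to r_XY, as reported here, the association between X and Y cannot be attributed to their shared dependence on Z.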

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5' and 3' UTRs also negatively correlated with pairwise r-values between human gene expression patterns, although the effect, statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3' UTRs, suggesting that this region is more functionally relevant than the 5' UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data are used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).
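The pairwise comparison described above can be sketched as follows: for every gene pair, the absolute difference in substitution rate and the Pearson correlation of the two expression profiles are computed, pairs are sorted into ascending bins of rate difference, and the fraction of pairs with r ≥ 0.7 is reported per bin. Equal-sized bins, the function names, and the toy data are assumptions made for illustration; profiles are assumed to be non-constant so that r is defined.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def coexpressed_fraction_by_rate_difference(profiles, rates,
                                            n_bins=10, threshold=0.7):
    """Fraction of gene pairs with r >= threshold per rate-difference bin."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        diff = abs(rates[a] - rates[b])
        coexpressed = pearson(profiles[a], profiles[b]) >= threshold
        pairs.append((diff, coexpressed))
    pairs.sort(key=lambda p: p[0])
    size = len(pairs) / n_bins
    fractions = []
    for i in range(n_bins):
        chunk = pairs[round(i * size):round((i + 1) * size)]
        fractions.append(sum(flag for _, flag in chunk) / len(chunk))
    return fractions


# Hypothetical expression profiles (four tissues) and dN values.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
}
dn = {"g1": 0.01, "g2": 0.02, "g3": 0.50, "g4": 0.52}

print(coexpressed_fraction_by_rate_difference(profiles, dn, n_bins=2))
```

In this toy example the coexpressed pairs fall in the bin of small dN differences, which is the qualitative pattern reported in figure 6.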

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.


[Figure 5 plot. Four panels (dN, dS, 3' UTR d, 5' UTR d); y-axis: node degree (0 to 60); x-axis: 10 ascending substitution rate bins.]

FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown.

[Figure 6 plot. Four panels, one per substitution rate measure; y-axis: fraction of pairs with r ≥ 0.7 (0.008 to 0.02); x-axis: 10 ascending bins of pairwise substitution rate differences.]

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10^-10). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures, dN, dS, d(5' UTR), and d(3' UTR), of human-mouse sequence divergence was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

[Figure 7 plot. y-axis: fraction of pairs with r ≥ 0.7 (0.008 to 0.02); x-axis: 10 ascending bins of phyletic similarity.]

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods).

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

Comparison | R (a) | P (b) | n (c)
dN vs. expression breadth | -1.000 | 5.5e-7 | 3906
dS vs. expression breadth | -0.976 | 2.1e-5 | 3906
5' UTR d vs. expression breadth | -0.539 | ns | 1121
3' UTR d vs. expression breadth | -0.964 | 4.9e-5 | 1438
dN vs. expression level | -1.000 | 5.5e-7 | 3906
dS vs. expression level | -0.952 | 1.1e-4 | 3906
5' UTR d vs. expression level | -0.624 | ns | 1121
3' UTR d vs. expression level | -0.964 | 4.9e-5 | 1438
dN vs. number coexpressed genes | -0.964 | 4.9e-5 | 2307
dS vs. number coexpressed genes | -0.746 | 1.7e-2 | 2307
5' UTR d vs. number coexpressed genes | -0.588 | ns | 1119
3' UTR d vs. number coexpressed genes | -0.818 | 5.8e-3 | 1436
dN difference vs. fraction r ≥ 0.7 | -0.976 | 2.1e-5 | 2659971
dS difference vs. fraction r ≥ 0.7 | -0.249 | ns | 2659971
5' UTR d difference vs. fraction r ≥ 0.7 | -0.812 | 5.8e-3 | 625521
3' UTR d difference vs. fraction r ≥ 0.7 | -0.867 | 2.2e-3 | 1030330
dN vs. expression divergence | 0.139 | ns | 551
dS vs. expression divergence | -0.079 | ns | 551
5' UTR d vs. expression divergence | 0.164 | ns | 273
3' UTR d vs. expression divergence | -0.552 | ns | 382

(a) Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3-5).
(b) Level of significance (n = 10) associated with R values. ns = nonsignificant (i.e., P > 0.05).
(c) The total number of comparisons that were binned.

[Figure 8 plot. y-axis: cumulative frequency (0 to 1); x-axis: r (-1 to 1).]

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons.


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.
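The measure of regulatory divergence used in this section, the correlation between the expression vectors of orthologous genes across shared tissues, can be sketched as follows. The gene identifiers and expression values are hypothetical, four tissues stand in for the 19 used in the analysis, and taking every nonorthologous human-mouse pair as the background is an assumption about how the comparison set was formed.

```python
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ortholog_correlations(human, mouse, orthologs):
    """r between the tissue profiles of each human-mouse ortholog pair."""
    return [pearson(human[h], mouse[m]) for h, m in orthologs]


def background_correlations(human, mouse, orthologs):
    """r for every human-mouse gene pair that is not an ortholog pair."""
    known = set(orthologs)
    return [pearson(human[h], mouse[m])
            for h in sorted(human) for m in sorted(mouse)
            if (h, m) not in known]


# Hypothetical relative expression levels across the same four tissues.
human = {"H1": [1, 2, 3, 4], "H2": [4, 1, 3, 2]}
mouse = {"M1": [2, 3, 5, 9], "M2": [1, 4, 2, 3]}
orthologs = [("H1", "M1"), ("H2", "M2")]

print("ortholog median r:",
      median(ortholog_correlations(human, mouse, orthologs)))
print("background median r:",
      median(background_correlations(human, mouse, orthologs)))
```

Binning the ortholog r-values by dN, dS, or UTR divergence, in the same way as for node degree, would give the kind of comparison summarized in figure 9.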

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10^-10). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2. Examples of Two Human Gene Coexpression Network Clusters

Locus Link ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function

A Pancreas-Specific Gene Cluster
1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion (PMID 7768362)
148223/NP_689695 | | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain-containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor-2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein-305 | Predicted transcription regulator

A Testis-Specific Gene Cluster
6847/Q15431 | SYCP1 | synaptonemal complex protein-1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein-2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein-4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor-2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein-1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor-1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide-3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation (PMID 6231179)
7288/O00295 | TULP2 | tubby-like protein-2 | Retina-specific and testis-specific heterotrimeric G-protein-responsive intracellular signaling factor
51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains-1 | Polycomb group transcription regulator
114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase

(a) Locus Link identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html
(b) Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/
(c) Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and 3' UTR but not for the 5' UTR. Thus, the constraints on 5' UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short degenerate transcription factor-binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggest rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

[Figure 9 plot. Four panels (dN, dS, 3' UTR d, 5' UTR d); y-axis: correlation (0.1 to 0.6); x-axis: 10 ascending substitution rate bins.]

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates.


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor-binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor-binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth, level, and substitution rates of mouse genes. (a-d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e-h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5' UTR)/d(3' UTR). In 43% of genes d(5' UTR) d(3' UTR), whereas in 57% d(3' UTR) d(5' UTR). (b) d(5' UTR)/dS. (c) d(3' UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004


Downloaded from https://academic.oup.com/mbe/article/21/11/2058/1148025 by guest on 06 December 2021


2000, 2001; Luscombe et al. 2002; Ravasz et al. 2002). Studies of the statistical properties of the topologies of such networks suggest specific mechanisms that govern their evolution. In particular, many biological networks show scale-free topological properties that can be explained by a model of network growth via preferential attachment of new nodes to existing nodes that are already highly connected (Barabasi and Albert 1999). At the genomic level, gene duplication is thought to underlie the phenomenon of preferential attachment (Rzhetsky and Gomez 2001; Bhan, Galas, and Dewey 2002; Barabasi and Oltvai 2004). Existing highly connected nodes (i.e., genes or proteins) are more likely, simply by virtue of their large number of connections, to be linked to nodes that are duplicated. Because the duplicated nodes are expected to maintain the same links as the ancestral singleton, the connectivity of a highly connected node will increase with duplication (Barabasi and Oltvai 2004). This process alone can lead to network growth by preferential attachment. The ubiquity of scale-free network topological patterns suggests that network growth by preferential attachment, via mechanisms such as gene duplication, is a fundamental and conserved evolutionary process.

In this work, we attempted to integrate the two perspectives of natural selection and physical self-organization to analyze the evolution of gene sequence and expression patterns. Comparison of human and mouse genome sequence data was combined with the analysis of gene expression profiles that were derived from microarray experiments on a number of tissues in both species. Human gene expression profiles were used to reconstruct a network of coexpressed genes, and we demonstrate the effect of the network topology on the strength of natural selection, as well as an unexpected relationship between human-mouse gene sequence and gene expression divergence. These results underscore the influence of expression network self-organization on gene evolution and suggest that distinct mechanisms are responsible for the evolution of expression patterns and gene sequences.

Materials and Methods

Human and mouse gene expression levels were taken from a recently published series of Affymetrix microarray experiments (Su et al. 2002) and were retrieved from the Gene Expression Omnibus database at the National Center for Biotechnology Information (NCBI). The two data set flat files, GDS181.soft (human) and GDS182.soft (mouse), were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft/. Affymetrix probe identifiers were mapped to individual loci in the human and mouse genomes using the LocusLink database (NCBI, NIH, Bethesda). A total of 7383 human and 6724 mouse loci corresponded to the probe identifiers. The results of microarray experiments on cancerous tissues were removed to yield profiles of normal mammalian transcriptomes. This left data from a total of 63 human and 89 mouse microarray experiments that were subsequently analyzed. Levels of expression, recorded as average difference (AD) values, for redundant experiments (i.e., studies of the same tissue samples) were averaged before analysis. Because negative AD values represent more noise than signal, AD values were clipped at a value of 20. For the purpose of determining the breadth of expression, an AD value of 200 was taken as a threshold to consider a gene to be expressed in a given tissue (Su et al. 2002), and the number of tissues where a gene was expressed was counted. For the purpose of determining the level of expression, the sum of the log2 normalized AD values over all tissues was taken. The similarity between gene expression patterns was determined using the Pearson correlation coefficient (r) (Eisen et al. 1998). Before this correlation analysis, unexpressed genes (AD < 200 in all tissues) and nondifferentially expressed genes (maximum AD/minimum AD < 10) were removed from the data matrices. In addition, to control for the effects of the overall level of gene expression, AD values were normalized for each gene by taking the log2 value of the ratio of the tissue-specific AD value to the median AD value for all tissues.
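The filtering and normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the input is assumed to be a genes × tissues NumPy array of averaged AD values, and the reading of the differential-expression filter as a maximum/minimum ratio below 10 is an assumption about the garbled source.

```python
import numpy as np

def preprocess_expression(ad, clip=20.0, expressed=200.0, min_fold=10.0):
    """Filter and normalize a genes x tissues matrix of AD values.

    Returns the log2(AD / per-gene median AD) matrix for retained genes
    and a boolean mask marking which input genes were retained.
    """
    # Clip low and negative AD values, which are mostly noise.
    ad = np.maximum(np.asarray(ad, dtype=float), clip)
    # Keep genes that reach the expression threshold in at least one tissue.
    is_expressed = (ad >= expressed).any(axis=1)
    # Keep genes whose max/min ratio across tissues is at least min_fold
    # (assumed reading of the "nondifferentially expressed" filter).
    is_differential = ad.max(axis=1) / ad.min(axis=1) >= min_fold
    keep = is_expressed & is_differential
    kept = ad[keep]
    # Normalize each gene by its median across tissues, on a log2 scale.
    log_ratio = np.log2(kept / np.median(kept, axis=1, keepdims=True))
    return log_ratio, keep

def expression_breadth(ad, expressed=200.0):
    """Number of tissues in which each gene meets the expression threshold."""
    return (np.asarray(ad, dtype=float) >= expressed).sum(axis=1)
```

Clipping before taking ratios keeps the log2 transform defined for every retained gene, since no value can fall to zero or below.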

Pairwise correlations between gene expression patterns were used to derive the gene coexpression network. The fit of the network node degree distribution (that is, the frequency distribution of the number of genes, f(n), that have n coexpressed genes) to a theoretical distribution (the generalized Pareto distribution here) was done as previously described (Karev et al. 2002; Koonin, Wolf, and Karev 2002). It should be noted that, because the available node degree data span only approximately 2.5 orders of magnitude, a formal possibility remains that the coexpression data are also compatible with models other than the scale-free model adopted here. The clustering coefficient (C) was calculated for each node as the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004).
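As an illustration of the network construction and the two topological statistics defined above, the following sketch builds an adjacency matrix from Pearson correlations and computes node degrees, the node degree distribution f(n), and per-node clustering coefficients. All function names are hypothetical, and the dense matrix approach is an assumption chosen for brevity; it is not suited to very large gene sets.

```python
import numpy as np

def coexpression_network(profiles, threshold=0.7):
    """Boolean adjacency matrix linking genes whose Pearson r >= threshold.

    Rows of `profiles` are genes; constant rows give NaN correlations,
    which compare as False and therefore produce no edges.
    """
    r = np.corrcoef(np.asarray(profiles, dtype=float))
    adj = r >= threshold
    np.fill_diagonal(adj, False)  # no self-links
    return adj

def node_degrees(adj):
    """Number of coexpressed partners for each gene."""
    return np.asarray(adj).sum(axis=1)

def degree_distribution(degrees):
    """Map n -> f(n), the number of genes with exactly n partners (n > 0)."""
    degrees = np.asarray(degrees)
    n, f = np.unique(degrees[degrees > 0], return_counts=True)
    return dict(zip(n.tolist(), f.tolist()))

def clustering_coefficients(adj):
    """C for each node: actual links among its neighbors / possible links."""
    a = np.asarray(adj).astype(int)
    k = a.sum(axis=1)
    # diag(A^3) counts closed walks of length 3; each triangle is counted twice.
    links_among_neighbors = np.diag(a @ a @ a) / 2
    possible = k * (k - 1) / 2
    c = np.zeros(len(a))
    has_pairs = possible > 0
    c[has_pairs] = links_among_neighbors[has_pairs] / possible[has_pairs]
    return c
```

Nodes with fewer than two neighbors have no possible neighbor pairs; they are assigned C = 0 here, which is one common convention and an assumption of this sketch.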

Human and mouse gene sequences were extracted from the RefSeq database (NCBI, NIH, Bethesda). Human-mouse orthologs were identified as reciprocal best BLAST hits between protein sequences as previously described (Jordan, Wolf, and Koonin 2003). Human-mouse orthologous protein sequences were aligned using ClustalW (Higgins, Thompson, and Gibson 1996), and the protein alignments were used to guide alignments of the corresponding nucleotide coding sequences to ensure that they were aligned in frame. The 5′ and 3′ UTR sequences were aligned using ClustalW, and only alignments where the shortest sequence had more than 20 residues and where the aligned sequences differed in length by no more than 50% were used for further analysis. Synonymous (dS) and nonsynonymous (dN) substitution rates were calculated for alignments of protein-coding sequences using the Nei-Gojobori method (Nei and Gojobori 1986) implemented in the PAML package (Yang 1997). The 5′ and 3′ UTR substitution rates (d) were calculated using the Jukes-Cantor correction for multiple substitutions (Jukes and Cantor 1969). Paralogous genes were identified using an all-against-all BLASTP search of the proteins in the coexpression (r ≥ 0.7) network with an e-value threshold of 10^−5 and a coverage cutoff such that more than 50% of the shorter protein sequence had to be included in the high-scoring segment pair.
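The Jukes-Cantor correction used for the UTR rates has the standard closed form d = −(3/4) ln(1 − 4p/3), where p is the observed proportion of differing sites. A minimal sketch follows; the function name and the choice to skip gapped columns are assumptions of this example, not details given in the text.

```python
import math

def jukes_cantor_distance(seq_a, seq_b, gap="-"):
    """Jukes-Cantor corrected distance between two aligned nucleotide sequences.

    Columns containing a gap in either sequence are ignored (an assumption).
    Returns math.inf when the observed difference saturates (p >= 0.75).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    # Observed proportion of sites that differ.
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small p the corrected distance is close to p itself; the correction grows as multiple substitutions at the same site become more likely.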


Comparisons between substitution rates and various gene expression parameters (expression breadth, expression level, and correlations) were done by sorting the rates, or rate differences, in ascending order and then binning the sorted values into 10 equal-sized bins. Average and fractional values of gene expression parameters for each substitution rate bin were compared with respect to the order of the bins, and the Spearman rank correlation (R), along with its statistical significance (P-value at n = 10 for all comparisons), was computed for each comparison (table 1). Regular correlation coefficients (rXY) and partial correlation coefficients (rXY·Z) were Fisher Z-transformed, and the significance of the differences between them was calculated using a z-test.
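The binning procedure can be sketched as below: genes are sorted by substitution rate, split into 10 nearly equal bins, and the bin order is rank-correlated with the bin averages of the expression parameter. The function names are hypothetical, and the Spearman coefficient is computed here as the Pearson correlation of average ranks, which is the standard definition; the significance calculation is omitted.

```python
import numpy as np

def bin_means(rates, values, n_bins=10):
    """Mean of `values` within n_bins ascending, nearly equal-sized bins of `rates`."""
    rates = np.asarray(rates, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(rates, kind="stable")
    return np.array([chunk.mean() for chunk in np.array_split(values[order], n_bins)])

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def binned_spearman(rates, values, n_bins=10):
    """Spearman R between bin order and the per-bin mean of `values`."""
    means = bin_means(rates, values, n_bins)
    return spearman_r(np.arange(n_bins), means)
```

Because the correlation is computed over only 10 bin averages, R can reach −1 even when the gene-level relationship is noisy; it measures monotonic ordering of the bins rather than the strength of the gene-level association.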

Phyletic patterns for human genes were taken from the eukaryotic Clusters of Orthologous Groups of proteins (KOGs) database (Tatusov et al. 2003; Koonin et al. 2004). For each KOG, its phyletic pattern corresponds to the presence/absence state of the KOG member proteins among the seven eukaryotic species included in the database. For each phyletic pattern, a specific evolutionary scenario was determined by mapping the states (presence or absence) and events (gain or loss) to the species tree of the seven organisms in KOGs using the Dollo parsimony method (Farris 1977). Pairs of scenarios were then compared to derive values of the phyletic similarity measure. This was done by scoring the combinations of states and events on each branch of the species tree and summing across all branches. For each branch, the specific combination of states and events for a pair of scenarios was considered with respect to the branch length and the branch-specific propensities for gene gain and loss. The branch lengths (MYR) on the species tree were taken as described previously (Hedges et al. 2001; Krylov et al. 2003). The branch-specific relative propensities for gene gain ($P^{G}_{i}$) and gene loss ($P^{L}_{i}$) were calculated using the total number of gene gains and losses mapped to each branch (Koonin et al. 2004) according to the following formula:

$$P^{G}_{i} = \frac{G_i \sum_j N_j}{N_i \sum_j G_j}, \qquad P^{L}_{i} = \frac{L_i \sum_j N_j}{N_i \sum_j L_j},$$

where $N_i$, $G_i$, and $L_i$ are the numbers of present genes, gains, and losses mapped to the ith branch. The scoring schemes for all possible combinations of states and events are shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138R1/SupplementaryTable4.doc. The sum of scores across all branch lengths was divided by the sum of the respective normalization scores (see table 4 in Supplementary Material online) to provide the pattern similarity scores (range from −1 to 1).
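The propensity formula can be read as the share of all gains (or losses) that fall on branch i, divided by the share of all present genes on that branch. The sketch below assumes that reading of the formula; the function name and the list-based inputs are hypothetical.

```python
def gain_loss_propensities(n_present, gains, losses):
    """Relative gain and loss propensities for each branch.

    n_present, gains, losses: per-branch counts N_i, G_i, L_i.
    Returns (P_G, P_L) with P_G[i] = G_i * sum(N) / (N_i * sum(G)),
    and likewise for losses. A value above 1 means the branch carries
    more gains (or losses) than its share of genes would predict.
    """
    total_n = sum(n_present)
    total_g = sum(gains)
    total_l = sum(losses)
    p_gain = [g * total_n / (n * total_g) for g, n in zip(gains, n_present)]
    p_loss = [l * total_n / (n * total_l) for l, n in zip(losses, n_present)]
    return p_gain, p_loss
```

For example, with two branches holding 100 and 300 genes and 10 gains each, the first branch has a gain propensity of 2.0, because it holds a quarter of the genes but half of the gains.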

Results and Discussion

Expression Level and Breadth Versus Sequence Divergence

A recent large-scale microarray study that includes quantitative analysis of gene expression patterns for 31 human and 46 mouse tissues yielded detailed profiles of two mammalian transcriptomes (Su et al. 2002). We used these expression data, along with comparative human-mouse gene sequence analysis, to evaluate the relationship between gene sequence evolution and gene expression divergence on a genomic scale. For each human-mouse orthologous gene pair, the number of tissues where it is expressed (expression breadth) and the total level of expression were determined (see Materials and Methods). These values were compared to several measures of human-mouse orthologous gene sequence divergence: the synonymous (dS) and nonsynonymous (dN) protein-coding sequence substitution rates, as well as the substitution rates (d) for the 5′ and 3′ untranslated regions (UTRs). As reported previously (Duret and Mouchiroud 2000; Pal, Papp, and Hurst 2001; Zhang and Li 2003; Zhang and Li 2004), genes that are more widely expressed and genes that are more highly expressed are more evolutionarily conserved (i.e., evolve more slowly) than genes with narrower expression and lower overall levels of expression (fig. 1 and table 1, and see supplementary figure 1). The negative correlation between expression levels and substitution rates was more pronounced for dN than for dS (fig. 1 and table 1, and see supplementary figure 1). This suggests that the relatively slow evolution of highly expressed genes depends more on the strict functional constraints on the protein structure than on adaptation of codon usage for expression level. The connection between gene expression and evolutionary rate was far stronger for the 3′ UTRs than for the 5′ UTRs (fig. 1 and table 1, and see supplementary figure 1). This seems to indicate that, on average, 3′ UTRs contain more cis-regulatory sites that are functionally constrained in highly expressed genes than do 5′ UTRs. Overall, the 5′ and 3′ UTRs have similar rates of evolution; the median of the ratio d(5′ UTR)/d(3′ UTR) is approximately 1.12 (see supplementary figure 2a). Furthermore, both 3′ UTRs and 5′ UTRs were found to evolve somewhat more slowly, on average, than the synonymous positions (see supplementary figures 2b and c). Thus, 5′ UTRs appear to evolve under functional constraints that are nearly as strong as those that affect the 3′ UTRs and even stronger than those for the synonymous positions of the coding region. However, for the 5′ UTRs, these constraints appear to be unrelated to expression breadth and level as measured here.

Human Gene Coexpression Network

Tissue-specific expression patterns of human genes were further compared to identify coexpressed genes. For each differentially expressed human gene, the normalized expression levels in each of the 31 tissues were used to construct a vector, which was compared with similarly derived vectors of other human genes using the Pearson correlation coefficient (r) (Eisen et al. 1998). Pairs of genes with high r-values are considered to be coexpressed. Values of r were determined for all pairs of human genes. These data were used to infer a network of coexpressed genes, where the genes are nodes that are connected by an edge if they share an r-value greater than or equal to a specified threshold. This was done using a series of r-value


[Figure 1: eight panels. y-axes: expression level and expression breadth; x-axes: 10 ascending bins of dN, dS, 3′ UTR d, and 5′ UTR d.]

FIG. 1. The dependence between expression breadth, expression level, and substitution rates of human genes. (a–d) Average (± standard error) human gene expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) human gene expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.


thresholds (0.9–0.4), and the topological properties of the resulting series of networks were investigated by analyzing their node degree distributions, that is, the frequency distributions of the number of genes, f(n), that have n coexpressed genes. As the r-value threshold increases, the number of edges in the network decreases, and the node degree distribution seems to tend to a power law distribution (see supplementary figure 3). Node degree distributions and graphic representations of the corresponding network topologies for r ≥ 0.9, 0.8, and 0.7 are shown in figure 2. Because the distribution for r ≥ 0.7 shows a good fit to a distribution with a power law tail while still retaining enough data for meaningful statistical analysis (fig. 2 and see supplementary figure 3), the threshold of 0.7 was chosen for further analysis. A list of coexpressed gene pairs at r ≥ 0.7 is shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138R1/SupplementaryTable1.tab.

The node degree distribution for r 07 displays agood fit to a generalized Pareto distribution where f(n) rsquo(n 1 134)2117 (fig 3a) This distribution has a power lawtail which implies asymptotically scale-free properties(Barabasi and Albert 1999) Such scale-free networks haveno single characteristic node degree and are dominated bya small number of highly connected hubs (Barabasi andAlbert 1999) in this case genes that are coexpressed withmany other genes Similar scale-free properties have beenpreviously detected for node degree distributions of severalother gene expressionndashrelated parameters For instance

gene expression levels (Hoyle et al. 2002; Kuznetsov, Knott, and Bonner 2002; Luscombe et al. 2002), as well as changes in gene expression level (Ueda et al. 2004), have been shown to follow power law distributions. Coexpression networks derived with different techniques for several cancers (Agrawal 2002) and for yeast, plants, and animals (Bhan, Galas, and Dewey 2002; Bergmann, Ihmels, and Barkai 2004) also show scale-free properties. Finally, a number of other biological and other evolving networks, including protein-protein interactions, metabolic networks, social contacts networks, and the Internet (Barabasi and Albert 1999; Jeong et al. 2000, 2001; Ravasz et al. 2002; Barabasi and Oltvai 2004), have node degree distributions that follow a power law and thus imply scale-free properties. The principal mode of evolution of scale-free networks is thought to be preferential attachment of new nodes to those that are already highly connected (i.e., the "rich get richer" or "fit get fitter" model [Barabasi and Albert 1999]).

Obviously, there is a transitive property to the correlation coefficients that are used to measure coexpression and connect nodes in the coexpression network. If the tissue-specific expression pattern of gene A is correlated with that of gene B, and the pattern of gene B is correlated with that of gene C, then one should expect expression of gene A to be correlated with that of gene C. However, the level of correlation between A and C is an open question because some groups of genes can be tightly coregulated, whereas others are only loosely coexpressed.

FIG. 2. The human gene coexpression network. Node degree distributions and network topologies are shown for networks where coexpressed genes (nodes) are linked if (a) r ≥ 0.9, (b) r ≥ 0.8, and (c) r ≥ 0.7.


To further evaluate the topological properties of the human gene coexpression network, the clustering of network nodes was analyzed. The clustering coefficient (C) of a given node is the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004). The average C for the gene coexpression network is 0.452. This indicates a high degree of clustering that is typical of many scale-free networks (Barabasi and Oltvai 2004); however, the fact that C < 1 indicates only limited transitivity in the gene coexpression network analyzed here. The shape of the dependence of C on the node degree is thought to be indicative of the mode of network growth (Barabasi and Oltvai 2004). A plot of the clustering coefficient, C(n), versus node degree (n) shows the absence of any clear trend between the two (fig. 3b). C(n) ~ constant, as seen here, is consistent with network growth by simple preferential attachment of nodes rather than growth by the hierarchical addition of modules, which is thought to be characterized by C(n) ~ n^−1 (Barabasi and Oltvai 2004).
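The limited transitivity of correlation-based links can be made concrete with a standard bound, given here as an illustration rather than as part of the original analysis. For any three expression profiles A, B, and C, the 3 × 3 matrix of Pearson correlations must be positive semidefinite, which constrains the A–C correlation once the A–B and B–C correlations are fixed:

```latex
\[
r_{AB}\,r_{BC} - \sqrt{\left(1 - r_{AB}^{2}\right)\left(1 - r_{BC}^{2}\right)}
\;\le\; r_{AC} \;\le\;
r_{AB}\,r_{BC} + \sqrt{\left(1 - r_{AB}^{2}\right)\left(1 - r_{BC}^{2}\right)}
\]
% Worked lower bounds:
%   r_AB = r_BC = 0.7:  0.49 - 0.51 = -0.02
%   r_AB = r_BC = 0.9:  0.81 - 0.19 =  0.62
```

With both links at a correlation of exactly 0.7, genes A and C can be essentially uncorrelated, whereas links at 0.9 guarantee an A–C correlation of at least 0.62. An average clustering coefficient well below 1 is therefore compatible with a network whose edges are defined by a correlation cutoff.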

The human gene coexpression network was examined to assess the functional relationships between coexpressed genes. Pairs of coexpressed genes were functionally classified using the eukaryotic KOG database (Tatusov et al. 2003) and the Gene Ontology database (Ashburner et al. 2000). Using both of these approaches, approximately 17.5% of coexpressed gene pairs were found to encode proteins with the same functional classification, a significant (P < 0.0001) but relatively small excess over the random expectation (13%). However, far more striking nonrandomness was observed when the coexpressed genes were examined for co-occurrence of different functions: certain combinations of seemingly related functions were strongly preferred (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138R1/SupplementaryTable2.tab). For example, signal transduction genes are coexpressed with those involved in secretion, transcription, and posttranslational protein modification much more often than expected by chance. This is likely to reflect coexpression of genes coding for proteins that function together in biochemical and signaling pathways.

A number of highly connected individual clusters from the human gene coexpression network were further examined with respect to the expression patterns and functions of their member genes. The majority of these clusters are made up of genes with narrow, if not exclusive, tissue-specific expression patterns (see supplementary figure 4). Examples of the topologies of two of these clusters, along with their tissue-specific expression patterns, are shown in figure 4. The experimentally determined and predicted functions for the member genes of these two clusters are generally consistent with their expression patterns and indicate their involvement in pancreatic and testis-related physiological functions, respectively (table 2).

Conservation and Coevolution of Coexpressed Human Genes

For each human gene with an identifiable mouse ortholog, the number of other human genes that it is coexpressed with at r ≥ 0.7 (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138R1/SupplementaryTable3.tab) was compared with its human-mouse substitution rate (fig. 5 and table 1). Genes that have a higher number of coexpressed genes (the node degree in the transcription coexpression network) are substantially more conserved than genes with fewer coexpressed partners. Qualitatively identical results are seen when the mouse expression data are used to compare the number of coexpressed mouse genes with human-mouse evolutionary rates (see supplementary figure 5a). As with the relationship between expression level and evolutionary rate, the negative correlation between the number of coexpressed genes and evolutionary rates was strongest for dN (fig. 5). Thus, the protein products of those genes that are hubs of the transcription coexpression network are subject to comparatively high levels of functional constraint, which could reflect their essential roles in cellular processes (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003), multifunctionality, and/or involvement in a large number of protein-protein interactions (Fraser et al. 2002; Jordan, Wolf, and Koonin 2003). The negative correlation between dS and d(3′ UTR) and the number of coexpressed genes (fig. 5) is likely to result from purifying selection maintaining tight translational regulation of genes that are highly connected in the coexpression network. In contrast, the correlation between d(5′ UTR) and the number of coexpressed genes was not significant (fig. 5 and table 1).

[Figure 3: (a) frequency versus node degree on log-log axes (1 to 1000), with the best-fitting distribution and its χ² probability; (b) clustering coefficient versus node degree (1 to 1000, log scale).]

FIG. 3. Statistical properties of the human gene coexpression network. (a) Node degree distribution. Number of genes, f(n), that have exactly n coexpressed genes. Equation and line for the best-fitting distribution are shown. Probability associated with the χ² test for the goodness of fit of the observed data to the theoretical generalized Pareto distribution is shown. (b) Clustering coefficient (y-axis) plotted against the node degree (x-axis). Average clustering coefficients and standard deviations of the averages for node degree bins are shown.


Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS the latter correlation was substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation (rXY·Z), where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes, namely dN, dS, and d(3′ UTR) (table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, rXY·Z is not significantly smaller than rXY (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not based on any correlation between the number of coexpressed genes and the expression level.
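The partial correlation and the comparison of Fisher Z-transformed coefficients can be sketched with the standard textbook formulas. The function names are hypothetical, and the standard errors used (1/(n − 3) for a simple correlation, with one further degree of freedom removed per controlled variable) are the usual large-sample approximations, assumed here because the text does not spell them out. The test also treats the two coefficients as independent, which is a simplification when both come from the same genes.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def fisher_z(r):
    """Fisher Z-transform of a correlation coefficient."""
    return math.atanh(r)

def z_test_difference(r1, n1, r2, n2, controls1=0, controls2=0):
    """Two-sided z-test for the difference between two correlations.

    n1, n2: numbers of observations behind r1 and r2.
    controls1, controls2: number of variables partialled out of each.
    Returns (z, p).
    """
    se = math.sqrt(1.0 / (n1 - 3 - controls1) + 1.0 / (n2 - 3 - controls2))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p
```

A call such as `z_test_difference(r_xy, n, r_xy_given_z, n, controls2=1)` compares a simple correlation with its partial counterpart; a large P indicates that controlling for Z did not change the correlation appreciably.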

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5′ and 3′ UTRs also negatively correlated with pairwise r-values between human gene expression patterns, although the effect, statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3′ UTRs, suggesting that this region is more functionally relevant than the 5′ UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data are used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).
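The pairwise comparison can be sketched as follows: every unordered gene pair is assigned its absolute rate difference, pairs are sorted into 10 ascending bins of that difference, and the fraction of pairs with r ≥ 0.7 is reported for each bin. The function name is hypothetical, and taking the absolute difference as the "rate difference" is an assumption of this sketch.

```python
import numpy as np

def coexpressed_fraction_by_rate_difference(rates, r_matrix, threshold=0.7, n_bins=10):
    """Fraction of gene pairs with r >= threshold, per ascending bin of |rate_i - rate_j|.

    rates: per-gene substitution rates (e.g., dN), length g.
    r_matrix: g x g matrix of pairwise expression correlations.
    """
    rates = np.asarray(rates, dtype=float)
    r_matrix = np.asarray(r_matrix, dtype=float)
    i, j = np.triu_indices(len(rates), k=1)   # each unordered pair once
    rate_difference = np.abs(rates[i] - rates[j])
    coexpressed = r_matrix[i, j] >= threshold
    order = np.argsort(rate_difference, kind="stable")
    return np.array([chunk.mean() for chunk in np.array_split(coexpressed[order], n_bins)])
```

For g genes this enumerates g(g − 1)/2 pairs; with 2307 genes that is 2,659,971 pairs, so memory use grows quickly and a chunked implementation may be preferable for larger gene sets.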

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.


[Figure 5: four panels. y-axis: node degree (0 to 60); x-axes: 10 ascending bins of dN, dS, 3′ UTR d, and 5′ UTR d.]

FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown.

[Figure 6: four panels. y-axis: fraction of pairs with r ≥ 0.7 (0.008 to 0.02); x-axes: 10 ascending bins of pairwise substitution rate differences.]

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10^−10). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures of human-mouse sequence divergence, dN, dS, d(5′ UTR), and d(3′ UTR), was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

[Figure 7: one panel. y-axis: fraction of pairs with r ≥ 0.7 (0.008 to 0.02); x-axis: 10 ascending bins of phyletic similarity.]

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods).

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

Comparison                                  R(a)      P(b)      n(c)
dN vs. expression breadth                   −1.000    5.5e−7    3906
dS vs. expression breadth                   −0.976    2.1e−5    3906
5′ UTR d vs. expression breadth             −0.539    ns        1121
3′ UTR d vs. expression breadth             −0.964    4.9e−5    1438

dN vs. expression level                     −1.000    5.5e−7    3906
dS vs. expression level                     −0.952    1.1e−4    3906
5′ UTR d vs. expression level               −0.624    ns        1121
3′ UTR d vs. expression level               −0.964    4.9e−5    1438

dN vs. number coexpressed genes             −0.964    4.9e−5    2307
dS vs. number coexpressed genes             −0.746    1.7e−2    2307
5′ UTR d vs. number coexpressed genes       −0.588    ns        1119
3′ UTR d vs. number coexpressed genes       −0.818    5.8e−3    1436

dN difference vs. fraction r ≥ 0.7          −0.976    2.1e−5    2659971
dS difference vs. fraction r ≥ 0.7          −0.249    ns        2659971
5′ UTR d difference vs. fraction r ≥ 0.7    −0.812    5.8e−3    625521
3′ UTR d difference vs. fraction r ≥ 0.7    −0.867    2.2e−3    1030330

dN vs. expression divergence                 0.139    ns        551
dS vs. expression divergence                −0.079    ns        551
5′ UTR d vs. expression divergence           0.164    ns        273
3′ UTR d vs. expression divergence          −0.552    ns        382

(a) Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
(b) Level of significance (n = 10) associated with R values; ns = nonsignificant (i.e., P > 0.05).
(c) The total number of comparisons that were binned.

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons. [x-axis: r, from −1 to 1; y-axis: cumulative frequency, from 0 to 1]


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10^−10). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2
Examples of Two Human Gene Coexpression Network Clusters

LocusLink ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function

A Pancreas-Specific Gene Cluster

1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion (PMID 7768362)
148223/NP_689695 | (none listed) | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain–containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | (none listed) | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor–2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein–305 | Predicted transcription regulator

A Testis-Specific Gene Cluster

6847/Q15431 | SYCP1 | synaptonemal complex protein–1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein–2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein–4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor–2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein–1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor–1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide–3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation (PMID 6231179)
7288/O00295 | TULP2 | tubby-like protein–2 | Retina-specific and testis-specific heterotrimeric G-protein–responsive intracellular signaling factor
51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains–1 | Polycomb group transcription regulator
114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase

(a) LocusLink identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
(b) Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/.
(c) Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. First, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and for the 3′ UTR, but not for the 5′ UTR. Thus, the constraints on 5′ UTR evolution appear to be unrelated to expression regulation as analyzed here. Second, we found that, although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggests rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates. [Panels: dN, dS, 5′ UTR d, and 3′ UTR d; y-axis: correlation]


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated: expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth, expression level, and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5′ UTR)/d(3′ UTR). In 43% of genes, d(5′ UTR) < d(3′ UTR), whereas in 57%, d(3′ UTR) < d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004


Comparisons between substitution rates and various gene expression parameters (expression breadth, expression level, and correlations) were done by sorting the rates, or rate differences, in ascending order and then binning the sorted values into 10 equal-sized bins. Average and fractional values of gene expression parameters for each substitution rate bin were compared with respect to the order of the bins, and the Spearman rank correlation (R), along with its statistical significance (P-value at n = 10 for all comparisons), was computed for each comparison (table 1). Regular correlation coefficients (r_XY) and partial correlation coefficients (r_XY·Z) were Fisher Z–transformed, and the significance of the differences between them was calculated using a z-test.
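The binning procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' code: the function names (`binned_means`, `binned_spearman`) are hypothetical, ties are given average ranks, and bins are made as close to equal-sized as integer division allows.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def binned_means(rates, params, n_bins=10):
    """Sort genes by substitution rate, split into n_bins bins, average params per bin."""
    pairs = sorted(zip(rates, params))
    n = len(pairs)
    bins = [pairs[n * b // n_bins: n * (b + 1) // n_bins] for b in range(n_bins)]
    return [mean(p for _, p in chunk) for chunk in bins]


def binned_spearman(rates, params, n_bins=10):
    """Spearman R between bin order (1..n_bins) and the per-bin average parameter."""
    means = binned_means(rates, params, n_bins)
    return spearman(list(range(1, n_bins + 1)), means)
```

Because the correlation is computed over only 10 bin averages, R measures how monotonic the trend is across bins rather than the gene-by-gene scatter, which is why values close to −1 can arise from noisy underlying data.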

Phyletic patterns for human genes were taken from the eukaryotic Clusters of Orthologous Groups of proteins (KOGs) database (Tatusov et al. 2003; Koonin et al. 2004). For each KOG, its phyletic pattern corresponds to the presence/absence state of the KOG member proteins among the seven eukaryotic species included in the database. For each phyletic pattern, a specific evolutionary scenario was determined by mapping the states (presence or absence) and events (gain or loss) to the species tree of the seven organisms in KOGs using the Dollo parsimony method (Farris 1977). Pairs of scenarios were then compared to derive values of the phyletic similarity measure. This was done by scoring the combinations of states and events on each branch of the species tree and summing across all branches. For each branch, the specific combination of states and events for a pair of scenarios was considered with respect to the branch length and the branch-specific propensities for gene gain and loss. The branch lengths (MYR) on the species tree were taken as described previously (Hedges et al. 2001; Krylov et al. 2003). The branch-specific relative propensities for gene gain (P^G_i) and gene loss (P^L_i) were calculated using the total number of gene gains and losses mapped to each branch (Koonin et al. 2004) according to the following formula:

$$P^G_i = \frac{G_i \sum_j N_j}{N_i \sum_j G_j}, \qquad P^L_i = \frac{L_i \sum_j N_j}{N_i \sum_j L_j}$$

where N_i, G_i, and L_i are the numbers of present genes, gains, and losses mapped to the ith branch. The scoring schemes for all possible combinations of states and events are shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138R1/SupplementaryTable4.doc. The sum of scores across all branch lengths was divided by the sum of the respective normalization scores (see table 4 in Supplementary Material online) to provide the pattern similarity scores (range from −1 to 1).
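As a worked illustration of the propensity formula, the sketch below computes the relative gain and loss propensities for each branch from per-branch counts. The function name and the example counts are hypothetical and are not taken from the KOG data; the code only restates the ratio (events per present gene on branch i) divided by (events per present gene over all branches).

```python
def relative_propensities(present, gains, losses):
    """Return (gain propensities, loss propensities), one value per branch.

    present[i], gains[i], and losses[i] play the roles of N_i, G_i, and L_i.
    """
    total_n, total_g, total_l = sum(present), sum(gains), sum(losses)
    pg = [g * total_n / (n * total_g) for n, g in zip(present, gains)]
    pl = [l * total_n / (n * total_l) for n, l in zip(present, losses)]
    return pg, pl


# Hypothetical two-branch example (illustrative numbers only).
pg, pl = relative_propensities(present=[100, 300], gains=[10, 10], losses=[5, 15])
```

A value above 1 means the branch gains (or loses) genes more often per present gene than the tree-wide average; in the example, the first branch has a gain propensity of 2 and both branches have a loss propensity of 1.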

Results and Discussion

Expression Level and Breadth Versus Sequence Divergence

A recent large-scale microarray study that includes quantitative analysis of gene expression patterns for 31 human and 46 mouse tissues yielded detailed profiles of two mammalian transcriptomes (Su et al. 2002). We used these expression data, along with comparative human-mouse gene sequence analysis, to evaluate the relationship between gene sequence evolution and gene expression divergence on a genomic scale. For each human-mouse orthologous gene pair, the number of tissues where it is expressed (expression breadth) and the total level of expression were determined (see Materials and Methods). These values were compared to several measures of human-mouse orthologous gene sequence divergence: the synonymous (dS) and nonsynonymous (dN) protein-coding sequence substitution rates, as well as the substitution rates (d) for the 5′ and 3′ untranslated regions (UTRs). As reported previously (Duret and Mouchiroud 2000; Pal, Papp, and Hurst 2001; Zhang and Li 2003; Zhang and Li 2004), genes that are more widely expressed and genes that are more highly expressed are more evolutionarily conserved (i.e., evolve more slowly) than genes with narrower and lower overall levels of expression (fig. 1 and table 1; see also supplementary figure 1). The negative correlation between expression levels and substitution rates was more pronounced for dN than for dS (fig. 1 and table 1; see also supplementary figure 1). This suggests that the relatively slow evolution of highly expressed genes depends more on the strict functional constraints on the protein structure than on adaptation of codon usage for expression level. The connection between gene expression and evolutionary rate was far stronger for the 3′ UTRs than for the 5′ UTRs (fig. 1 and table 1; see also supplementary figure 1). This seems to indicate that, on average, 3′ UTRs contain more cis-regulatory sites that are functionally constrained in highly expressed genes than do 5′ UTRs. Overall, the 5′ and 3′ UTRs have similar rates of evolution: the median of the ratio d(5′ UTR)/d(3′ UTR) is approximately 1.12 (see supplementary figure 2a). Furthermore, both 3′ UTRs and 5′ UTRs were found to evolve somewhat more slowly, on average, than the synonymous positions (see supplementary figures 2b and c). Thus, 5′ UTRs appear to evolve under functional constraints that are nearly as strong as those that affect the 3′ UTRs and even stronger than those for the synonymous positions of the coding region. However, for the 5′ UTRs, these constraints appear to be unrelated to expression breadth and level as measured here.

Human Gene Coexpression Network

Tissue-specific expression patterns of human genes were further compared to identify coexpressed genes. For each differentially expressed human gene, the normalized expression levels in each of the 31 tissues were used to construct a vector, which was compared with similarly derived vectors of other human genes using the Pearson correlation coefficient (r) (Eisen et al. 1998). Pairs of genes with high r-values are considered to be coexpressed. Values of r were determined for all pairs of human genes. These data were used to infer a network of coexpressed genes, where the genes are nodes that are connected by an edge if they share an r-value greater than or equal to a specified threshold. This was done using a series of r-value


FIG. 1. The dependence between expression breadth, expression level, and substitution rates of human genes. (a–d) Average (± standard error) human gene expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) human gene expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates. [Panels: dN, dS, 3′ UTR d, and 5′ UTR d; y-axes: expression breadth and expression level]


thresholds (0.9–0.4), and the topological properties of the resulting series of networks were investigated by analyzing their node degree distributions, that is, the frequency distributions of the number of genes, f(n), that have n coexpressed genes. As the r-value threshold increases, the number of edges in the network decreases, and the node degree distribution seems to tend to a power law distribution (see supplementary figure 3). Node degree distributions and graphic representations of the corresponding network topologies for r ≥ 0.9, 0.8, and 0.7 are shown in figure 2. Because the distribution for r ≥ 0.7 shows a good fit to a distribution with a power law tail while still retaining enough data for meaningful statistical analysis (fig. 2 and see supplementary figure 3), the threshold of 0.7 was chosen for further analysis. A list of coexpressed gene pairs at r ≥ 0.7 is shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138R1/SupplementaryTable1.tab.
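The network construction described here (link two genes when the Pearson r of their tissue expression vectors meets a threshold, then tabulate f(n)) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the pipeline used in the paper: the function names and the toy profiles are hypothetical, and a gene with a constant profile is simply given r = 0 with every other gene.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson r of two expression vectors; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def coexpression_edges(profiles, threshold=0.7):
    """Return the set of gene pairs whose profiles correlate at r >= threshold.

    profiles maps a gene name to its list of expression levels across tissues.
    """
    edges = set()
    for a, b in combinations(sorted(profiles), 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            edges.add((a, b))
    return edges


def node_degrees(edges):
    """Count the coexpressed partners (node degree) of every linked gene."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def degree_distribution(degrees):
    """f(n): the number of genes that have exactly n coexpressed partners."""
    return Counter(degrees.values())


# Toy profiles over four tissues (hypothetical values).
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, 2, 3, 5],
}
edges = coexpression_edges(profiles, threshold=0.7)
f_n = degree_distribution(node_degrees(edges))
```

Raising `threshold` can only remove edges, which matches the observation that the number of edges falls as the r-value threshold increases. The all-pairs loop is quadratic in the number of genes, so a genome-scale run would normally use a vectorized correlation matrix instead.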

The node degree distribution for r ≥ 0.7 displays a good fit to a generalized Pareto distribution, where f(n) ≈ (n + 13.4)^−1.17 (fig. 3a). This distribution has a power law tail, which implies asymptotically scale-free properties (Barabasi and Albert 1999). Such scale-free networks have no single characteristic node degree and are dominated by a small number of highly connected hubs (Barabasi and Albert 1999), in this case, genes that are coexpressed with many other genes. Similar scale-free properties have been previously detected for node degree distributions of several other gene expression–related parameters. For instance,

gene expression levels (Hoyle et al. 2002; Kuznetsov, Knott, and Bonner 2002; Luscombe et al. 2002), as well as changes in gene expression level (Ueda et al. 2004), have been shown to follow power law distributions. Coexpression networks derived with different techniques for several cancers (Agrawal 2002) and for yeast, plants, and animals (Bhan, Galas, and Dewey 2002; Bergmann, Ihmels, and Barkai 2004) also show scale-free properties. Finally, a number of other biological and other evolving networks, including protein-protein interactions, metabolic networks, social contacts networks, and the Internet (Barabasi and Albert 1999; Jeong et al. 2000, 2001; Ravasz et al. 2002; Barabasi and Oltvai 2004), have node degree distributions that follow a power law and thus imply scale-free properties. The principal mode of evolution of scale-free networks is thought to be preferential attachment of new nodes to those that are already highly connected (i.e., "the rich get richer" or "the fit get fitter" model [Barabasi and Albert 1999]).
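To see why a generalized Pareto form has a power law tail, it may help to write the fitted distribution generically as f(n) ≈ (n + a)^−γ, where a and γ are placeholder symbols introduced here for the fitted shift and exponent. The local slope of the distribution on a log-log plot is then:

```latex
\frac{d \ln f(n)}{d \ln n}
  = n \, \frac{d}{dn}\bigl[-\gamma \ln (n + a)\bigr]
  = -\gamma \, \frac{n}{n + a}
  \;\longrightarrow\; -\gamma \quad \text{as } n \gg a
```

For node degrees much smaller than a the slope is close to zero, so the distribution is nearly flat, whereas for highly connected nodes it approaches the straight line of slope −γ expected for a pure power law. This is the sense in which the scale-free property is asymptotic.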

Obviously, there is a transitive property to the correlation coefficients that are used to measure coexpression and connect nodes in the coexpression network. If the tissue-specific expression pattern of gene A is correlated with that of gene B, and the pattern of gene B is correlated with that of gene C, then one should expect expression of gene A to be correlated with that of gene C. However, the level of correlation between A and C is an open question because some groups of genes can be tightly coregulated, whereas others are only loosely coexpressed.

FIG. 2. The human gene coexpression network. Node degree distributions and network topologies are shown for networks where coexpressed genes (nodes) are linked if (a) r ≥ 0.9, (b) r ≥ 0.8, and (c) r ≥ 0.7.


To further evaluate the topological properties of the human gene coexpression network, the clustering of network nodes was analyzed. The clustering coefficient (C) of a given node is the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004). The average C for the gene coexpression network is 0.452. This indicates a high degree of clustering that is typical of many scale-free networks (Barabasi and Oltvai 2004); however, the fact that C < 1 indicates only limited transitivity in the gene coexpression network analyzed here. The shape of the dependence of C on the node degree is thought to be indicative of the mode of network growth (Barabasi and Oltvai 2004). Furthermore, a plot of the clustering coefficient, C(n), versus node degree (n) shows the absence of any clear trend between the two (fig. 3b). C(n) ≈ constant, as seen here, is consistent with network growth by simple preferential attachment of nodes rather than growth by the hierarchical addition of modules, which is thought to be characterized by C(n) ≈ n^−1 (Barabasi and Oltvai 2004).
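The definition of the clustering coefficient given above translates directly into code. The sketch below is an illustration with hypothetical names and a toy graph, not the implementation used for the analysis; it assumes the network is stored as a dictionary mapping each node to the set of its neighbors, and it assigns C = 0 to nodes with fewer than two neighbors, which is a convention the text does not specify.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Actual links among the neighbors of node divided by possible links."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined cases are counted as 0
    links = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean clustering coefficient over all nodes in the network."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


# Toy network: a triangle (a, b, c) with one extra node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

In the toy network, node a has three neighbors of which only one pair (b, c) is linked, so C(a) = 1/3, while b and c each have C = 1. Plotting C against node degree for a real network is what distinguishes a flat C(n) from the C(n) ≈ n^−1 pattern mentioned in the text.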

The human gene coexpression network was examined to assess the functional relationships between coexpressed genes. Pairs of coexpressed genes were functionally classified using the eukaryotic KOG database (Tatusov et al. 2003) and the Gene Ontology database (Ashburner et al. 2000). Using both of these approaches, approximately 17.5% of coexpressed gene pairs were found to encode proteins with the same functional classification, a significant (P < 0.0001) but relatively small excess over the random expectation (13%). However, far more striking nonrandomness was observed when the coexpressed genes were examined for co-occurrence of different functions: certain combinations of seemingly related functions were strongly preferred (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable2.tab). For example, signal transduction genes are coexpressed with those involved in secretion, transcription, and posttranslational protein modification much more often than expected by chance. This is likely to reflect coexpression of genes coding for proteins that function together in biochemical and signaling pathways.

A number of highly connected individual clusters from the human gene coexpression network were further examined with respect to the expression patterns and functions of their member genes. The majority of these clusters are made up of genes with narrow, if not exclusive, tissue-specific expression patterns (see supplementary figure 4). Examples of the topologies of two of these clusters, along with their tissue-specific expression patterns, are shown in figure 4. The experimentally determined and predicted functions for the member genes of these two clusters are generally consistent with their expression patterns and indicate their involvement in pancreatic and testis-related physiological functions, respectively (table 2).

Conservation and Coevolution of Coexpressed Human Genes

For each human gene with an identifiable mouse ortholog, the number of other human genes that it is coexpressed with at r ≥ 0.7 (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable3.tab) was compared with its human-mouse substitution rate (fig. 5 and table 1). Genes that have a higher number of coexpressed genes (the node degree in the transcription/expression network) are substantially more conserved than genes with fewer coexpressed partners. Qualitatively identical results are seen when the mouse expression data are used to compare the number of coexpressed mouse genes with human-mouse evolutionary rates (see supplementary figure 5a). As with the relationship between expression level and evolutionary rate, the negative correlation between the number of coexpressed genes and evolutionary rates was strongest for dN (fig. 5). Thus, the protein products of those genes that are hubs of the transcription/coexpression network are subject to comparatively high levels of functional constraint, which could reflect their essential roles in cellular processes (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003), multifunctionality, and/or involvement in a large number of protein-protein interactions (Fraser et al. 2002; Jordan, Wolf, and Koonin 2003). The negative correlation between dS and d(3′ UTR) and the number of coexpressed genes (fig. 5) is likely to result from purifying selection maintaining tight translational regulation of genes that are highly connected in the coexpression network. In contrast, the correlation between d(5′ UTR) and the number of coexpressed genes was not significant (fig. 5 and table 1).
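The node degree used in this comparison can be illustrated with a short Python sketch that links two genes when the Pearson correlation of their expression profiles reaches the threshold and then counts links per gene. The four profiles, the gene labels, and the function names are hypothetical; only the r ≥ 0.7 threshold is taken from the text.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def node_degrees(profiles, threshold=0.7):
    """Count, for each gene, the partners with profile correlation >= threshold."""
    degree = {gene: 0 for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            degree[g1] += 1
            degree[g2] += 1
    return degree

# Hypothetical expression profiles across four tissues.
profiles = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [4, 3, 2, 1],
    "D": [1, 3, 2, 4],
}
degree = node_degrees(profiles)
print(degree)
```

The all-pairs loop is quadratic in the number of genes, which is adequate for a sketch but would likely be replaced by a matrix computation for genome-scale data.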

[Figure 3 plots. (a) Frequency (y-axis, log scale) versus node degree (x-axis, log scale), with in-plot annotations as extracted: f(n) = 622 (n + 134)-117 and P(χ2) ~ 021. (b) Clustering coefficient (y-axis) versus node degree (x-axis, log scale).]

FIG. 3. Statistical properties of the human gene coexpression network. (a) Node degree distribution: number of genes, f(n), that have exactly n coexpressed genes. The equation and line for the best-fitting distribution are shown. The probability associated with the χ² test for the goodness of fit of the observed data to the theoretical generalized Pareto distribution is shown. (b) Clustering coefficient (y-axis) plotted against the node degree (x-axis). Average clustering coefficients and standard deviations of the averages for node degree bins are shown.


Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS the latter correlation was substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation, r_XY·Z, where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes, namely dN, dS, and d(3′ UTR) (table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, r_XY·Z is not significantly smaller than r_XY (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not based on any correlation between the number of coexpressed genes and the expression level.
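For reference, the standard first-order partial correlation can be computed from the three pairwise coefficients as sketched below. The text does not spell out the formula that was used, so this is the textbook definition offered as an assumption, and the numerical inputs are hypothetical rather than values from this study.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical coefficients: X = number of coexpressed genes,
# Y = substitution rate, Z = expression level.
r_xy_given_z = partial_correlation(r_xy=-0.5, r_xz=0.3, r_yz=-0.4)
print(round(r_xy_given_z, 3))
```

When Z is uncorrelated with both X and Y, the partial correlation reduces to r_XY, which is the situation the result in the text approximates.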

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5′ and 3′ UTRs also negatively correlated with pairwise r-values between human gene expression patterns, although the effect, statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3′ UTRs, suggesting that this region is more functionally relevant than the 5′ UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data are used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.


[Figure 5 plots. Four panels (dN, dS, 3′ UTR d, 5′ UTR d): node degree (y-axis, 0 to 60) versus 10 ascending bins of substitution rate (x-axis).]

FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown.

[Figure 6 plots. Four panels: fraction of gene pairs with r ≥ 0.7 (y-axis) versus 10 ascending bins of pairwise substitution rate difference (x-axis).]

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence
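One simple way to score the similarity of two phyletic patterns is to treat each as a binary presence/absence vector over the seven species and compute the correlation between the two vectors (the phi coefficient). The measure actually used is defined in Materials and Methods, which is outside this section, so the choice of phi here is an assumption, as are the example patterns and the function name.

```python
import math

def phyletic_similarity(p, q):
    """Phi coefficient between two presence/absence (1/0) patterns.

    Returns None when either pattern is constant (present in all species
    or absent from all), because the coefficient is then undefined.
    """
    n11 = sum(1 for a, b in zip(p, q) if a and b)
    n10 = sum(1 for a, b in zip(p, q) if a and not b)
    n01 = sum(1 for a, b in zip(p, q) if not a and b)
    n00 = sum(1 for a, b in zip(p, q) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical patterns across seven species (1 = ortholog present).
kog_a = [1, 1, 1, 0, 0, 1, 0]
kog_b = [1, 1, 0, 0, 0, 1, 1]
kog_c = [0, 0, 0, 1, 1, 0, 1]
print(phyletic_similarity(kog_a, kog_b))  # partly overlapping patterns
print(phyletic_similarity(kog_a, kog_c))  # complementary patterns
```

Identical patterns score 1 and exactly complementary patterns score -1, a range that matches the span of the phyletic similarity axis in figure 7.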

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10^(-10)). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures, dN, dS, d(5′ UTR), and d(3′ UTR), of human-mouse sequence divergence was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

[Figure 7 plot. Fraction of gene pairs with r ≥ 0.7 (y-axis) versus 10 ascending bins of phyletic similarity (x-axis).]

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods).

Table 1
Evolutionary Rate Versus Gene Expression Comparisons

Comparison | R [a] | P [b] | n [c]
dN vs. expression breadth | -1.000 | 5.5e-7 | 3906
dS vs. expression breadth | -0.976 | 2.1e-5 | 3906
5′ UTR d vs. expression breadth | -0.539 | ns | 1121
3′ UTR d vs. expression breadth | -0.964 | 4.9e-5 | 1438
dN vs. expression level | -1.000 | 5.5e-7 | 3906
dS vs. expression level | -0.952 | 1.1e-4 | 3906
5′ UTR d vs. expression level | -0.624 | ns | 1121
3′ UTR d vs. expression level | -0.964 | 4.9e-5 | 1438
dN vs. number coexpressed genes | -0.964 | 4.9e-5 | 2307
dS vs. number coexpressed genes | -0.746 | 1.7e-2 | 2307
5′ UTR d vs. number coexpressed genes | -0.588 | ns | 1119
3′ UTR d vs. number coexpressed genes | -0.818 | 5.8e-3 | 1436
dN difference vs. fraction r ≥ 0.7 | -0.976 | 2.1e-5 | 2659971
dS difference vs. fraction r ≥ 0.7 | -0.249 | ns | 2659971
5′ UTR d difference vs. fraction r ≥ 0.7 | -0.812 | 5.8e-3 | 625521
3′ UTR d difference vs. fraction r ≥ 0.7 | -0.867 | 2.2e-3 | 1030330
dN vs. expression divergence | 0.139 | ns | 551
dS vs. expression divergence | -0.079 | ns | 551
5′ UTR d vs. expression divergence | 0.164 | ns | 273
3′ UTR d vs. expression divergence | -0.552 | ns | 382

[a] Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
[b] Level of significance (n = 10) associated with R values; ns = nonsignificant (i.e., P > 0.05).
[c] The total number of comparisons that were binned.
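The binned rank correlation described in the table footnotes can be sketched as follows. Genes are sorted by substitution rate, split into 10 ascending bins, the expression parameter is averaged within each bin, and the Spearman coefficient is computed between bin order and bin mean. Equal-sized bins, the absence of tied bin means, the function names, and the synthetic data are assumptions of this example.

```python
import math

def bin_means(rates, values, n_bins=10):
    """Mean of `values` within ascending, near-equal-sized bins of `rates`."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    means = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        chunk = order[start:stop]
        means.append(sum(values[i] for i in chunk) / len(chunk))
    return means

def spearman_vs_bin_order(means):
    """Spearman rho between bin index and bin mean (assumes no tied means)."""
    n = len(means)
    rank = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: means[i]), start=1):
        rank[i] = r
    d2 = sum((rank[i] - (i + 1)) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Synthetic data: the expression parameter falls steadily as the rate rises.
rates = [i / 100 for i in range(1, 101)]
values = [100 - i for i in range(100)]
rho = spearman_vs_bin_order(bin_means(rates, values))

# Exact two-sided P for a perfect monotonic ordering of 10 bins: only 2 of
# the 10! possible rank orderings are perfectly ascending or descending.
p_perfect = 2 / math.factorial(10)
print(rho, p_perfect)
```

Because the correlation is computed over 10 bins rather than over individual genes, a perfectly monotonic trend gives |R| = 1 with an exact two-sided P of 2/10!, about 5.5e-7, which would be consistent with the tabulated value for the R = -1.000 rows.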

[Figure 8 plot. Cumulative frequency (y-axis, 0 to 1) versus r (x-axis, -1 to 1).]

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons.


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10^(-10)). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2
Examples of Two Human Gene Coexpression Network Clusters

LocusLink ID/GenBank accession [a] | Gene symbol [b] | Annotation [c] | Physiological function

A Pancreas-Specific Gene Cluster
1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion; PMID 7768362
148223/NP_689695 | | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain–containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor–2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein–305 | Predicted transcription regulator

A Testis-Specific Gene Cluster
6847/Q15431 | SYCP1 | synaptonemal complex protein–1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein–2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein–4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor–2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein–1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor–1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide–3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation; PMID 6231179
7288/O00295 | TULP2 | tubby-like protein–2 | Retina-specific and testis-specific heterotrimeric G-protein–responsive intracellular signaling factor
51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains–1 | Polycomb group transcription regulator
114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase

[a] LocusLink identification numbers and GenBank accessions from NCBI's LocusLink database, http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
[b] Official gene symbol from the Human Genome Organization (HUGO), http://www.gene.ucl.ac.uk/nomenclature/.
[c] Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and 3′ UTR but not for the 5′ UTR. Thus, the constraints on 5′ UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that, although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggests rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

[Figure 9 plots. Four panels (dN, dS, 3′ UTR d, 5′ UTR d): correlation (y-axis, 0.1 to 0.6) versus 10 ascending bins of substitution rate (x-axis).]

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates.


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated: expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth, level, and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5′ UTR)/d(3′ UTR). In 43% of genes, d(5′ UTR) > d(3′ UTR), whereas in 57%, d(3′ UTR) > d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle DCMRattray R Jupp andABrass 2002Making senseof microarray data distributions Bioinformatics 18576ndash584

Jeong H S P Mason A L Barabasi and Z N Oltvai 2001Lethality and centrality in protein networks Nature 41141ndash42

Jeong H B Tombor R Albert Z N Oltvai and A LBarabasi 2000 The large-scale organization of metabolicnetworks Nature 407651ndash654

Jordan I K I B Rogozin Y I Wolf and E V Koonin 2002Essential genes are more evolutionarily conserved than arenonessential genes in bacteria Genome Res 12962ndash968

Jordan I K Y I Wolf and E V Koonin 2003 No simpledependence between protein evolution rate and the number ofprotein-protein interactions only the most prolific interactorstend to evolve slowly BMC Evol Biol 31

Jukes T H and C R Cantor 1969 Evolution of proteinmolecules Pp 21ndash132 in H N Munro ed Mammalianprotein metabolism Academic New York

Kamath R S A G Fraser Y Dong et al 2003 Systematicfunctional analysis of the Caenorhabditis elegans genomeusing RNAi Nature 421231ndash237

Karev G P Y I Wolf A Y Rzhetsky F S Berezovskayaand E V Koonin 2002 Birth and death of protein domainsa simple model of evolution explains power law behaviorBMC Evol Biol 218

Khaitovich P G Weiss M Lachmann I Hellman W EnardB Muetzel U Wiekner W Ansorge and S Paabo 2004 Aneutral model of transcriptome evolution PLOS Biol20682ndash0689

King M C and A C Wilson 1975 Evolution at two levels inhumans and chimpanzees Science 188107ndash116

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayashi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004


Downloaded from https://academic.oup.com/mbe/article/21/11/2058/1148025 by guest on 06 December 2021


FIG. 1. The dependence between expression breadth, expression level, and substitution rates of human genes. (a–d) Average (± standard error) human gene expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) human gene expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates. [Plot not reproduced; y-axes: expression breadth and expression level; x-axes: binned dN, dS, 3' UTR d, and 5' UTR d.]


thresholds (0.9–0.4), and the topological properties of the resulting series of networks were investigated by analyzing their node degree distributions, that is, the frequency distributions of the number of genes f(n) that have n coexpressed genes. As the r-value threshold increases, the number of edges in the network decreases, and the node degree distribution tends toward a power law distribution (see supplementary figure 3). Node degree distributions and graphic representations of the corresponding network topologies for r ≥ 0.9, 0.8, and 0.7 are shown in figure 2. Because the distribution for r ≥ 0.7 shows a good fit to a distribution with a power law tail while still retaining enough data for meaningful statistical analysis (fig. 2 and supplementary figure 3), the threshold of 0.7 was chosen for further analysis. A list of coexpressed gene pairs at r ≥ 0.7 is available at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable1.tab.
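The thresholding procedure described above can be illustrated with a minimal sketch. It assumes that coexpression is measured with the Pearson correlation coefficient between tissue-expression profiles; the function names and the toy profiles are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile carries no coexpression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.7):
    """Link every pair of genes whose profiles correlate at r >= threshold."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson_r(profiles[g1], profiles[g2]) >= threshold
    ]


def degree_distribution(edges):
    """Return f(n): the number of genes that have exactly n coexpressed genes."""
    degree = Counter()
    for g1, g2 in edges:
        degree[g1] += 1
        degree[g2] += 1
    return Counter(degree.values())


# Toy expression profiles (hypothetical values, four tissues per gene).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [1.0, 2.1, 2.9, 4.2],
    "geneD": [4.0, 3.0, 2.0, 1.0],
}
edges = coexpression_edges(toy_profiles, threshold=0.7)
f_n = degree_distribution(edges)
```

Raising `threshold` can only remove edges, which is why the networks become sparser as the r-value cutoff increases. Genes with no partner at the chosen cutoff (here `geneD`) do not appear in `f_n`.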

The node degree distribution for r ≥ 0.7 displays a good fit to a generalized Pareto distribution, where $f(n) \sim (n + 1.34)^{-1.17}$ (fig. 3a). This distribution has a power law tail, which implies asymptotically scale-free properties (Barabasi and Albert 1999). Such scale-free networks have no single characteristic node degree and are dominated by a small number of highly connected hubs (Barabasi and Albert 1999), in this case, genes that are coexpressed with many other genes. Similar scale-free properties have been previously detected for node degree distributions of several other gene expression-related parameters. For instance, gene expression levels (Hoyle et al. 2002; Kuznetsov, Knott, and Bonner 2002; Luscombe et al. 2002), as well as changes in gene expression level (Ueda et al. 2004), have been shown to follow power law distributions. Coexpression networks derived with different techniques for several cancers (Agrawal 2002) and for yeast, plants, and animals (Bhan, Galas, and Dewey 2002; Bergmann, Ihmels, and Barkai 2004) also show scale-free properties. Finally, a number of other biological and other evolving networks, including protein-protein interactions, metabolic networks, social contacts networks, and the Internet (Barabasi and Albert 1999; Jeong et al. 2000, 2001; Ravasz et al. 2002; Barabasi and Oltvai 2004), have node degree distributions that follow a power law and thus imply scale-free properties. The principal mode of evolution of scale-free networks is thought to be preferential attachment of new nodes to those that are already highly connected (i.e., "the rich get richer" or "the fit get fitter" model [Barabasi and Albert 1999]).
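To make the preferential attachment idea concrete, the following is a small simulation sketch of network growth in which each new node links to existing nodes with probability proportional to their current degree. It is a generic illustration of the "rich get richer" principle, not a model fitted to the coexpression data; the function name and parameters are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, links_per_node=1, seed=0):
    """Grow a network in which new nodes prefer well-connected targets."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single connected pair
    endpoints = [0, 1]    # each node appears once per link it holds
    for new_node in range(2, n_nodes):
        targets = set()
        while len(targets) < min(links_per_node, new_node):
            # Sampling from the endpoint list picks a node with
            # probability proportional to its current degree.
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((target, new_node))
            endpoints.extend((target, new_node))
    return edges


edges = preferential_attachment(n_nodes=2000, links_per_node=1, seed=1)
degree = Counter(node for edge in edges for node in edge)
f_n = Counter(degree.values())  # f(n): number of nodes with degree n
```

In such a simulation most nodes keep a single link while a few early nodes accumulate many, which is the hub-dominated pattern the text describes.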

Obviously, there is a transitive property to the correlation coefficients that are used to measure coexpression and connect nodes in the coexpression network. If the tissue-specific expression pattern of gene A is correlated with that of gene B, and the pattern of gene B is correlated with that of gene C, then one should expect expression of gene A to be correlated with that of gene C. However, the level of correlation between A and C is an open question because some groups of genes can be tightly coregulated, whereas others are only loosely coexpressed.

FIG. 2. The human gene coexpression network. Node degree distributions and network topologies are shown for networks where coexpressed genes (nodes) are linked if (a) r ≥ 0.9, (b) r ≥ 0.8, and (c) r ≥ 0.7.


To further evaluate the topological properties of the human gene coexpression network, the clustering of network nodes was analyzed. The clustering coefficient (C) of a given node is the ratio of the number of actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004). The average C for the gene coexpression network is 0.452. This indicates a high degree of clustering that is typical of many scale-free networks (Barabasi and Oltvai 2004); however, the fact that C < 1 indicates only limited transitivity in the gene coexpression network analyzed here. The shape of the dependence of C on the node degree is thought to be indicative of the mode of network growth (Barabasi and Oltvai 2004). A plot of the clustering coefficient, C(n), versus node degree (n) shows the absence of any clear trend between the two (fig. 3b). $C(n) \sim \text{constant}$, as seen here, is consistent with network growth by simple preferential attachment of nodes rather than growth by the hierarchical addition of modules, which is thought to be characterized by $C(n) \sim n^{-1}$ (Barabasi and Oltvai 2004).
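The definition of the clustering coefficient given above translates directly into code. The sketch below is illustrative only: the helper names and the toy edge list are hypothetical, and nodes with fewer than two neighbors are assigned C = 0 by assumption, because the ratio is undefined for them.

```python
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors in an undirected network."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def clustering_coefficient(adjacency, node):
    """Fraction of possible links among a node's neighbors that exist."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbors
    actual = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    possible = k * (k - 1) / 2
    return actual / possible


# Toy network: A is linked to B, C, and D; only B and C are linked to each other.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
adjacency = build_adjacency(toy_edges)
c_a = clustering_coefficient(adjacency, "A")  # 1 of 3 possible neighbor links
average_c = sum(clustering_coefficient(adjacency, n) for n in adjacency) / len(adjacency)
```

Averaging `clustering_coefficient` over nodes grouped by degree would give the C(n) curve discussed in the text; a flat curve corresponds to the $C(n) \sim \text{constant}$ case.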

The human gene coexpression network was examined to assess the functional relationships between coexpressed genes. Pairs of coexpressed genes were functionally classified using the eukaryotic KOG database (Tatusov et al. 2003) and the Gene Ontology database (Ashburner et al. 2000). Using both of these approaches, approximately 17.5% of coexpressed gene pairs were found to encode proteins with the same functional classification, a significant (P < 0.0001) but relatively small excess over the random expectation (13%). However, far more striking nonrandomness was observed when the coexpressed genes were examined for co-occurrence of different functions: certain combinations of seemingly related functions were strongly preferred (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable2.tab). For example, signal transduction genes are coexpressed with those involved in secretion, transcription, and posttranslational protein modification much more often than expected by chance. This is likely to reflect coexpression of genes coding for proteins that function together in biochemical and signaling pathways.

A number of highly connected individual clusters from the human gene coexpression network were further examined with respect to the expression patterns and functions of their member genes. The majority of these clusters are made up of genes with narrow, if not exclusive, tissue-specific expression patterns (see supplementary figure 4). Examples of the topologies of two of these clusters, along with their tissue-specific expression patterns, are shown in figure 4. The experimentally determined and predicted functions for the member genes of these two clusters are generally consistent with their expression patterns and indicate their involvement in pancreatic and testis-related physiological functions, respectively (table 2).

Conservation and Coevolution of Coexpressed Human Genes

For each human gene with an identifiable mouse ortholog, the number of other human genes that it is coexpressed with at r ≥ 0.7 (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable3.tab) was compared with its human-mouse substitution rate (fig. 5 and table 1). Genes that have a higher number of coexpressed genes (the node degree in the transcription coexpression network) are substantially more conserved than genes with fewer coexpressed partners. Qualitatively identical results are seen when the mouse expression data are used to compare the number of coexpressed mouse genes with human-mouse evolutionary rates (see supplementary figure 5a). As with the relationship between expression level and evolutionary rate, the negative correlation between the number of coexpressed genes and evolutionary rates was strongest for dN (fig. 5). Thus, the protein products of those genes that are hubs of the transcription coexpression network are subject to comparatively high levels of functional constraint, which could reflect their essential roles in cellular processes (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003), multifunctionality, and/or involvement in a large number of protein-protein interactions (Fraser et al. 2002; Jordan, Wolf, and Koonin 2003). The negative correlation between dS and d(3' UTR) and the number of coexpressed genes (fig. 5) is likely to result from purifying selection maintaining tight translational regulation of genes that are highly connected in the coexpression network. In contrast, the correlation between d(5' UTR) and the number of coexpressed genes was not significant (fig. 5 and table 1).

FIG. 3. Statistical properties of the human gene coexpression network. (a) Node degree distribution: number of genes, f(n), that have exactly n coexpressed genes (x-axis: node degree; y-axis: frequency; both on log scales). The equation and line for the best-fitting distribution are shown, $f(n) = 622\,(n + 1.34)^{-1.17}$. The probability associated with the χ² test for the goodness of fit of the observed data to the theoretical generalized Pareto distribution is shown, P(χ²) ~ 0.21. (b) Clustering coefficient (y-axis) plotted against the node degree (x-axis). Average clustering coefficients and standard deviations of the averages for node degree bins are shown.


Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS the latter correlations were substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation, $r_{XY \cdot Z}$, where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes, namely dN, dS, and d(3' UTR) (table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, $r_{XY \cdot Z}$ is not significantly smaller than $r_{XY}$ (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not explained by any correlation between the number of coexpressed genes and the expression level.
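For readers unfamiliar with partial correlation, the sketch below shows the standard first-order formula that removes the linear effect of a third variable Z from the correlation between X and Y. It is assumed here that this textbook form is the one meant by $r_{XY \cdot Z}$; the input coefficients in the example are hypothetical and are not values from the study.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical pairwise coefficients (not taken from table 1):
# X = number of coexpressed genes, Y = substitution rate, Z = expression level.
r_xy_given_z = partial_correlation(r_xy=-0.50, r_xz=0.30, r_yz=-0.20)
```

When Z is uncorrelated with both X and Y, the partial correlation equals the ordinary correlation, which is the situation the text describes as "no significant decrease".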

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5' and 3' UTRs also correlated negatively with pairwise r-values between human gene expression patterns, although the effect, which was statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as that seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3' UTRs, suggesting that this region is more functionally relevant than the 5' UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data are used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.


FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. [Plot not reproduced; y-axis: node degree; x-axes: binned dN, dS, 3' UTR d, and 5' UTR d.]

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences. [Plot not reproduced; y-axis: fraction r ≥ 0.7; x-axes: binned pairwise substitution rate differences.]


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant ($P < 10^{-10}$). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures, dN, dS, d(5' UTR), and d(3' UTR), of human-mouse sequence divergence was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods). [Plot not reproduced; x-axis: phyletic similarity; y-axis: fraction r ≥ 0.7.]

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

| Comparison | R (a) | P (b) | n (c) |
|---|---|---|---|
| dN vs. expression breadth | -1.000 | 5.5e-7 | 3906 |
| dS vs. expression breadth | -0.976 | 2.1e-5 | 3906 |
| 5' UTR d vs. expression breadth | -0.539 | ns | 1121 |
| 3' UTR d vs. expression breadth | -0.964 | 4.9e-5 | 1438 |
| dN vs. expression level | -1.000 | 5.5e-7 | 3906 |
| dS vs. expression level | -0.952 | 1.1e-4 | 3906 |
| 5' UTR d vs. expression level | -0.624 | ns | 1121 |
| 3' UTR d vs. expression level | -0.964 | 4.9e-5 | 1438 |
| dN vs. number coexpressed genes | -0.964 | 4.9e-5 | 2307 |
| dS vs. number coexpressed genes | -0.746 | 1.7e-2 | 2307 |
| 5' UTR d vs. number coexpressed genes | -0.588 | ns | 1119 |
| 3' UTR d vs. number coexpressed genes | -0.818 | 5.8e-3 | 1436 |
| dN difference vs. fraction r ≥ 0.7 | -0.976 | 2.1e-5 | 2659971 |
| dS difference vs. fraction r ≥ 0.7 | -0.249 | ns | 2659971 |
| 5' UTR d difference vs. fraction r ≥ 0.7 | -0.812 | 5.8e-3 | 625521 |
| 3' UTR d difference vs. fraction r ≥ 0.7 | -0.867 | 2.2e-3 | 1030330 |
| dN vs. expression divergence | 0.139 | ns | 551 |
| dS vs. expression divergence | -0.079 | ns | 551 |
| 5' UTR d vs. expression divergence | 0.164 | ns | 273 |
| 3' UTR d vs. expression divergence | -0.552 | ns | 382 |

(a) Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
(b) Level of significance (n = 10) associated with R values; ns = nonsignificant (i.e., P > 0.05).
(c) The total number of comparisons that were binned.
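The binned rank correlations reported in table 1 can be sketched as follows. This is an illustration only: it assumes that the 10 ascending bins contain equal numbers of genes, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def binned_means(rates, values, n_bins=10):
    """Sort genes by rate, split them into ascending bins, average values per bin."""
    pairs = sorted(zip(rates, values))
    size = len(pairs) / n_bins  # assumes at least n_bins genes
    means = []
    for b in range(n_bins):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        means.append(sum(v for _, v in chunk) / len(chunk))
    return means


# Toy data: 20 hypothetical genes whose expression value falls as the rate rises.
toy_rates = [0.01 * (i + 1) for i in range(20)]
toy_values = [100.0 - 3.0 * i for i in range(20)]
bin_means = binned_means(toy_rates, toy_values)
R = spearman_r(list(range(1, 11)), bin_means)
```

Because the correlation is computed over only 10 bin averages, a perfectly monotonic trend across bins yields R = -1 or 1 regardless of how much scatter exists among individual genes within a bin.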

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons. [Plot not reproduced; x-axis: r; y-axis: cumulative frequency.]


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.
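The comparison of expression profiles between orthologs and nonorthologous gene pairs described in this section can be sketched as below. The sketch assumes Pearson correlation over matched tissue vectors; the toy profiles use four tissues rather than the 19 used in the study, and all names and values are hypothetical.

```python
from math import sqrt
from statistics import median


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ortholog_correlations(human, mouse, orthologs):
    """r-values for each human-mouse ortholog pair."""
    return [pearson_r(human[h], mouse[m]) for h, m in orthologs]


def nonortholog_correlations(human, mouse, orthologs):
    """r-values for all human-mouse gene pairs that are not orthologs."""
    paired = set(orthologs)
    return [
        pearson_r(human[h], mouse[m])
        for h in sorted(human)
        for m in sorted(mouse)
        if (h, m) not in paired
    ]


# Toy profiles over four shared tissues (hypothetical values).
human = {"H1": [1.0, 2.0, 3.0, 4.0], "H2": [4.0, 3.0, 2.0, 1.0]}
mouse = {"M1": [2.0, 4.0, 6.0, 8.0], "M2": [8.0, 6.0, 4.0, 2.0]}
orthologs = [("H1", "M1"), ("H2", "M2")]

median_ortholog_r = median(ortholog_correlations(human, mouse, orthologs))
median_background_r = median(nonortholog_correlations(human, mouse, orthologs))
```

Expression divergence for an ortholog pair can then be read off its r-value (lower r means greater divergence) and compared against dN, dS, or UTR divergence for the same pair.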

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance ($P < 10^{-10}$). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2. Examples of Two Human Gene Coexpression Network Clusters

| LocusLink ID (a) | GenBank accession (a) | Gene symbol (b) | Annotation (c) | Physiological function |
|---|---|---|---|---|
| A Pancreas-Specific Gene Cluster | | | | |
| 1208 | P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme |
| 1357 | NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme |
| 1358 | NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme |
| 2641 | P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone |
| 2813 | AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas |
| 2906 | AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion (PMID 7768362) |
| 148223 | NP_689695 | | hypothetical protein FLJ36666 | Uncharacterized protein |
| 25803 | NP_036523 | SPDEF | SAM pointed domain-containing ets transcription factor | Prostate epithelium-specific transcription factor |
| 84444 | NP_115871 | | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity |
| 9436 | NP_004819 | NCR2 | natural cytotoxicity triggering receptor-2 | Mediator of killer cells' cytotoxicity |
| 9610 | AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway |
| 9753 | AAH41661 | ZNF305 | zinc finger protein-305 | Predicted transcription regulator |
| A Testis-Specific Gene Cluster | | | | |
| 6847 | Q15431 | SYCP1 | synaptonemal complex protein-1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase |
| 10388 | Q9BX26 | SYCP2 | synaptonemal complex protein-2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase |
| 27285 | NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein |
| 8852 | NP_647450 | AKAP4 | A-kinase anchor protein-4 isoform 2 | Major sperm sheath protein |
| 676 | AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling |
| 11077 | O75031 | HSF2BP | heat shock transcription factor-2 binding protein | Implicated in modulating HSF2 activation in testis |
| 7180 | P16562 | CRISP2 | testis-specific protein-1 | Probable sperm-coating glycoprotein |
| 11116 | NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage |
| 4291 | P58340 | MLF1 | myeloid leukemia factor-1 | Uncharacterized protein implicated in cell proliferation |
| 5261 | AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism |
| 54344 | O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide-3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation (PMID 6231179) |
| 7288 | O00295 | TULP2 | tubby-like protein-2 | Retina-specific and testis-specific heterotrimeric G-protein-responsive intracellular signaling factor |
| 51460 | NP_057413 | SFMBT1 | Scm-like with four mbt domains-1 | Polycomb group transcription regulator |
| 114049 | NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase |

(a) LocusLink identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
(b) Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/.
(c) Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and for the 3' UTR but not for the 5' UTR. Thus, the constraints on 5' UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short degenerate transcription factor-binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggests rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates. [Plot not reproduced; y-axis: correlation; x-axes: binned dN, dS, 5' UTR d, and 3' UTR d.]


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor-binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated: expression profiles are predicted to be more correlated with the distributions of specific transcription factor-binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1mdashThe dependence betweenexpression breadth level and substitution rates of mousegenes (andashd) Average (6 standard error) expressionbreadth values for 10 ascending bins of human-mouseorthologous gene substitution rates (endashh) Average (6standard error) expression level values for 10 ascendingbins of human-mouse orthologous gene substitution rates

Supplementary Figure 2mdashFrequency distributions ofhuman-mouse substitution rate ratios Ratios shown inlog scale (a) d(59 UTR)d(39 UTR) d(59 UTR)d(39 UTR)is shown in log scale In 43 of genes d(59 UTR) d(39UTR) whereas in 57 d(39 UTR) d(59 UTR) (b)d(59 UTR)dS (c) d(39 UTRdS)

Supplementary Figure 3mdashFrequency distributions ofthe number of links per node in human gene coexpressionnetworks For each network the correlation coefficientvalue (r) threshold used to determine whether any twogenes (nodes) are considered to be coexpressed (linked)are shown above the plot The slope of the linear trend linethat fits the data and the r2 value indicating the goodness offit to the power law are shown for each network plot

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004



thresholds (0.9–0.4), and the topological properties of the resulting series of networks were investigated by analyzing their node degree distributions, that is, the frequency distributions of the number of genes f(n) that have n coexpressed genes. As the r-value threshold increases, the number of edges in the network decreases and the node degree distribution seems to tend to a power law distribution (see supplementary figure 3). Node degree distributions and graphic representations of the corresponding network topologies for r ≥ 0.9, 0.8, and 0.7 are shown in figure 2. Because the distribution for r ≥ 0.7 shows a good fit to a distribution with a power law tail while still retaining enough data for meaningful statistical analysis (fig. 2 and see supplementary figure 3), the threshold of 0.7 was chosen for further analysis. A list of coexpressed gene pairs at r ≥ 0.7 is shown at ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable1.tab.
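The thresholding procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: it assumes that r is the Pearson correlation between tissue expression profiles, and the gene names, profiles, and function names (`pearson`, `coexpression_network`, `degree_distribution`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def coexpression_network(profiles, threshold=0.7):
    """Link every pair of genes whose profiles correlate at r >= threshold."""
    neighbors = {gene: set() for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            neighbors[g1].add(g2)
            neighbors[g2].add(g1)
    return neighbors


def degree_distribution(neighbors):
    """Map node degree n to f(n), the number of genes with n partners."""
    return Counter(len(partners) for partners in neighbors.values())


# Hypothetical expression profiles across four tissues (illustrative only).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
network = coexpression_network(toy_profiles, threshold=0.7)
distribution = degree_distribution(network)
```

Raising `threshold` removes edges, so repeating the call over a series of thresholds yields the series of networks whose degree distributions are compared in the text.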

The node degree distribution for r ≥ 0.7 displays a good fit to a generalized Pareto distribution, where $f(n) \sim (n + 13.4)^{-1.17}$ (fig. 3a). This distribution has a power law tail, which implies asymptotically scale-free properties (Barabasi and Albert 1999). Such scale-free networks have no single characteristic node degree and are dominated by a small number of highly connected hubs (Barabasi and Albert 1999), in this case, genes that are coexpressed with many other genes. Similar scale-free properties have been previously detected for node degree distributions of several other gene expression–related parameters. For instance, gene expression levels (Hoyle et al. 2002; Kuznetsov, Knott, and Bonner 2002; Luscombe et al. 2002), as well as changes in gene expression level (Ueda et al. 2004), have been shown to follow power law distributions. Coexpression networks derived with different techniques for several cancers (Agrawal 2002) and for yeast, plants, and animals (Bhan, Galas, and Dewey 2002; Bergmann, Ihmels, and Barkai 2004) also show scale-free properties. Finally, a number of other biological and other evolving networks, including protein-protein interactions, metabolic networks, social contacts networks, and the Internet (Barabasi and Albert 1999; Jeong et al. 2000, 2001; Ravasz et al. 2002; Barabasi and Oltvai 2004), have node degree distributions that follow a power law and thus imply scale-free properties. The principal mode of evolution of scale-free networks is thought to be preferential attachment of new nodes to those that are already highly connected (i.e., "the rich get richer" or "the fit get fitter" model [Barabasi and Albert 1999]).
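The statement that a generalized Pareto distribution has a power law tail can be made explicit with a short derivation. The symbols c, a, and γ below are generic notation assumed here for the prefactor, offset, and exponent of the fitted curve; they are not taken from the text.

```latex
\begin{align}
  f(n) &= c\,(n + a)^{-\gamma}
  &&\text{generalized Pareto form} \\
  \ln f(n) &= \ln c - \gamma \ln(n + a)
  &&\text{curved on log-log axes for } n \lesssim a \\
  \ln f(n) &\approx \ln c - \gamma \ln n
  &&\text{for } n \gg a \text{ (straight line with slope } -\gamma)
\end{align}
```

For node degrees much larger than the offset, the distribution is therefore indistinguishable from a pure power law, which is what "asymptotically scale-free" refers to.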

Obviously, there is a transitive property to the correlation coefficients that are used to measure coexpression and connect nodes in the coexpression network. If the tissue-specific expression pattern of gene A is correlated with that of gene B and the pattern of gene B is correlated with that of gene C, then one should expect expression of gene A to be correlated with that of gene C. However, the level of correlation between A and C is an open question because some groups of genes can be tightly coregulated, whereas others are only loosely coexpressed.

FIG. 2. The human gene coexpression network. Node degree distributions and network topologies are shown for networks where coexpressed genes (nodes) are linked if (a) r ≥ 0.9, (b) r ≥ 0.8, and (c) r ≥ 0.7.


To further evaluate the topological properties of the human gene coexpression network, the clustering of network nodes was analyzed. The clustering coefficient (C) of a given node is the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004). The average C for the gene coexpression network is 0.452. This indicates a high degree of clustering that is typical of many scale-free networks (Barabasi and Oltvai 2004); however, the fact that C < 1 indicates only limited transitivity in the gene coexpression network analyzed here. The shape of the dependence of C on the node degree is thought to be indicative of the mode of network growth (Barabasi and Oltvai 2004). Furthermore, a plot of the clustering coefficient C(n) versus node degree (n) shows the absence of any clear trend between the two (fig. 3b). $C(n) \sim$ constant, as seen here, is consistent with network growth by simple preferential attachment of nodes rather than growth by the hierarchical addition of modules, which is thought to be characterized by $C(n) \sim n^{-1}$ (Barabasi and Oltvai 2004).
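The definition of the clustering coefficient given above translates directly into code. The sketch below is illustrative only: the toy network and function names are hypothetical, and assigning C = 0 to nodes with fewer than two neighbors is a convention assumed here, not one stated in the text.

```python
from itertools import combinations


def clustering_coefficient(neighbors, node):
    """Actual links among a node's neighbors divided by the possible links."""
    partners = neighbors[node]
    k = len(partners)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs, so C is set to 0
    actual = sum(1 for u, v in combinations(partners, 2) if v in neighbors[u])
    possible = k * (k - 1) / 2
    return actual / possible


def average_clustering(neighbors):
    """Mean clustering coefficient over all nodes of the network."""
    total = sum(clustering_coefficient(neighbors, gene) for gene in neighbors)
    return total / len(neighbors)


# Hypothetical network: a triangle (A, B, C) with one pendant node (D).
toy_network = {
    "geneA": {"geneB", "geneC", "geneD"},
    "geneB": {"geneA", "geneC"},
    "geneC": {"geneA", "geneB"},
    "geneD": {"geneA"},
}
c_hub = clustering_coefficient(toy_network, "geneA")
c_mean = average_clustering(toy_network)
```

In a fully transitive network every node would have C = 1; averaging `clustering_coefficient` within node degree bins gives the C(n) curve whose shape is discussed above.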

The human gene coexpression network was examined to assess the functional relationships between coexpressed genes. Pairs of coexpressed genes were functionally classified using the eukaryotic KOG database (Tatusov et al. 2003) and the Gene Ontology database (Ashburner et al. 2000). Using both of these approaches, approximately 17.5% of coexpressed gene pairs were found to encode proteins with the same functional classification, a significant (P < 0.0001) but relatively small excess over the random expectation (13%). However, far more striking nonrandomness was observed when the coexpressed genes were examined for co-occurrence of different functions: certain combinations of seemingly related functions were strongly preferred (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable2.tab). For example, signal transduction genes are coexpressed with those involved in secretion, transcription, and posttranslational protein modification much more often than expected by chance. This is likely to reflect coexpression of genes coding for proteins that function together in biochemical and signaling pathways.

A number of highly connected individual clusters from the human gene coexpression network were further examined with respect to the expression patterns and functions of their member genes. The majority of these clusters are made up of genes with narrow, if not exclusive, tissue-specific expression patterns (see supplementary figure 4). Examples of the topologies of two of these clusters, along with their tissue-specific expression patterns, are shown in figure 4. The experimentally determined and predicted functions for the member genes of these two clusters are generally consistent with their expression patterns and indicate their involvement in pancreatic and testis-related physiological functions, respectively (table 2).

Conservation and Coevolution of Coexpressed Human Genes

For each human gene with an identifiable mouse ortholog, the number of other human genes that it is coexpressed with at r ≥ 0.7 (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable3.tab) was compared with its human-mouse substitution rate (fig. 5 and table 1). Genes that have a higher number of coexpressed genes (the node degree in the transcription coexpression network) are substantially more conserved than genes with fewer coexpressed partners. Qualitatively identical results are seen when the mouse expression data are used to compare the number of coexpressed mouse genes with human-mouse evolutionary rates (see supplementary figure 5a). As with the relationship between expression level and evolutionary rate, the negative correlation between the number of coexpressed genes and evolutionary rates was strongest for dN (fig. 5). Thus, the protein products of those genes that are hubs of the transcription coexpression network are subject to comparatively high levels of functional constraint, which could reflect their essential roles in cellular processes (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003), multifunctionality, and/or involvement in a large number of protein-protein interactions (Fraser et al. 2002; Jordan, Wolf, and Koonin 2003). The negative correlations of dS and d(3′ UTR) with the number of coexpressed genes (fig. 5) are likely to result from purifying selection maintaining tight translational regulation of genes that are highly connected in the coexpression network. In contrast, the correlation between d(5′ UTR) and the number of coexpressed genes was not significant (fig. 5 and table 1).

[Fig. 3 panels: (a) frequency versus node degree on log-log axes, with the fitted curve f(n) = 622 (n + 13.4)^−1.17 and P(χ²) ~ 0.21; (b) clustering coefficient versus node degree.]

FIG. 3. Statistical properties of the human gene coexpression network. (a) Node degree distribution. Number of genes f(n) that have exactly n coexpressed genes. Equation and line for the best fitting distribution are shown. Probability associated with the χ² test for the goodness of fit of the observed data to the theoretical generalized Pareto distribution is shown. (b) Clustering coefficient (y-axis) plotted against the node degree (x-axis). Average clustering coefficients and standard deviations of the averages for node degree bins are shown.


Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS, the latter correlation was substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation $r_{XY \cdot Z}$, where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes, namely dN, dS, and d(3′ UTR) (table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, $r_{XY \cdot Z}$ is not significantly smaller than $r_{XY}$ (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not based on any correlation between the number of coexpressed genes and the expression level.
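The control described above can be illustrated with the standard first-order partial correlation formula, which removes the linear effect of a third variable Z from the correlation between X and Y. This is a sketch under assumptions: the text does not state which correlation coefficient or software was used, and the function name and the numerical inputs below are hypothetical.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z.

    r_xy, r_xz, and r_yz are the three pairwise correlation coefficients.
    """
    denominator = sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("Z is perfectly correlated with X or Y")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical coefficients: X = number of coexpressed genes,
# Y = substitution rate, Z = expression level.
unchanged = partial_correlation(-0.6, 0.0, 0.0)   # Z unrelated to X and Y
weakened = partial_correlation(-0.6, 0.5, -0.5)   # Z explains part of r_xy
```

If the partial coefficient stays close to the raw coefficient, as the text reports, the relationship between node degree and substitution rate cannot be attributed to expression level alone.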

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5′ and 3′ UTRs also negatively correlated with pairwise r-values between human gene expression patterns, although the effect, statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3′ UTRs, suggesting that this region is more functionally relevant than the 5′ UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data are used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.


[Fig. 5 panels: node degree (y-axis) versus ascending bins of dN, dS, 3′ UTR d, and 5′ UTR d (x-axes).]

FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown.

[Fig. 6 panels: fraction of gene pairs with r ≥ 0.7 (y-axis) versus ascending bins of pairwise substitution rate differences (x-axes), one panel per substitution rate measure.]

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant ($P < 10^{-10}$). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures of human-mouse sequence divergence, dN, dS, d(5′ UTR), and d(3′ UTR), was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

[Fig. 7 plot: fraction of gene pairs with r ≥ 0.7 (y-axis) versus ascending bins of phyletic similarity (x-axis).]

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods).

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

Comparison | R (a) | P (b) | n (c)
dN vs. expression breadth | −1.000 | 5.5e-7 | 3906
dS vs. expression breadth | −0.976 | 2.1e-5 | 3906
5′ UTR d vs. expression breadth | −0.539 | ns | 1121
3′ UTR d vs. expression breadth | −0.964 | 4.9e-5 | 1438
dN vs. expression level | −1.000 | 5.5e-7 | 3906
dS vs. expression level | −0.952 | 1.1e-4 | 3906
5′ UTR d vs. expression level | −0.624 | ns | 1121
3′ UTR d vs. expression level | −0.964 | 4.9e-5 | 1438
dN vs. number coexpressed genes | −0.964 | 4.9e-5 | 2307
dS vs. number coexpressed genes | −0.746 | 1.7e-2 | 2307
5′ UTR d vs. number coexpressed genes | −0.588 | ns | 1119
3′ UTR d vs. number coexpressed genes | −0.818 | 5.8e-3 | 1436
dN difference vs. fraction r ≥ 0.7 | −0.976 | 2.1e-5 | 2659971
dS difference vs. fraction r ≥ 0.7 | −0.249 | ns | 2659971
5′ UTR d difference vs. fraction r ≥ 0.7 | −0.812 | 5.8e-3 | 625521
3′ UTR d difference vs. fraction r ≥ 0.7 | −0.867 | 2.2e-3 | 1030330
dN vs. expression divergence | 0.139 | ns | 551
dS vs. expression divergence | −0.079 | ns | 551
5′ UTR d vs. expression divergence | 0.164 | ns | 273
3′ UTR d vs. expression divergence | −0.552 | ns | 382

(a) Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
(b) Level of significance (n = 10) associated with R values; ns = nonsignificant (i.e., P > 0.05).
(c) The total number of comparisons that were binned.
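The statistic reported in table 1 can be sketched as follows: genes are sorted by substitution rate, split into 10 ascending bins, the expression parameter is averaged within each bin, and a Spearman rank correlation is computed across the 10 bin averages. The sketch assumes equal-count bins, which the table footnotes do not specify, and all function names and data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def binned_means(rates, values, n_bins=10):
    """Average `values` within equal-count bins of ascending `rates`."""
    if len(rates) < n_bins:
        raise ValueError("need at least one gene per bin")
    pairs = sorted(zip(rates, values))
    size = len(pairs) / n_bins
    means = []
    for b in range(n_bins):
        chunk = pairs[int(b * size):int((b + 1) * size)]
        means.append(sum(v for _, v in chunk) / len(chunk))
    return means


# Hypothetical data: node degree falls steadily as substitution rate rises.
toy_rates = [i / 100 for i in range(1, 101)]
toy_degrees = [200 - i for i in range(1, 101)]
bin_means = binned_means(toy_rates, toy_degrees)
R = spearman(list(range(len(bin_means))), bin_means)
```

Because the correlation is computed over 10 bin averages rather than over individual genes, a perfectly monotonic trend across bins gives R = −1.000 regardless of the scatter within each bin, which is worth keeping in mind when reading the table.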

[Fig. 8 plot: cumulative frequency (y-axis, 0 to 1) versus r (x-axis, −1 to 1).]

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons.


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance ($P < 10^{-10}$). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2. Examples of Two Human Gene Coexpression Network Clusters

LocusLink ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function

A Pancreas-Specific Gene Cluster
1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion (PMID 7768362)
148223/NP_689695 | (none) | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain–containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | (none) | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor–2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein–305 | Predicted transcription regulator

A Testis-Specific Gene Cluster
6847/Q15431 | SYCP1 | synaptonemal complex protein–1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein–2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein–4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor–2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein–1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor–1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide–3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation (PMID 6231179)
7288/O00295 | TULP2 | tubby-like protein–2 | Retina-specific and testis-specific heterotrimeric G-protein–responsive intracellular signaling factor
51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains–1 | Polycomb group transcription regulator
114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase

(a) LocusLink identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
(b) Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/.
(c) Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and 3′ UTR but not for the 5′ UTR. Thus, the constraints on 5′ UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that, although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed as compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggest rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates. Panels: dN, dS, 3′ UTR d, and 5′ UTR d; y-axis: correlation.


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth/level and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios shown in log scale. (a) d(5′ UTR)/d(3′ UTR). d(5′ UTR)/d(3′ UTR) is shown in log scale. In 43% of genes d(5′ UTR) d(3′ UTR), whereas in 57% d(3′ UTR) d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004



To further evaluate the topological properties of the human gene coexpression network, the clustering of network nodes was analyzed. The clustering coefficient (C) of a given node is the ratio of the number of the actual connections between the neighbors of the node to the number of possible connections between them (Barabasi and Oltvai 2004). The average C for the gene coexpression network is 0.452. This indicates a high degree of clustering that is typical of many scale-free networks (Barabasi and Oltvai 2004); however, the fact that C < 1 indicates only limited transitivity in the gene coexpression network analyzed here. The shape of the dependence of C on the node degree is thought to be indicative of the mode of network growth (Barabasi and Oltvai 2004). Furthermore, a plot of the clustering coefficient C(n) versus node degree (n) shows the absence of any clear trend between the two (fig. 3b). C(n) ≈ constant, as seen here, is consistent with network growth by simple preferential attachment of nodes rather than growth by the hierarchical addition of modules that is thought to be characterized by C(n) ≈ n^-1 (Barabasi and Oltvai 2004).
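The definition of the clustering coefficient given above can be illustrated with a minimal Python sketch. The toy network, the gene labels, and the function name are hypothetical and are not taken from the analysis; reporting C = 0 for nodes with fewer than two neighbors is an assumed convention.

```python
from itertools import combinations

def clustering_coefficient(adjacency, node):
    """Ratio of actual links among a node's neighbors to the possible links among them."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: C is undefined here, reported as 0
    # Count neighbor pairs that are themselves linked.
    actual = sum(1 for a, b in combinations(sorted(neighbors), 2) if b in adjacency[a])
    possible = k * (k - 1) / 2
    return actual / possible

# Hypothetical undirected toy network: gene -> set of coexpressed genes.
toy_network = {
    "g1": {"g2", "g3", "g4"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2"},
    "g4": {"g1"},
}

per_node_c = {gene: clustering_coefficient(toy_network, gene) for gene in toy_network}
average_c = sum(per_node_c.values()) / len(per_node_c)
```

Averaging the per-node values over all nodes gives the network-wide C of the kind reported in the text; binning the per-node values by node degree would give the C(n) dependence.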

The human gene coexpression network was examined to assess the functional relationships between coexpressed genes. Pairs of coexpressed genes were functionally classified using the eukaryotic KOG database (Tatusov et al. 2003) and the Gene Ontology database (Ashburner et al. 2000). Using both of these approaches, approximately 17.5% of coexpressed gene pairs were found to encode proteins with the same functional classification, a significant (P < 0.0001) but relatively small excess over the random expectation (13%). However, far more striking nonrandomness was observed when the coexpressed genes were examined for co-occurrence of different functions: certain combinations of seemingly related functions were strongly preferred (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable2.tab). For example, signal transduction genes are coexpressed with those involved in secretion, transcription, and posttranslational protein modification much more often than expected by chance. This is likely to reflect coexpression of genes coding for proteins that function together in biochemical and signaling pathways.

A number of highly connected individual clusters from the human gene coexpression network were further examined with respect to the expression patterns and functions of their member genes. The majority of these clusters are made up of genes with narrow, if not exclusive, tissue-specific expression patterns (see supplementary figure 4). Examples of the topologies of two of these clusters, along with their tissue-specific expression patterns, are shown in figure 4. The experimentally determined and predicted functions for the member genes of these two clusters are generally consistent with their expression patterns and indicate their involvement in pancreatic and testis-related physiological functions, respectively (table 2).

Conservation and Coevolution of Coexpressed Human Genes

For each human gene with an identifiable mouse ortholog, the number of other human genes that it is coexpressed with at r ≥ 0.7 (ftp://ftp.ncbi.nih.gov/pub/koonin/Jordan/MBE-04-0138.R1/SupplementaryTable3.tab) was compared with its human-mouse substitution rate (fig. 5 and table 1). Genes that have a higher number of coexpressed genes (the node degree in the transcription expression network) are substantially more conserved than genes with fewer coexpressed partners. Qualitatively identical results are seen when the mouse expression data are used to compare the number of coexpressed mouse genes with human-mouse evolutionary rates (see supplementary figure 5a). As with the relationship between expression level and evolutionary rate, the negative correlation between the number of coexpressed genes and evolutionary rates was strongest for dN (fig. 5). Thus, the protein products of those genes that are hubs of the transcription coexpression network are subject to comparatively high levels of functional constraint, which could reflect their essential roles in cellular processes (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003), multifunctionality, and/or involvement in a large number of protein-protein interactions (Fraser et al. 2002; Jordan, Wolf, and Koonin 2003). The negative correlation between dS and d(3′ UTR) and the number of coexpressed genes (fig. 5) is likely to result from purifying selection maintaining tight translational regulation of genes that are highly connected in the coexpression network. In contrast, the correlation between d(5′ UTR) and the number of coexpressed genes was not significant (fig. 5 and table 1).
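As a rough illustration of how a node degree at r ≥ 0.7 can be obtained from expression profiles, the sketch below links two genes when the Pearson correlation of their tissue profiles reaches the threshold and then counts links per gene. The profiles and gene labels are invented for the example, and the use of the Pearson coefficient on raw values is an assumption of this sketch rather than a statement of the exact procedure used in the study.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # fails for constant profiles (zero variance)

def node_degrees(profiles, threshold=0.7):
    """Number of coexpressed partners (r >= threshold) for every gene."""
    degree = {gene: 0 for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            degree[g1] += 1
            degree[g2] += 1
    return degree

# Hypothetical expression profiles across four tissues.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 2.0, 2.9, 4.2],
}
degrees = node_degrees(profiles)
```

In this toy case geneA, geneB, and geneD are mutually linked while geneC, whose profile runs in the opposite direction, has no partners. The all-pairs loop is quadratic in the number of genes, which is acceptable for a sketch but would need a vectorized approach for genome-scale data.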

FIG. 3. Statistical properties of the human gene coexpression network. (a) Node degree distribution. Number of genes f(n) that have exactly n coexpressed genes. Equation and line for the best fitting distribution are shown. Probability associated with the χ² test for the goodness of fit of the observed data to the theoretical generalized Pareto distribution is shown. (b) Clustering coefficient (y-axis) plotted against the node degree (x-axis). Average clustering coefficients and standard deviations of the averages for node degree bins are shown. Axes: node degree (log scale, 1 to 1000) versus frequency in (a) and versus clustering coefficient in (b).


Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS, the latter correlation was substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation rXY.Z, where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes, namely dN, dS, and d(3′ UTR) (table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, rXY.Z is not significantly smaller than rXY (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not based on any correlation between the number of coexpressed genes and the expression level.
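For readers unfamiliar with partial correlation, the standard first-order formula, rXY.Z = (rXY − rXZ·rYZ) / sqrt((1 − rXZ²)(1 − rYZ²)), can be sketched as follows. The numerical inputs are invented for illustration and are not values from table 1; the function name is hypothetical.

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))

# Hypothetical pairwise correlations:
# X = number of coexpressed genes, Y = substitution rate, Z = expression level.
r_xy, r_xz, r_yz = -0.5, 0.3, -0.4
r_xy_given_z = partial_correlation(r_xy, r_xz, r_yz)
```

If controlling for Z leaves the coefficient close to the uncontrolled rXY, as in this toy case, the X–Y relationship is not explained by their shared dependence on Z, which is the logic of the argument in the text.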

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5′ and 3′ UTRs also negatively correlated with pairwise r-values between human gene expression patterns, although the effect, statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3′ UTRs, suggesting that this region is more functionally relevant than the 5′ UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data were used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).
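The pairwise comparison described above can be sketched in a few lines of Python: every gene pair is assigned its absolute substitution rate difference, the pairs are sorted into ascending bins, and the fraction of pairs with r ≥ 0.7 is computed per bin. The data, the equal-size binning, and the function names are assumptions made for this illustration; only two bins are used here because the toy set has just six pairs.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def coexpressed_fraction_by_rate_difference(profiles, rates, n_bins=10, threshold=0.7):
    """Fraction of gene pairs with r >= threshold in ascending bins of |rate difference|."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        difference = abs(rates[g1] - rates[g2])
        linked = pearson(profiles[g1], profiles[g2]) >= threshold
        pairs.append((difference, linked))
    pairs.sort(key=lambda pair: pair[0])
    size = len(pairs) // n_bins  # assumed: equal-size bins, remainder goes to the last bin
    fractions = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size:end]
        fractions.append(sum(1 for _, linked in chunk if linked) / len(chunk))
    return fractions

# Hypothetical profiles and substitution rates (e.g., dN) for four genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 2.0, 2.9, 4.2],
}
rates = {"geneA": 0.10, "geneB": 0.12, "geneC": 0.50, "geneD": 0.11}
fractions = coexpressed_fraction_by_rate_difference(profiles, rates, n_bins=2)
```

In this toy set the bin of small rate differences contains only coexpressed pairs and the bin of large differences contains none, which is the direction of the trend reported for dN.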

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.


FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. Panels: dN, dS, 3′ UTR d, and 5′ UTR d; y-axis: node degree.

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences. Y-axis: fraction r ≥ 0.7.


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In a general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10^-10). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures, dN, dS, d(5′ UTR), and d(3′ UTR), of human-mouse sequence divergence was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods). X-axis: phyletic similarity; y-axis: fraction r ≥ 0.7.

Table 1
Evolutionary Rate Versus Gene Expression Comparisons

Comparison | R (a) | P (b) | n (c)
dN vs. expression breadth | −1.000 | 5.5e-7 | 3906
dS vs. expression breadth | −0.976 | 2.1e-5 | 3906
5′ UTR d vs. expression breadth | −0.539 | ns | 1121
3′ UTR d vs. expression breadth | −0.964 | 4.9e-5 | 1438
dN vs. expression level | −1.000 | 5.5e-7 | 3906
dS vs. expression level | −0.952 | 1.1e-4 | 3906
5′ UTR d vs. expression level | −0.624 | ns | 1121
3′ UTR d vs. expression level | −0.964 | 4.9e-5 | 1438
dN vs. number coexpressed genes | −0.964 | 4.9e-5 | 2307
dS vs. number coexpressed genes | −0.746 | 1.7e-2 | 2307
5′ UTR d vs. number coexpressed genes | −0.588 | ns | 1119
3′ UTR d vs. number coexpressed genes | −0.818 | 5.8e-3 | 1436
dN difference vs. fraction r ≥ 0.7 | −0.976 | 2.1e-5 | 2659971
dS difference vs. fraction r ≥ 0.7 | −0.249 | ns | 2659971
5′ UTR d difference vs. fraction r ≥ 0.7 | −0.812 | 5.8e-3 | 625521
3′ UTR d difference vs. fraction r ≥ 0.7 | −0.867 | 2.2e-3 | 1030330
dN vs. expression divergence | 0.139 | ns | 551
dS vs. expression divergence | −0.079 | ns | 551
5′ UTR d vs. expression divergence | 0.164 | ns | 273
3′ UTR d vs. expression divergence | −0.552 | ns | 382

a Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
b Level of significance (n = 10) associated with R values. ns = nonsignificant (i.e., P > 0.05).
c The total number of comparisons that were binned.
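The binned statistic described in footnote a can be illustrated with the following sketch, which sorts genes by substitution rate, splits them into 10 ascending bins, averages an expression parameter within each bin, and computes the Spearman rank correlation between bin order and bin average. The synthetic data, the equal-size binning, and the helper names are assumptions of this example, not the authors' code.

```python
from math import sqrt

def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)

def bin_means(rates, values, n_bins=10):
    """Mean of `values` within ascending, equal-size bins of `rates`."""
    pairs = sorted(zip(rates, values))
    size = len(pairs) // n_bins
    means = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size:end]
        means.append(sum(v for _, v in chunk) / len(chunk))
    return means

# Synthetic example: 40 genes whose expression parameter falls as the rate rises.
rates = [i / 100 for i in range(1, 41)]
expression = [50 - i for i in range(1, 41)]
means = bin_means(rates, expression)
R = spearman(list(range(1, 11)), means)
```

Because the correlation is computed on 10 bin averages rather than on individual genes, a perfectly monotonic trend across bins yields R = −1 regardless of the scatter within bins, which is worth keeping in mind when reading the R values in the table.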

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons. X-axis: r; y-axis: cumulative frequency.


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10^-10). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2
Examples of Two Human Gene Coexpression Network Clusters

Locus Link ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function

A Pancreas-Specific Gene Cluster

1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion (PMID 7768362)
148223/NP_689695 | | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain–containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor–2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein–305 | Predicted transcription regulator

A Testis-Specific Gene Cluster

6847/Q15431 | SYCP1 | synaptonemal complex protein–1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein–2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein–4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor–2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein–1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor–1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide–3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation (PMID 6231179)

7288O00295 TULP2 tubby-like proteinndash2 Retina-specific and testis-specificheterotrimeric G-proteinndashresponsiveintracellular signaling factor

51460NP_057413 SFMBT1 Scm-like with four mbt domainsndash1 Polycomb group transcription regulator114049 NP_684281 WBSCR22 Williams Beuren syndrome chromosome

region 22 proteintRNA 5-cytosine methylase

a Locus Link identification numbers and GenBank accessions from NCBIrsquos LocusLink database httpwwwncbinlmnihgovLocusLinkindexhtmlb Official gene symbol from the Human Genome Organization (HUGO) httpwwwgeneuclacuknomenclaturec Official HUGO gene name

Evolution and the Human Gene Coexpression Network 2067

Dow

nloaded from httpsacadem

icoupcomm

bearticle211120581148025 by guest on 06 Decem

ber 2021

interaction networks which showed that the networkstructure is not predominantly shaped by duplication(Wagner 2003)

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and the 3′ UTR but not for the 5′ UTR. Thus, the constraints on 5′ UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggests rapid divergence of expression patterns after duplication. Once the expression patterns of homologous genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates. Panels: dN, dS, 3′ UTR d, and 5′ UTR d; vertical axis: correlation.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth/level and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5′ UTR)/d(3′ UTR). In 43% of genes, d(5′ UTR) > d(3′ UTR), whereas in 57%, d(3′ UTR) > d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.
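The trend-line fit described for Supplementary Figure 3 can be illustrated with a minimal sketch. It assumes, as the caption suggests but does not spell out, an ordinary least-squares line through the points (log10 k, log10 N(k)), where N(k) is the number of nodes with k links. The function name `loglog_fit` and the toy degree list are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def loglog_fit(degrees):
    """Fit log10(count) = intercept + slope * log10(degree) by least squares.

    Returns (slope, r_squared). Nodes with zero links are ignored because
    log10(0) is undefined. At least two distinct degrees are required.
    """
    counts = Counter(k for k in degrees if k > 0)
    points = [(math.log10(k), math.log10(c)) for k, c in counts.items()]
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared

# Toy degree list (hypothetical) constructed so that N(k) = 64 / k**2.
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, r_squared = loglog_fit(toy_degrees)
print(slope, r_squared)
```

A slope close to a negative constant together with an r² near 1 is what a power-law degree distribution would produce under this fit. Least squares on log-log counts is only one of several ways to assess a power law, and it is used here solely because it matches the caption's wording.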

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.
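The binned comparisons used in these figures (10 ascending bins of substitution rate, with a Spearman rank correlation computed across the bin averages) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: equal-sized bins are assumed, and the function names and toy data are hypothetical. The last line computes the exact two-sided permutation probability of a perfect rank correlation across 10 bins, 2/10!, which is about 5.5e-7. That happens to coincide with the P value that table 1 lists for R = -1.000, although the paper does not state how its P values were obtained.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def binned_means(rates, values, n_bins=10):
    """Sort genes by rate, split into n_bins equal-sized ascending bins
    (an assumption), and return (mean rate, mean value) for each bin."""
    if len(rates) < n_bins:
        raise ValueError("need at least one gene per bin")
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    result = []
    for b in range(n_bins):
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        result.append((sum(rates[i] for i in idx) / len(idx),
                       sum(values[i] for i in idx) / len(idx)))
    return result

# Toy data (hypothetical): node degree falls steadily as the rate rises.
rates = [0.01 * (i + 1) for i in range(20)]
degrees = [40 - 2 * i for i in range(20)]
bins = binned_means(rates, degrees)
R = spearman([r for r, _ in bins], [v for _, v in bins])

# Exact two-sided probability of a perfect ranking among 10 bins.
exact_p_perfect = 2 / math.factorial(10)
print(R, exact_p_perfect)
```

Because the correlation is computed over 10 bin averages rather than over individual genes, n = 10 for every significance test regardless of how many genes were binned.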

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004



Although the correlation between dN and the number of coexpressed genes was nearly as strong as that between dN and expression breadth and level, for dS the latter correlation was substantially stronger (table 1). This is not surprising because codon usage adaptation is particularly important for highly expressed genes (Carbone, Zinovyev, and Kepes 2003). Partial correlation rXY·Z, where X = number of coexpressed genes, Y = substitution rate, and Z = expression level, was used to control for the effects of expression level on the relationship between the number of coexpressed genes and substitution rates. In all cases where the substitution rate was found to be significantly correlated with the number of coexpressed genes, namely dN, dS, and d(3′ UTR) (table 1), the application of partial correlation did not result in any significant decrease of the correlation coefficient; that is, rXY·Z is not significantly smaller than rXY (0.38 < P < 0.56). Thus, the effect of the number of coexpressed genes on substitution rates is not based on any correlation between the number of coexpressed genes and the expression level.
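The partial correlation described above can be written out explicitly. The sketch below assumes the standard first-order formula, rXY·Z = (rXY - rXZ·rYZ) / sqrt((1 - rXZ²)(1 - rYZ²)), which the text does not state. The function names and the toy values are hypothetical and serve only to show the calculation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy values (hypothetical):
# X = number of coexpressed genes, Y = substitution rate, Z = expression level.
X = [50, 42, 35, 30, 22, 18, 12, 9, 5, 2]
Y = [0.01, 0.02, 0.03, 0.05, 0.04, 0.07, 0.09, 0.12, 0.19, 0.52]
Z = [9.1, 7.4, 8.8, 6.0, 7.9, 5.2, 6.6, 4.1, 5.0, 3.3]

r_xy = pearson(X, Y)
r_xy_given_z = partial_corr(X, Y, Z)
print(round(r_xy, 3), round(r_xy_given_z, 3))
```

If rXY·Z stays close to rXY, as the text reports, the association between X and Y cannot be attributed to their shared dependence on Z.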

The relationship between gene coexpression and evolutionary rate was further examined by comparing the r-values between pairs of human gene expression patterns with their pairwise gene substitution rate differences (fig. 6 and table 1). The results show that bins of human genes with similar values of dN between human and mouse (i.e., those genes that evolve at similar rates) have a greater fraction of pairwise r-values ≥ 0.7. This negative correlation was substantial and statistically significant (fig. 6 and table 1). In contrast, the pairwise dS difference between human genes did not correlate with the difference in expression patterns (fig. 6). Thus, genes with similar patterns of expression tend to evolve at similar rates, and the selection that leads to such coevolution appears to operate primarily at the protein sequence level. The difference in evolution rates of 5′ and 3′ UTRs also negatively correlated with pairwise r-values between human gene expression patterns, although the effect, statistically significant because of the large number of analyzed gene pairs (table 1), was not nearly as pronounced as seen for dN (fig. 6). As with other comparisons between evolutionary rate and expression patterns, the magnitude of this relationship was greater for 3′ UTRs, suggesting that this region is more functionally relevant than the 5′ UTR with respect to the pattern of gene expression. Qualitatively identical results are seen when the mouse expression data are used to compare r-values between pairs of mouse genes with pairwise human-mouse gene substitution rate differences (see supplementary figure 5b).
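The pairwise comparison described above can be sketched as follows: every gene pair receives a substitution rate difference and an expression correlation r, pairs are sorted into ascending bins of rate difference, and the fraction of pairs with r ≥ 0.7 is reported per bin. This is a minimal illustration, not the authors' procedure. Equal-sized bins are an assumption, and the function name `fraction_coexpressed_by_bin`, the gene labels, and the toy values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fraction_coexpressed_by_bin(profiles, rates, n_bins=10, r_min=0.7):
    """Return, for each ascending bin of |rate difference|, the fraction of
    gene pairs whose expression profiles correlate at r >= r_min.

    profiles: dict gene -> expression vector (same tissue order for all genes)
    rates:    dict gene -> substitution rate (for example dN)
    """
    pairs = []
    for g, h in combinations(sorted(profiles), 2):
        diff = abs(rates[g] - rates[h])
        coexpressed = pearson(profiles[g], profiles[h]) >= r_min
        pairs.append((diff, coexpressed))
    if len(pairs) < n_bins:
        raise ValueError("need at least one gene pair per bin")
    pairs.sort(key=lambda p: p[0])
    fractions = []
    for b in range(n_bins):
        chunk = pairs[b * len(pairs) // n_bins:(b + 1) * len(pairs) // n_bins]
        fractions.append(sum(flag for _, flag in chunk) / len(chunk))
    return fractions

# Toy data (hypothetical): four genes, three tissues, two bins.
profiles = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [3, 2, 1], "d": [1, 3, 2]}
rates = {"a": 0.10, "b": 0.11, "c": 0.50, "d": 0.90}
fractions = fraction_coexpressed_by_bin(profiles, rates, n_bins=2)
print(fractions)
```

A negative rank correlation between the bin order and these fractions corresponds to the pattern the text describes, in which genes evolving at similar rates are coexpressed more often.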

The connection between gene coexpression and evolutionary conservation on a longer timescale was investigated by considering the expression data along with the patterns of phyletic distribution inferred from the recently developed collection of eukaryotic KOGs (Tatusov et al. 2003; Koonin et al. 2004). Each human gene was mapped to a specific KOG and assigned a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

FIG. 4. Two highly connected gene expression clusters from the human gene coexpression network. (a) Nodes correspond to genes and are labeled with LocusLink ID numbers. All links between genes correspond to coexpression at r ≥ 0.9. (b) Tissue-specific gene expression patterns of the cluster members. Genes (rows) are labeled with LocusLink ID numbers. Relative levels (log2 ratios) of expression are shown for each gene in each tissue: green cells show underexpression relative to the median, black cells show median expression levels, and red cells show overexpression.

FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. Panels: dN, dS, 3′ UTR d, and 5′ UTR d; vertical axis: node degree.

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences. Vertical axis: fraction of r ≥ 0.7.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10^-10). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures of human-mouse sequence divergence, dN, dS, d(5′ UTR), and d(3′ UTR), was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods). Horizontal axis: phyletic similarity; vertical axis: fraction of r ≥ 0.7.

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

| Comparison | R^a | P^b | n^c |
|---|---|---|---|
| dN vs. expression breadth | -1.000 | 5.5e-7 | 3906 |
| dS vs. expression breadth | -0.976 | 2.1e-5 | 3906 |
| 5′ UTR d vs. expression breadth | -0.539 | ns | 1121 |
| 3′ UTR d vs. expression breadth | -0.964 | 4.9e-5 | 1438 |
| dN vs. expression level | -1.000 | 5.5e-7 | 3906 |
| dS vs. expression level | -0.952 | 1.1e-4 | 3906 |
| 5′ UTR d vs. expression level | -0.624 | ns | 1121 |
| 3′ UTR d vs. expression level | -0.964 | 4.9e-5 | 1438 |
| dN vs. number coexpressed genes | -0.964 | 4.9e-5 | 2307 |
| dS vs. number coexpressed genes | -0.746 | 1.7e-2 | 2307 |
| 5′ UTR d vs. number coexpressed genes | -0.588 | ns | 1119 |
| 3′ UTR d vs. number coexpressed genes | -0.818 | 5.8e-3 | 1436 |
| dN difference vs. fraction r ≥ 0.7 | -0.976 | 2.1e-5 | 2659971 |
| dS difference vs. fraction r ≥ 0.7 | -0.249 | ns | 2659971 |
| 5′ UTR d difference vs. fraction r ≥ 0.7 | -0.812 | 5.8e-3 | 625521 |
| 3′ UTR d difference vs. fraction r ≥ 0.7 | -0.867 | 2.2e-3 | 1030330 |
| dN vs. expression divergence | 0.139 | ns | 551 |
| dS vs. expression divergence | -0.079 | ns | 551 |
| 5′ UTR d vs. expression divergence | 0.164 | ns | 273 |
| 3′ UTR d vs. expression divergence | -0.552 | ns | 382 |

a Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
b Level of significance (n = 10) associated with R values. ns = nonsignificant (i.e., P > 0.05).
c The total number of comparisons that were binned.
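The ortholog comparison described in this section treats each gene's relative expression across the shared tissues as a vector and correlates the human vector with the mouse vector. The sketch below illustrates that idea under stated assumptions: Pearson correlation is assumed as the r-value, the background set is taken to be every human-mouse pairing that is not an annotated ortholog pair, and the gene names, the four-tissue toy vectors, and the function name are hypothetical (the study used 19 tissues).

```python
import math
from statistics import median

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ortholog_and_background_r(human, mouse, orthologs):
    """Return (r-values for ortholog pairs, r-values for all other pairs).

    human, mouse: dict gene -> expression vector over the same ordered tissues
    orthologs:    list of (human_gene, mouse_gene) pairs
    """
    ortholog_set = set(orthologs)
    ortholog_r = [pearson(human[h], mouse[m]) for h, m in orthologs]
    background_r = [pearson(human[h], mouse[m])
                    for h in human for m in mouse
                    if (h, m) not in ortholog_set]
    return ortholog_r, background_r

# Toy data (hypothetical): two ortholog pairs measured in four tissues.
human = {"H1": [1, 2, 3, 4], "H2": [4, 3, 2, 1]}
mouse = {"M1": [2, 4, 6, 8], "M2": [8, 6, 4, 2]}
orthologs = [("H1", "M1"), ("H2", "M2")]

ortholog_r, background_r = ortholog_and_background_r(human, mouse, orthologs)
print(median(ortholog_r), median(background_r))
```

Expression divergence for a given ortholog pair can then be compared against dN, dS, d(5′ UTR), or d(3′ UTR) for the same pair, which is the comparison summarized in the last four rows of table 1.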

Conclusion

The results described here indicate that mammaliancoexpressed genes form a scale-free network whichprobably evolved by self-organization based on thepreferential attachment principle One possible mechanism

of this evolution is the duplication of genes together withtheir transcriptional control regions as suggested for theyeast coexpression network (Bhan Galas and Dewey2002) An examination of the contribution of paralogousgene pairs to the gene coexpression network analyzed hereindicates that such duplicated pairs (see Materials andMethods) are found approximately two times morefrequently than expected by chance (P 10210)However the fraction of coexpressed gene pairs that arerelated by duplication is small (028) suggesting thatduplication may not be a major force affecting thestructure of the gene coexpression network This findingseems to be consistent with an analysis of yeast protein

Table 2Examples of Two Human Gene Coexpression Network Clusters

Locus Link IDGenBank accessiona Gene Symbolb Annotationc Physiological Function

A Pancreas-Specific Gene Cluster

1208P16233 CLPS colipase preproprotein Pancreatic digestive enzyme1357NP_001859 CPA1 pancreatic carboxypeptidase A1 precursor Pancreatic digestive enzyme1358NP_001860 CPA2 pancreatic carboxypeptidase A2 precursor Pancreatic digestive enzyme2641P01275 GCG glucagon preproprotein Pancreatic islet hormone2813AAB19240 GP2 glycoprotein 2 (zymogen granule

membrane)The major membrane protein in thesecretory granule of theexocrine pancreas

2906AAB49992 GRIN2D N-methyl-D-aspartate receptor subunit2D precursor

Implicated in pancreatic hormonesecretionPMID 7768362

148223 NP_689695 hypothetical protein FLJ36666 Uncharacterized protein25803NP_036523 SPDEF SAM pointed domainndashcontaining ets

transcription factorProstate epithelium-specific transcriptionfactor

84444NP_115871 histone methyltransferase DOT1L Regulator of gene dynamics andchromatin activity

9436NP_004819 NCR2 natural cytotoxicity triggering receptorndash2 Mediator of killer cellsrsquocytotoxicity9610AAB67270 RIN1 ras inhibitor RIN1 A dominant negative inhibitor of the

RAS signaling pathway9753AAH41661 ZNF305 zinc finger proteinndash305 Predicted transcription regulator

A Testis-Specific Gene Cluster

6847Q15431 SYCP1 synaptonemal complex proteinndash1 Major component of the transversefilaments of synaptonemalcomplexes during meiotic prophase

10388 Q9BX26 SYCP2 synaptonemal complex proteinndash2 Major component of the axiallateralelements of synaptonemal complexesduring meiotic prophase

27285NP_055281 TEKT2 tektin 2 Major sperm microtubule protein8852NP_647450 AKAP4 A-kinase anchor proteinndash4 isoform 2 Major sperm sheath protein676 AAB87862 BRDT testis-specific bromodomain protein Testis-specific chromatin-associated protein

involved in chromatin remodeling11077 O75031 HSF2BP heat shock transcription factorndash2 binding

proteinImplicated in modulating HSF2activation in testis

7180P16562 CRISP2 testis-specific proteinndash1 Probable sperm-coating glycoprotein11116NP_919410 FGFR1OP FGFR1 oncogene partner isoform b Implicated in the proliferation and

differentiation of the erythroid lineage4291P58340 MLF1 myeloid leukemia factorndash1 Uncharacterized protein implicated in

cell proliferation5261AAH02541 PHKG2 phosphorylase kinase gamma 2 (testis) Testisliver-specific regulator of

glycogen metabolism54344 O94777 DPM3 dolichyl-phosphate mannosyltransferase

polypeptidendash3 isoform 2Regulator of dolichyl-phosphate mannosebiosynthesis Might be important forsperm maturation PMID 6231179

7288O00295 TULP2 tubby-like proteinndash2 Retina-specific and testis-specificheterotrimeric G-proteinndashresponsiveintracellular signaling factor

51460NP_057413 SFMBT1 Scm-like with four mbt domainsndash1 Polycomb group transcription regulator114049 NP_684281 WBSCR22 Williams Beuren syndrome chromosome

region 22 proteintRNA 5-cytosine methylase

a Locus Link identification numbers and GenBank accessions from NCBIrsquos LocusLink database httpwwwncbinlmnihgovLocusLinkindexhtmlb Official gene symbol from the Human Genome Organization (HUGO) httpwwwgeneuclacuknomenclaturec Official HUGO gene name


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and 3' UTR but not for the 5' UTR. Thus, the constraints on 5' UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that, although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.
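The hub result summarized above rests on a binned comparison: genes are sorted by substitution rate, grouped into 10 ascending bins, and the average node degree per bin is correlated with the bin's rate by a Spearman rank coefficient (table 1). The following is a minimal sketch of that kind of calculation, not the authors' code. It assumes equal-sized bins, and the function names and the toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def binned_means(rates, degrees, n_bins=10):
    """Sort genes by rate, split into n_bins ascending bins (assumed equal-sized),
    and return (mean rate, mean node degree) for each bin."""
    pairs = sorted(zip(rates, degrees))
    n = len(pairs)
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        bins.append((mean(r for r, _ in chunk), mean(d for _, d in chunk)))
    return bins


# Toy data (hypothetical): node degree falls as the substitution rate rises.
rates = [i / 100 for i in range(1, 101)]
degrees = [100 - i for i in range(1, 101)]
bins = binned_means(rates, degrees)
rho = spearman([r for r, _ in bins], [d for _, d in bins])
print(len(bins), round(rho, 3))
```

Because the coefficient is computed over 10 bin averages rather than over individual genes, it measures the consistency of the trend across bins, which is why values close to -1 can arise even when gene-level scatter is large.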

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggest rapid divergence of expression patterns after duplication. Once the expression patterns of homologous genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates. Panels: dN, dS, 3' UTR d, and 5' UTR d; y-axis: correlation.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth/level and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5' UTR)/d(3' UTR). In 43% of genes, d(5' UTR) > d(3' UTR), whereas in 57%, d(3' UTR) > d(5' UTR). (b) d(5' UTR)/dS. (c) d(3' UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.
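The caption of Supplementary Figure 3 refers to node degree distributions at a given r threshold and to the slope and r² of a linear trend line used to judge the fit to a power law. A minimal sketch of one common way to do this follows: count links per gene from a correlation matrix, then fit a least-squares line to the log-log frequency distribution. The exact fitting procedure used in the paper is not stated in this section, so the log10 transform, the unbinned counts, and all names and toy values here are assumptions.

```python
import math
from collections import Counter


def node_degrees(corr, threshold):
    """For each gene, count the other genes with correlation >= threshold."""
    n = len(corr)
    return [
        sum(1 for j in range(n) if j != i and corr[i][j] >= threshold)
        for i in range(n)
    ]


def loglog_fit(degrees):
    """Fit log10(count) = slope * log10(k) + c over degrees k > 0.
    Returns (slope, r_squared)."""
    counts = Counter(k for k in degrees if k > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Toy correlation matrix (hypothetical) for three genes.
toy_corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.75],
    [0.1, 0.75, 1.0],
]
toy_degrees = node_degrees(toy_corr, threshold=0.7)

# Toy degree list (hypothetical) built so that count(k) = 1000 / k**2.
power_law_degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
slope, r_squared = loglog_fit(power_law_degrees)
print(toy_degrees, round(slope, 3), round(r_squared, 3))
```

A straight line on log-log axes with r² near 1 is what the caption describes as a good fit to the power law; the slope estimates the (negative) exponent.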

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004


FIG. 5. The dependence between the node degree in the human coexpression network and substitution rate. Average (± standard error) numbers of coexpressed human genes (r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. Panels: dN, dS, 3' UTR d, and 5' UTR d; y-axis: node degree.

FIG. 6. Coexpressed human genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between human genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences. Y-axis: fraction r ≥ 0.7.

a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.
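To make the idea of comparing phyletic patterns concrete, the sketch below scores two presence/absence patterns over seven species with a Pearson correlation of their 0/1 vectors. The actual similarity measure is defined in Materials and Methods, which is not part of this section, so the choice of a correlation coefficient is an assumption made only for illustration (it is consistent with the range of similarity values, roughly -0.8 to 1.0, on the axis of fig. 7). The function name and the toy patterns are hypothetical.

```python
def phyletic_similarity(pattern_a, pattern_b):
    """Pearson correlation between two 0/1 presence/absence vectors.
    Returns None when either pattern is constant (correlation undefined)."""
    n = len(pattern_a)
    if n != len(pattern_b):
        raise ValueError("patterns must cover the same species")
    mean_a = sum(pattern_a) / n
    mean_b = sum(pattern_b) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in zip(pattern_a, pattern_b))
    saa = sum((a - mean_a) ** 2 for a in pattern_a)
    sbb = sum((b - mean_b) ** 2 for b in pattern_b)
    if saa == 0 or sbb == 0:
        return None
    return sab / (saa * sbb) ** 0.5


# Toy patterns (hypothetical) over seven species: 1 = ortholog present.
kog_x = [1, 1, 1, 0, 0, 0, 1]
kog_y = [1, 1, 1, 0, 0, 0, 1]   # identical gain-loss history
kog_z = [0, 0, 0, 1, 1, 1, 0]   # exactly complementary history
kog_all = [1, 1, 1, 1, 1, 1, 1]  # present everywhere, so no variation

print(phyletic_similarity(kog_x, kog_y))
print(phyletic_similarity(kog_x, kog_z))
print(phyletic_similarity(kog_x, kog_all))
```

Under this assumed measure, KOGs represented in all seven species carry no gain-loss signal, so their similarity is left undefined rather than set to an arbitrary value.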

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10^-10). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures (dN, dS, d(5' UTR), and d(3' UTR)) of human-mouse sequence divergence was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods). X-axis: phyletic similarity; y-axis: fraction r ≥ 0.7.

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

Comparison | R^a | P^b | n^c
dN vs. expression breadth | -1.000 | 5.5e-7 | 3906
dS vs. expression breadth | -0.976 | 2.1e-5 | 3906
5' UTR d vs. expression breadth | -0.539 | ns | 1121
3' UTR d vs. expression breadth | -0.964 | 4.9e-5 | 1438
dN vs. expression level | -1.000 | 5.5e-7 | 3906
dS vs. expression level | -0.952 | 1.1e-4 | 3906
5' UTR d vs. expression level | -0.624 | ns | 1121
3' UTR d vs. expression level | -0.964 | 4.9e-5 | 1438
dN vs. number coexpressed genes | -0.964 | 4.9e-5 | 2307
dS vs. number coexpressed genes | -0.746 | 1.7e-2 | 2307
5' UTR d vs. number coexpressed genes | -0.588 | ns | 1119
3' UTR d vs. number coexpressed genes | -0.818 | 5.8e-3 | 1436
dN difference vs. fraction r ≥ 0.7 | -0.976 | 2.1e-5 | 2659971
dS difference vs. fraction r ≥ 0.7 | -0.249 | ns | 2659971
5' UTR d difference vs. fraction r ≥ 0.7 | -0.812 | 5.8e-3 | 625521
3' UTR d difference vs. fraction r ≥ 0.7 | -0.867 | 2.2e-3 | 1030330
dN vs. expression divergence | 0.139 | ns | 551
dS vs. expression divergence | -0.079 | ns | 551
5' UTR d vs. expression divergence | 0.164 | ns | 273
3' UTR d vs. expression divergence | -0.552 | ns | 382

a Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
b Level of significance (n = 10) associated with R values. ns = nonsignificant (i.e., P > 0.05).
c The total number of comparisons that were binned.

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons. X-axis: r; y-axis: cumulative frequency.
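The ortholog comparison described in this section treats each gene's relative expression across the shared tissues as a vector and scores a human-mouse pair by the correlation coefficient r of the two vectors. A minimal sketch follows, assuming Pearson's r and using hypothetical gene identifiers and four toy tissues instead of the 19 used in the study. It contrasts the median r for orthologous pairs with the median for nonorthologous pairs.

```python
from itertools import product
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def profile_correlations(human, mouse):
    """Return (ortholog_r, nonortholog_r). Profiles sharing a key are
    treated as orthologs; every other human-mouse pairing is nonorthologous."""
    ortholog_r = {g: pearson(human[g], mouse[g]) for g in human if g in mouse}
    nonortholog_r = [
        pearson(human[h], mouse[m])
        for h, m in product(human, mouse)
        if h != m
    ]
    return ortholog_r, nonortholog_r


# Toy relative expression levels (hypothetical) across four shared tissues.
human = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, 3, 2, 4]}
mouse = {"g1": [2, 4, 6, 9], "g2": [8, 6, 5, 1], "g3": [1, 4, 2, 3]}

ortholog_r, nonortholog_r = profile_correlations(human, mouse)
print(round(median(ortholog_r.values()), 2), round(median(nonortholog_r), 2))
```

Expression divergence for an ortholog pair can then be related to sequence divergence by binning the pairs on dN, dS, or UTR distance and averaging r within each bin, as in fig. 9.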

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10^-10). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Table 2. Examples of Two Human Gene Coexpression Network Clusters

Locus Link ID/GenBank accession^a | Gene Symbol^b | Annotation^c | Physiological Function

A Pancreas-Specific Gene Cluster

1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion; PMID 7768362
148223/NP_689695 | - | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain–containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | - | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor–2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein–305 | Predicted transcription regulator

A Testis-Specific Gene Cluster

6847/Q15431 | SYCP1 | synaptonemal complex protein–1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein–2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein–4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor–2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein–1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor–1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide–3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation; PMID 6231179
7288/O00295 | TULP2 | tubby-like protein–2 | Retina-specific and testis-specific heterotrimeric G-protein–responsive intracellular signaling factor
51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains–1 | Polycomb group transcription regulator
114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase

a Locus Link identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
b Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/.
c Official HUGO gene name.


Darwin C 1859 On the origin of species John Murray LondonDuret L and D Mouchiroud 2000 Determinants of sub-

stitution rates in mammalian genes expression pattern affectsselection intensity but not mutation rate Mol Biol Evol1768ndash74

Eisen M B P T Spellman P O Brown and D Botstein1998 Cluster analysis and display of genome-wide expressionpatterns Proc Natl Acad Sci USA 9514863ndash14868

Farris J S 1977 Phylogenetic analysis under Dollorsquos law SystZool 2677ndash88

Fraser H B A E Hirsh L M Steinmetz C Scharfe andM W Feldman 2002 Evolutionary rate in the protein inter-action network Science 296750ndash752

Fraser H B D P Wall and A E Hirsh 2003 A simpledependence between protein evolution rate and the number ofprotein-protein interactions BMC Evol Biol 311

Giot L J S Bader C Brouwer et al 2003 A protein inter-action map of Drosophila melanogaster Science 30217271736

Gu Z D Nicolae H H Lu and W H Li 2002 Rapiddivergence in expression between duplicate genes inferredfrom microarray data Trends Genet 18609ndash613

Hedges S B H Chen S Kumar D Y Wang A S Thompsonand H Watanabe 2001 A genomic timescale for the origin ofeukaryotes BMC Evol Biol 14

Higgins D G J D Thompson and T J Gibson 1996 UsingCLUSTAL for multiple sequence alignments MethodsEnzymol 266383ndash402

Hirsh A E and H B Fraser 2001 Protein dispensability andrate of evolution Nature 4111046ndash1049

Ho Y A Gruhler A Heilbut et al 2002 Systematic iden-tification of protein complexes in Saccharomyces cerevisiaeby mass spectrometry Nature 415180ndash183

Evolution and the Human Gene Coexpression Network 2069

Dow

nloaded from httpsacadem

icoupcomm

bearticle211120581148025 by guest on 06 Decem

ber 2021

Hoyle DCMRattray R Jupp andABrass 2002Making senseof microarray data distributions Bioinformatics 18576ndash584

Jeong H S P Mason A L Barabasi and Z N Oltvai 2001Lethality and centrality in protein networks Nature 41141ndash42

Jeong H B Tombor R Albert Z N Oltvai and A LBarabasi 2000 The large-scale organization of metabolicnetworks Nature 407651ndash654

Jordan I K I B Rogozin Y I Wolf and E V Koonin 2002Essential genes are more evolutionarily conserved than arenonessential genes in bacteria Genome Res 12962ndash968

Jordan I K Y I Wolf and E V Koonin 2003 No simpledependence between protein evolution rate and the number ofprotein-protein interactions only the most prolific interactorstend to evolve slowly BMC Evol Biol 31

Jukes T H and C R Cantor 1969 Evolution of proteinmolecules Pp 21ndash132 in H N Munro ed Mammalianprotein metabolism Academic New York

Kamath R S A G Fraser Y Dong et al 2003 Systematicfunctional analysis of the Caenorhabditis elegans genomeusing RNAi Nature 421231ndash237

Karev G P Y I Wolf A Y Rzhetsky F S Berezovskayaand E V Koonin 2002 Birth and death of protein domainsa simple model of evolution explains power law behaviorBMC Evol Biol 218

Khaitovich P G Weiss M Lachmann I Hellman W EnardB Muetzel U Wiekner W Ansorge and S Paabo 2004 Aneutral model of transcriptome evolution PLOS Biol20682ndash0689

King M C and A C Wilson 1975 Evolution at two levels inhumans and chimpanzees Science 188107ndash116

Koonin E V N D Fedorova J D Jackson et al 2004 Acomprehensive evolutionary classification of proteins encodedin complete eukaryotic genomes Genome Biol 5R7

Koonin E V Y I Wolf and G P Karev 2002 The structureof the protein universe and genome evolution Nature420218ndash223

Krylov D M Y I Wolf I B Rogozin and E V Koonin2003 Gene loss protein sequence divergence gene dispens-ability expression level and interactivity are correlated ineukaryotic evolution Genome Res 132229ndash2235

Kuznetsov V A G D Knott and R F Bonner 2002 Generalstatistics of stochastic process of gene expression ineukaryotic cells Genetics 1611321ndash1332

Lander E S L M Linton B Birren et al 2001 Initialsequencing and analysis of the human genome Nature 409860ndash921

Li S C M Armstrong N Bertin et al 2004 A map of theinteractome network of the metazoan C elegans Science 303540ndash543

Li W H 1997 Molecular evolution Sinauer AssociatesSunderland Mass

Luscombe N M J Qian Z Zhang T Johnson and MGerstein 2002 The dominance of the population by a selected

few power-law behaviour applies to a wide variety ofgenomic properties Genome Biol3RESEARCH0040

Makova K D and W H Li 2003 Divergence in the spatialpattern of gene expression between human duplicate genesGenome Res 131638ndash1645

Nei M and T Gojobori 1986 Simple methods for estimatingthe numbers of synonymous and nonsynonymous nucleotidesubstitutions Mol Biol Evol 3418ndash426

Pal C B Papp and L D Hurst 2001 Highly expressed genesin yeast evolve slowly Genetics 158927ndash931

mdashmdashmdash 2003 Genomic function rate of evolution and genedispensability Nature 421496ndash497

Pennisi E 2003 Systems biology tracing lifersquos circuitryScience 3021646ndash1649

Ravasz E A L Somera D A Mongru Z N Oltvai andA L Barabasi 2002 Hierarchical organization of modularityin metabolic networks Science 2971551ndash1555

Rzhetsky A and S M Gomez 2001 Birth of scale-freemolecular networks and the number of distinct DNA andprotein domains per genome Bioinformatics 17988ndash996

Su A I M P Cooke K A Ching et al 2002 Large-scaleanalysis of the human and mouse transcriptomes Proc NatlAcad Sci USA 994465ndash4470

Tatusov R L N D Fedorova J D Jackson et al 2003 TheCOG database an updated version includes eukaryotes BMCBioinformatics 441

Ueda H R S Hayshi S Matsuyama T Yomo S HashimotoS A Kay J B Hogenesch and M Iino 2004 Universalityand flexibility in gene expression from bacteria to humanProc Natl Acad Sci USA 1013765ndash3769

Wagner A 2000 Decoupled evolution of coding region andmRNA expression patterns after gene duplication implica-tions for the neutralist-selectionist debate Proc Natl AcadSci USA 976579ndash6584

mdashmdashmdash 2003 How the global structure of protein interaction net-works evolves Proc R Soc Lond B Biol Sci 270457ndash466

Waterston R H K Lindblad-Toh E Birney et al 2002 Initialsequencing and comparative analysis of the mouse genomeNature 420520ndash562

Yanai I D Graur and R Ophir 2004 Incongruent expressionprofiles between human and mouse orthologous genes suggestwidespread neutral evolution of transcription control OMICS815ndash24

Yang J ZGu andWHLi 2003Rate of protein evolution versusfitness effect of gene deletion Mol Biol Evol 20772ndash774

Yang Z 1997 PAML a programpackage for phylogenetic analysisby maximum likelihood Comput Appl Biosci 13555ndash556

Zhang L and W H Li 2004 Mammalian housekeeping genesevolve more slowly than tissue-specific genes Mol BiolEvol 21236ndash239

Peer Bork Associate Editor

Accepted July 14 2004

2070 Jordan et al

Dow

nloaded from httpsacadem

icoupcomm

bearticle211120581148025 by guest on 06 Decem

ber 2021


a phyletic pattern, that is, the pattern of presence/absence of orthologous genes from the given KOG among seven eukaryotic species. The phyletic patterns of KOGs were compared to determine their similarity in terms of propensities for gene gain and loss over the approximately 1.8 billion years of eukaryotic crown group evolution (Krylov et al. 2003). A statistically significant positive correlation was detected between the similarity of evolutionary patterns of gain-loss and the level of coexpression among pairs of human genes (fig. 7 and table 1). Thus, coexpressed genes tend to evolve under similar long-term evolutionary constraints.

Human-Mouse Sequence Divergence Versus Expression Profile Divergence

Expression data collected for human and mouse were further compared to assess levels of regulatory divergence (divergence of expression profiles) between orthologous genes in the two species. From the original series of microarray experiments (Su et al. 2002), 19 tissues that were studied in both human and mouse were identified. Relative levels of gene expression across these tissues in both species were taken as vectors, and pairs of vectors for human-mouse orthologous genes were compared to measure the extent of regulatory divergence between species. In general agreement with the original report (Su et al. 2002), the r-values for orthologs displayed a narrow distribution with the median at approximately 0.4. This was in sharp contrast to the distribution of r-values for nonorthologous human and mouse genes, which had a median of approximately 0 (fig. 8). The difference between the two distributions is highly statistically significant (P < 10⁻¹⁰). Surprisingly, however, when the resulting r-values were compared with the levels of sequence divergence to determine whether sequence and regulatory divergence were correlated, no significant correlation between the human-mouse expression pattern divergence and the four different measures, dN, dS, d(5′ UTR), and d(3′ UTR), of human-mouse sequence divergence was detected (fig. 9 and table 1). This finding stands in contrast with the results of the recent analysis of the connection between the sequence and expression divergence between paralogous human genes, where a significant negative correlation was observed between the correlation coefficients of the expression profiles and sequence divergence (Makova and Li 2003). However, an earlier study of yeast duplicate genes found no correlation

[Figure 7 plot: x-axis, phyletic similarity (10 ascending bins); y-axis, fraction of gene pairs with r ≥ 0.7]

FIG. 7. Coexpressed human genes have similar phyletic patterns. The fractions of coexpressed gene pairs (r ≥ 0.7) are shown for 10 ascending bins of phyletic similarity values (see Materials and Methods).

Table 1. Evolutionary Rate Versus Gene Expression Comparisons

| Comparison | R (a) | P (b) | n (c) |
|---|---|---|---|
| dN vs. expression breadth | -1.000 | 5.5e-7 | 3,906 |
| dS vs. expression breadth | -0.976 | 2.1e-5 | 3,906 |
| 5′ UTR d vs. expression breadth | -0.539 | ns | 1,121 |
| 3′ UTR d vs. expression breadth | -0.964 | 4.9e-5 | 1,438 |
| dN vs. expression level | -1.000 | 5.5e-7 | 3,906 |
| dS vs. expression level | -0.952 | 1.1e-4 | 3,906 |
| 5′ UTR d vs. expression level | -0.624 | ns | 1,121 |
| 3′ UTR d vs. expression level | -0.964 | 4.9e-5 | 1,438 |
| dN vs. number coexpressed genes | -0.964 | 4.9e-5 | 2,307 |
| dS vs. number coexpressed genes | -0.746 | 1.7e-2 | 2,307 |
| 5′ UTR d vs. number coexpressed genes | -0.588 | ns | 1,119 |
| 3′ UTR d vs. number coexpressed genes | -0.818 | 5.8e-3 | 1,436 |
| dN difference vs. fraction r ≥ 0.7 | -0.976 | 2.1e-5 | 2,659,971 |
| dS difference vs. fraction r ≥ 0.7 | -0.249 | ns | 2,659,971 |
| 5′ UTR d difference vs. fraction r ≥ 0.7 | -0.812 | 5.8e-3 | 625,521 |
| 3′ UTR d difference vs. fraction r ≥ 0.7 | -0.867 | 2.2e-3 | 1,030,330 |
| dN vs. expression divergence | 0.139 | ns | 551 |
| dS vs. expression divergence | -0.079 | ns | 551 |
| 5′ UTR d vs. expression divergence | 0.164 | ns | 273 |
| 3′ UTR d vs. expression divergence | -0.552 | ns | 382 |

(a) Spearman rank correlation coefficient for the average expression parameter value against the 10 bins of substitution rate values (see figures 1 and 3–5).
(b) Level of significance (n = 10) associated with R values; ns = nonsignificant (i.e., P > 0.05).
(c) The total number of comparisons that were binned.
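The binned rank correlations summarized in table 1 can be illustrated with a minimal sketch. It assumes that genes are sorted by substitution rate and split into 10 equal-sized ascending bins, and that the Spearman coefficient is then computed between bin order and the per-bin average; the equal-sized binning and all function names (`average_ranks`, `binned_means`, `binned_spearman`) are illustrative assumptions, not the authors' code.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def binned_means(rates, values, n_bins=10):
    """Sort genes by substitution rate and average `values` within each
    of n_bins equal-sized ascending bins (assumed binning scheme)."""
    pairs = sorted(zip(rates, values))
    n = len(pairs)
    return [
        mean(v for _, v in pairs[b * n // n_bins:(b + 1) * n // n_bins])
        for b in range(n_bins)
    ]


def binned_spearman(rates, values, n_bins=10):
    """Spearman R between bin order (1..n_bins) and the per-bin averages."""
    means = binned_means(rates, values, n_bins)
    return spearman(list(range(1, n_bins + 1)), means)
```

With only 10 bins, a perfectly monotonic trend in the bin averages gives R = -1 or R = 1 regardless of how noisy the individual genes are, which is why the table reports both R and the total number of comparisons that were binned.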

[Figure 8 plot: x-axis, r (-1 to 1); y-axis, cumulative frequency (0 to 1)]

FIG. 8. Cumulative distributions of pairwise correlation coefficient values (r). The black line is the distribution for the comparisons of nonorthologous genes within the human and mouse gene sets. The gray line is the distribution for orthologous human-mouse gene comparisons.


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.
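The comparison of ortholog and nonortholog expression profiles described in this section can be sketched as follows. The sketch assumes that the r-values are Pearson correlation coefficients between expression vectors over the shared tissues, and that the nonorthologous background consists of every human-mouse gene pair not listed as orthologs; the function name `profile_correlations` and the toy data (four tissues, two genes per species) are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two expression vectors of equal length.
    Undefined (division by zero) if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def profile_correlations(human, mouse, orthologs):
    """Return (r-values for ortholog pairs, r-values for all other human-mouse pairs).

    human, mouse: dict mapping gene name -> expression levels over the same tissues.
    orthologs: list of (human_gene, mouse_gene) pairs.
    """
    ortholog_set = set(orthologs)
    r_orthologs = [pearson(human[h], mouse[m]) for h, m in orthologs]
    r_background = [
        pearson(human[h], mouse[m])
        for h in human
        for m in mouse
        if (h, m) not in ortholog_set
    ]
    return r_orthologs, r_background


# Toy data: expression levels across four shared tissues (invented values).
human = {"H1": [10, 1, 1, 1], "H2": [1, 1, 10, 1]}
mouse = {"M1": [8, 2, 1, 1], "M2": [1, 2, 9, 1]}
orthologs = [("H1", "M1"), ("H2", "M2")]

r_orthologs, r_background = profile_correlations(human, mouse, orthologs)
print("median r, orthologs:", median(r_orthologs))
print("median r, nonorthologs:", median(r_background))
```

Because the Pearson coefficient is invariant to rescaling, it makes no difference here whether absolute or relative expression levels are used as the vector entries.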

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10⁻¹⁰). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2. Examples of Two Human Gene Coexpression Network Clusters

A Pancreas-Specific Gene Cluster

| Locus Link ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function |
|---|---|---|---|
| 1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme |
| 1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme |
| 1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme |
| 2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone |
| 2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas |
| 2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion (PMID: 7768362) |
| 148223/NP_689695 | | hypothetical protein FLJ36666 | Uncharacterized protein |
| 25803/NP_036523 | SPDEF | SAM pointed domain-containing ets transcription factor | Prostate epithelium-specific transcription factor |
| 84444/NP_115871 | | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity |
| 9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor-2 | Mediator of killer cells' cytotoxicity |
| 9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway |
| 9753/AAH41661 | ZNF305 | zinc finger protein-305 | Predicted transcription regulator |

A Testis-Specific Gene Cluster

| Locus Link ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function |
|---|---|---|---|
| 6847/Q15431 | SYCP1 | synaptonemal complex protein-1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase |
| 10388/Q9BX26 | SYCP2 | synaptonemal complex protein-2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase |
| 27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein |
| 8852/NP_647450 | AKAP4 | A-kinase anchor protein-4 isoform 2 | Major sperm sheath protein |
| 676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling |
| 11077/O75031 | HSF2BP | heat shock transcription factor-2 binding protein | Implicated in modulating HSF2 activation in testis |
| 7180/P16562 | CRISP2 | testis-specific protein-1 | Probable sperm-coating glycoprotein |
| 11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage |
| 4291/P58340 | MLF1 | myeloid leukemia factor-1 | Uncharacterized protein implicated in cell proliferation |
| 5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism |
| 54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide-3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation (PMID: 6231179) |
| 7288/O00295 | TULP2 | tubby-like protein-2 | Retina-specific and testis-specific heterotrimeric G-protein-responsive intracellular signaling factor |
| 51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains-1 | Polycomb group transcription regulator |
| 114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase |

(a) Locus Link identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
(b) Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/.
(c) Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and 3′ UTR, but not for the 5′ UTR. Thus, the constraints on 5′ UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that, although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggests rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

[Figure 9 plots: four panels (dN, dS, 5′ UTR d, 3′ UTR d); x-axis, substitution rate (10 ascending bins); y-axis, correlation]

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates.


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth, level, and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5′ UTR)/d(3′ UTR). In 43% of genes, d(5′ UTR) > d(3′ UTR), whereas in 57%, d(3′ UTR) > d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.
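The kind of fit described for Supplementary Figure 3 can be sketched as an ordinary least-squares line through the degree distribution on log-log axes. This is a minimal illustration that assumes base-10 logarithms and an unweighted fit over all observed degrees; the function names `degree_distribution` and `fit_power_law` are hypothetical, and the fitting details actually used for the figure are not specified in the caption.

```python
from collections import Counter
from math import log10


def degree_distribution(edges):
    """Map each node degree k to the number of nodes with k links.

    edges: iterable of (gene_a, gene_b) pairs judged to be coexpressed,
    for example pairs whose expression profiles correlate at r >= 0.7.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def fit_power_law(distribution):
    """Least-squares line through (log10 k, log10 count).

    Returns (slope, r_squared). For a power law, count ~ k**slope with a
    negative slope. Needs at least two distinct degrees and non-constant counts.
    """
    xs = [log10(k) for k in distribution]
    ys = [log10(count) for count in distribution.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared
```

A straight line on log-log axes is the conventional quick check for scale-free behavior, although an unweighted fit of this kind is sensitive to the sparsely populated high-degree tail.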

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r ≥ 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree for r ≥ 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values where r ≥ 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork Associate Editor

Accepted July 14 2004


between gene sequence and gene expression pattern divergence (Wagner 2000). The apparent disparity between orthologs and paralogs in terms of expression evolution in mammals seems to suggest unexpected differences in the evolutionary modes and deserves further investigation.

Conclusion

The results described here indicate that mammalian coexpressed genes form a scale-free network, which probably evolved by self-organization based on the preferential attachment principle. One possible mechanism of this evolution is the duplication of genes together with their transcriptional control regions, as suggested for the yeast coexpression network (Bhan, Galas, and Dewey 2002). An examination of the contribution of paralogous gene pairs to the gene coexpression network analyzed here indicates that such duplicated pairs (see Materials and Methods) are found approximately two times more frequently than expected by chance (P < 10^-10). However, the fraction of coexpressed gene pairs that are related by duplication is small (0.28%), suggesting that duplication may not be a major force affecting the structure of the gene coexpression network. This finding seems to be consistent with an analysis of yeast protein

Table 2. Examples of Two Human Gene Coexpression Network Clusters

LocusLink ID/GenBank accession (a) | Gene Symbol (b) | Annotation (c) | Physiological Function

A Pancreas-Specific Gene Cluster

1208/P16233 | CLPS | colipase preproprotein | Pancreatic digestive enzyme
1357/NP_001859 | CPA1 | pancreatic carboxypeptidase A1 precursor | Pancreatic digestive enzyme
1358/NP_001860 | CPA2 | pancreatic carboxypeptidase A2 precursor | Pancreatic digestive enzyme
2641/P01275 | GCG | glucagon preproprotein | Pancreatic islet hormone
2813/AAB19240 | GP2 | glycoprotein 2 (zymogen granule membrane) | The major membrane protein in the secretory granule of the exocrine pancreas
2906/AAB49992 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D precursor | Implicated in pancreatic hormone secretion; PMID: 7768362
148223/NP_689695 | | hypothetical protein FLJ36666 | Uncharacterized protein
25803/NP_036523 | SPDEF | SAM pointed domain–containing ets transcription factor | Prostate epithelium-specific transcription factor
84444/NP_115871 | | histone methyltransferase DOT1L | Regulator of gene dynamics and chromatin activity
9436/NP_004819 | NCR2 | natural cytotoxicity triggering receptor–2 | Mediator of killer cells' cytotoxicity
9610/AAB67270 | RIN1 | ras inhibitor RIN1 | A dominant negative inhibitor of the RAS signaling pathway
9753/AAH41661 | ZNF305 | zinc finger protein–305 | Predicted transcription regulator

A Testis-Specific Gene Cluster

6847/Q15431 | SYCP1 | synaptonemal complex protein–1 | Major component of the transverse filaments of synaptonemal complexes during meiotic prophase
10388/Q9BX26 | SYCP2 | synaptonemal complex protein–2 | Major component of the axial/lateral elements of synaptonemal complexes during meiotic prophase
27285/NP_055281 | TEKT2 | tektin 2 | Major sperm microtubule protein
8852/NP_647450 | AKAP4 | A-kinase anchor protein–4 isoform 2 | Major sperm sheath protein
676/AAB87862 | BRDT | testis-specific bromodomain protein | Testis-specific chromatin-associated protein involved in chromatin remodeling
11077/O75031 | HSF2BP | heat shock transcription factor–2 binding protein | Implicated in modulating HSF2 activation in testis
7180/P16562 | CRISP2 | testis-specific protein–1 | Probable sperm-coating glycoprotein
11116/NP_919410 | FGFR1OP | FGFR1 oncogene partner isoform b | Implicated in the proliferation and differentiation of the erythroid lineage
4291/P58340 | MLF1 | myeloid leukemia factor–1 | Uncharacterized protein implicated in cell proliferation
5261/AAH02541 | PHKG2 | phosphorylase kinase gamma 2 (testis) | Testis/liver-specific regulator of glycogen metabolism
54344/O94777 | DPM3 | dolichyl-phosphate mannosyltransferase polypeptide–3 isoform 2 | Regulator of dolichyl-phosphate mannose biosynthesis. Might be important for sperm maturation; PMID: 6231179
7288/O00295 | TULP2 | tubby-like protein–2 | Retina-specific and testis-specific heterotrimeric G-protein–responsive intracellular signaling factor
51460/NP_057413 | SFMBT1 | Scm-like with four mbt domains–1 | Polycomb group transcription regulator
114049/NP_684281 | WBSCR22 | Williams Beuren syndrome chromosome region 22 protein | tRNA 5-cytosine methylase

(a) LocusLink identification numbers and GenBank accessions from NCBI's LocusLink database: http://www.ncbi.nlm.nih.gov/LocusLink/index.html.
(b) Official gene symbol from the Human Genome Organization (HUGO): http://www.gene.ucl.ac.uk/nomenclature/.
(c) Official HUGO gene name.


interaction networks, which showed that the network structure is not predominantly shaped by duplication (Wagner 2003).
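The comparison of observed and chance-expected frequencies of paralogous pairs among coexpressed pairs can be illustrated with a minimal Python sketch. It assumes that the chance expectation is simply the fraction of all possible gene pairs that are paralogous; the function name, the toy gene identifiers, and this null model are illustrative assumptions and not the procedure given in Materials and Methods.

```python
def paralog_enrichment(coexpressed_pairs, paralog_pairs, n_genes):
    """Return (observed, expected, fold) for paralogs among coexpressed pairs.

    Assumption (not from the paper): the chance expectation is the share of
    paralogous pairs among all n_genes * (n_genes - 1) / 2 possible pairs.
    """
    # Treat pairs as unordered so that (a, b) and (b, a) count once.
    coexpressed = {frozenset(p) for p in coexpressed_pairs}
    paralogs = {frozenset(p) for p in paralog_pairs}
    total_pairs = n_genes * (n_genes - 1) // 2

    observed = len(coexpressed & paralogs) / len(coexpressed)
    expected = len(paralogs) / total_pairs
    return observed, expected, observed / expected


# Toy data: 5 hypothetical genes, 2 paralogous pairs, 2 coexpressed pairs.
genes = ["g1", "g2", "g3", "g4", "g5"]
paralog_pairs = [("g1", "g2"), ("g3", "g4")]
coexpressed_pairs = [("g2", "g1"), ("g1", "g3")]
observed, expected, fold = paralog_enrichment(
    coexpressed_pairs, paralog_pairs, len(genes)
)
print(f"observed={observed:.2f} expected={expected:.2f} fold={fold:.2f}")
```

A fold value near 2 with a small observed fraction would correspond to the situation described in the text: paralogs are enriched among coexpressed pairs yet account for few of them. The sketch does not compute a P value.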

Our results show that overall levels of expression and more specific topological properties of the gene coexpression network are clearly related to the rates of sequence evolution: highly connected network hubs tend to evolve slowly, and genes that are coexpressed show a strong tendency to evolve at similar rates. These relationships most likely reflect purifying selection, which appears to act primarily at the level of the protein sequence. Generally, these results are compatible with the notion of greater biological importance of highly connected nodes of biological networks (Jeong et al. 2001). However, two of the findings reported here appear to be distinctly surprising. Firstly, we found that the connection between gene coexpression network topology and sequence evolution held for both nonsynonymous and synonymous sites in the coding sequence and 3′ UTR but not for the 5′ UTR. Thus, the constraints on 5′ UTR evolution appear to be unrelated to expression regulation as analyzed here. Secondly, we found that, although the expression profiles of orthologs tend to be strongly correlated, they diverged at roughly the same rate across a wide range of sequence evolution rates. Thus, unlike the properties of the gene coexpression network, evolution of an individual gene's expression regulation after speciation seems to be uncoupled from the functional constraints on the gene's sequence. This is generally compatible with the pregenomic notion that regulatory evolution could be the decisive factor of biological diversification between species (Britten and Davidson 1969; King and Wilson 1975) and can be taken to suggest that sequence divergence and expression pattern divergence are governed by distinct forces.
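The binned comparison behind the second finding can be sketched in Python as follows. The sketch sorts orthologous gene pairs by substitution rate, splits them into ascending bins, and reports the mean and standard error of the human-mouse expression profile correlation in each bin. Equal-count bins, the function name, and the toy numbers are assumptions made for illustration; the binning used in the paper may differ.

```python
import math
import statistics


def binned_means(rates, values, n_bins=10):
    """Average `values` within ascending, equal-count bins of `rates`.

    Returns a list of (mean_rate, mean_value, standard_error) tuples.
    Equal-count bins are an assumption of this sketch.
    """
    pairs = sorted(zip(rates, values))
    n = len(pairs)
    summary = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue  # fewer data points than bins
        bin_rates = [r for r, _ in chunk]
        bin_values = [v for _, v in chunk]
        # The standard error needs at least two observations per bin.
        if len(bin_values) > 1:
            sem = statistics.stdev(bin_values) / math.sqrt(len(bin_values))
        else:
            sem = float("nan")
        summary.append(
            (statistics.mean(bin_rates), statistics.mean(bin_values), sem)
        )
    return summary


# Toy input: 20 hypothetical orthologous pairs whose expression correlation
# does not depend on the substitution rate.
toy_rates = [i / 100 for i in range(20)]
toy_r = [0.4, 0.6] * 10
for mean_rate, mean_r, sem in binned_means(toy_rates, toy_r):
    print(f"rate={mean_rate:.3f}  r={mean_r:.2f} +/- {sem:.2f}")
```

A flat profile of bin means across increasing substitution rates, as in the toy data, is the pattern that would indicate no dependence between sequence divergence and expression divergence.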

Recent studies have reported the rapid diversification of gene expression patterns and suggested that this might reflect neutral evolution of transcriptional regulation (Khaitovich et al. 2004; Yanai, Graur, and Ophir 2004). The lack of correlation between sequence divergence and expression profile divergence demonstrated here could be deemed as compatible with the neutral model, whereby transcription profiles of orthologs diverge rapidly upon speciation until they reach a basal level of similarity that is then maintained by purifying selection. However, it is tempting to speculate as to a distinct role for natural selection in driving expression pattern divergence. Perhaps, whereas sequence divergence levels are determined largely by purifying selection, the effects of adaptive (diversifying) selection are more prevalent at the level of gene expression. Under this hypothetical scenario, although the basal level of correlation between orthologs tends to persist, probably reflecting the conservation of the general function, the aspects of the gene expression patterns that reflect species-specific changes, governed in part by short, degenerate transcription factor–binding sites, would be expected to diverge rapidly. Indeed, our analysis of the relationship between gene coexpression in this work and gene duplication, as well as previous results (Gu et al. 2002), suggest rapid divergence of expression patterns after duplication. Once the expression patterns of homologous

FIG. 9. The lack of dependence between human-mouse gene regulatory divergence and sequence divergence. Average correlation coefficient values (r) between expression profiles of human-mouse orthologous genes are shown for 10 ascending bins of human-mouse orthologous gene substitution rates. [Plot not reproduced; y-axis: correlation (0.1 to 0.6); x-axes: dN, dS, 5′ UTR d, and 3′ UTR d.]


genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth/level and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios are shown in log scale. (a) d(5′ UTR)/d(3′ UTR). In 43% of genes, d(5′ UTR) > d(3′ UTR), whereas in 57%, d(3′ UTR) > d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.
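As a rough illustration of the fit described in this legend, the following Python sketch links genes whose expression profiles exceed a correlation threshold, tabulates the number of links per node, and fits a straight line to the log-log frequency distribution. The helper names, the strict "greater than" cutoff, and ordinary least squares on log10-transformed counts are assumptions of this sketch rather than details taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def node_degrees(profiles, threshold):
    """Count coexpressed partners per gene; `profiles` maps gene -> values."""
    degree = {gene: 0 for gene in profiles}
    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) > threshold:
            degree[a] += 1
            degree[b] += 1
    return degree


def loglog_fit(degrees):
    """Fit log10(frequency) against log10(degree); return (slope, r_squared).

    Requires at least two distinct nonzero degrees.
    """
    counts = Counter(k for k in degrees if k > 0)  # log is undefined at k = 0
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, pearson(xs, ys) ** 2


# Toy degree list whose frequencies follow count(k) = 1000 * k**-2.
toy_degrees = [1] * 1000 + [2] * 250 + [10] * 10
slope, r_squared = loglog_fit(toy_degrees)
print(f"slope={slope:.2f}  r2={r_squared:.3f}")
```

For real data, `loglog_fit(node_degrees(profiles, threshold).values())` would give one slope and r² per threshold, matching the per-network values mentioned in the legend.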

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network clusters (r threshold of 0.9). Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree at an r threshold of 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values between mouse genes that pass the r threshold of 0.7 are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004


Pennisi E 2003 Systems biology tracing lifersquos circuitryScience 3021646ndash1649

Ravasz E A L Somera D A Mongru Z N Oltvai andA L Barabasi 2002 Hierarchical organization of modularityin metabolic networks Science 2971551ndash1555

Rzhetsky A and S M Gomez 2001 Birth of scale-freemolecular networks and the number of distinct DNA andprotein domains per genome Bioinformatics 17988ndash996

Su A I M P Cooke K A Ching et al 2002 Large-scaleanalysis of the human and mouse transcriptomes Proc NatlAcad Sci USA 994465ndash4470

Tatusov R L N D Fedorova J D Jackson et al 2003 TheCOG database an updated version includes eukaryotes BMCBioinformatics 441

Ueda H R S Hayshi S Matsuyama T Yomo S HashimotoS A Kay J B Hogenesch and M Iino 2004 Universalityand flexibility in gene expression from bacteria to humanProc Natl Acad Sci USA 1013765ndash3769

Wagner A 2000 Decoupled evolution of coding region andmRNA expression patterns after gene duplication implica-tions for the neutralist-selectionist debate Proc Natl AcadSci USA 976579ndash6584

mdashmdashmdash 2003 How the global structure of protein interaction net-works evolves Proc R Soc Lond B Biol Sci 270457ndash466

Waterston R H K Lindblad-Toh E Birney et al 2002 Initialsequencing and comparative analysis of the mouse genomeNature 420520ndash562

Yanai I D Graur and R Ophir 2004 Incongruent expressionprofiles between human and mouse orthologous genes suggestwidespread neutral evolution of transcription control OMICS815ndash24

Yang J ZGu andWHLi 2003Rate of protein evolution versusfitness effect of gene deletion Mol Biol Evol 20772ndash774

Yang Z 1997 PAML a programpackage for phylogenetic analysisby maximum likelihood Comput Appl Biosci 13555ndash556

Zhang L and W H Li 2004 Mammalian housekeeping genesevolve more slowly than tissue-specific genes Mol BiolEvol 21236ndash239

Peer Bork Associate Editor

Accepted July 14 2004

2070 Jordan et al

Dow

nloaded from httpsacadem

icoupcomm

bearticle211120581148025 by guest on 06 Decem

ber 2021

Page 12: Conservation and Coevolution in the Scale-Free Human Gene

genes (paralogs or orthologs) diverge, convergent evolution, which is a hallmark of adaptation and is widely evident for phenotypic characters, could cause unrelated genes to achieve similar expression patterns. This convergence could be related to the rapid de novo generation of transcription factor–binding sites in promoters that lead to slight changes in expression patterns. If the functionally relevant aspects of the ancestral expression pattern are maintained during this process, subtle expression pattern changes would be simultaneously invisible to purifying selection and serve as the raw material for repeated trials of adaptive selection. This hypothesis yields a specific prediction with regard to the evolution of promoter regions that is currently being investigated. Expression profiles are predicted to be more correlated with the distributions of specific transcription factor–binding sites along promoters than with the overall sequence divergence (i.e., relatedness) between promoters.

Acknowledgments

We thank Igor Rogozin, David Landsman, Fyodor Kondrashov, Itai Yanai, and Katerina Makova for helpful discussions.

Supplementary Material

Supplementary Figure 1. The dependence between expression breadth, level, and substitution rates of mouse genes. (a–d) Average (± standard error) expression breadth values for 10 ascending bins of human-mouse orthologous gene substitution rates. (e–h) Average (± standard error) expression level values for 10 ascending bins of human-mouse orthologous gene substitution rates.

Supplementary Figure 2. Frequency distributions of human-mouse substitution rate ratios. Ratios shown in log scale. (a) d(5′ UTR)/d(3′ UTR). d(5′ UTR)/d(3′ UTR) is shown in log scale. In 43% of genes d(5′ UTR) d(3′ UTR), whereas in 57% d(3′ UTR) d(5′ UTR). (b) d(5′ UTR)/dS. (c) d(3′ UTR)/dS.

Supplementary Figure 3. Frequency distributions of the number of links per node in human gene coexpression networks. For each network, the correlation coefficient value (r) threshold used to determine whether any two genes (nodes) are considered to be coexpressed (linked) is shown above the plot. The slope of the linear trend line that fits the data and the r² value indicating the goodness of fit to the power law are shown for each network plot.

Supplementary Figure 4. Tissue-specific expression patterns for human gene coexpression network (r threshold of 0.9) clusters. Cluster expression patterns: 1, adult and fetal liver; 2, fetal liver; 3, testis; 4, pancreas; 5, testis and umbilical vein; 6, various brain tissues; 7, salivary gland; 8, adult liver; 9, thymus; 10, spleen; 11, lung; 12, cerebellum.

Supplementary Figure 5. Relationship between the mouse coexpression network topology and substitution rates. (a) The dependence between the node degree and substitution rate. Average (± standard error) numbers of coexpressed genes (node degree at the r threshold of 0.7) per gene for 10 ascending bins of human-mouse orthologous gene substitution rates are shown. (b) Coexpressed mouse genes have similar substitution rates. Fractions of pairwise correlation coefficient values meeting the r threshold of 0.7 between mouse genes are shown for 10 ascending bins of pairwise human-mouse orthologous gene substitution rate differences.
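The quantities named in these captions (links per node under a correlation threshold, a linear trend line on the log-log degree distribution, and averages with standard errors over 10 ascending bins) can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the inclusive `r >= r_threshold` comparison, and the equal-count binning are assumptions made only for illustration.

```python
import numpy as np

def coexpression_degrees(expr, r_threshold=0.7):
    """Node degree per gene in a thresholded coexpression network.

    expr: 2-D array with one row per gene and one column per tissue.
    Assumption: two genes are linked when their Pearson r >= r_threshold.
    """
    r = np.corrcoef(expr)          # gene-by-gene Pearson correlation matrix
    np.fill_diagonal(r, -1.0)      # exclude self-links
    return (r >= r_threshold).sum(axis=1)

def loglog_fit(degrees):
    """Slope and r^2 of a line fitted to log10(count) versus log10(degree)."""
    k, counts = np.unique(degrees[degrees > 0], return_counts=True)
    x, y = np.log10(k), np.log10(counts)
    slope, _intercept = np.polyfit(x, y, 1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return slope, r2

def binned_mean_se(rates, values, n_bins=10):
    """Mean and standard error of `values` over ascending bins of `rates`.

    Assumption: bins hold (nearly) equal numbers of genes.
    """
    order = np.argsort(rates)
    bins = np.array_split(order, n_bins)
    means = np.array([values[b].mean() for b in bins])
    ses = np.array([values[b].std(ddof=1) / np.sqrt(len(b)) for b in bins])
    return means, ses
```

With real data, `expr` would hold the tissue expression profiles, `coexpression_degrees(expr, 0.7)` would give the node degrees, and `binned_mean_se(rates, degrees)` would give the per-bin averages of the kind plotted in panel (a) of Supplementary Figure 5.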

Literature Cited

Agrawal, H. 2002. Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett. 89:268702.

Ashburner, M., C. A. Ball, J. A. Blake, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

Bergmann, S., J. Ihmels, and N. Barkai. 2004. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2:0085–0093.

Bhan, A., D. J. Galas, and T. G. Dewey. 2002. A duplication growth model of gene expression networks. Bioinformatics 18:1486–1493.

Bloom, J. D., and C. Adami. 2003. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol. Biol. 3:21.

Britten, R. J., and E. H. Davidson. 1969. Gene regulation for higher cells: a theory. Science 165:349–357.

Carbone, A., A. Zinovyev, and F. Kepes. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015.

Darwin, C. 1859. On the origin of species. John Murray, London.

Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–14868.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's law. Syst. Zool. 26:77–88.

Fraser, H. B., A. E. Hirsh, L. M. Steinmetz, C. Scharfe, and M. W. Feldman. 2002. Evolutionary rate in the protein interaction network. Science 296:750–752.

Fraser, H. B., D. P. Wall, and A. E. Hirsh. 2003. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol. Biol. 3:11.

Giot, L., J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302:1727–1736.

Gu, Z., D. Nicolae, H. H. Lu, and W. H. Li. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18:609–613.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evol. Biol. 1:4.

Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383–402.

Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046–1049.

Ho, Y., A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.


Hoyle, D. C., M. Rattray, R. Jupp, and A. Brass. 2002. Making sense of microarray data distributions. Bioinformatics 18:576–584.

Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

Jeong, H., B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. 2000. The large-scale organization of metabolic networks. Nature 407:651–654.

Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12:962–968.

Jordan, I. K., Y. I. Wolf, and E. V. Koonin. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3:1.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic, New York.

Kamath, R. S., A. G. Fraser, Y. Dong, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.

Karev, G. P., Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2:18.

Khaitovich, P., G. Weiss, M. Lachmann, I. Hellman, W. Enard, B. Muetzel, U. Wiekner, W. Ansorge, and S. Paabo. 2004. A neutral model of transcriptome evolution. PLoS Biol. 2:0682–0689.

King, M. C., and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107–116.

Koonin, E. V., N. D. Fedorova, J. D. Jackson, et al. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature 420:218–223.

Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.

Lander, E. S., L. M. Linton, B. Birren, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

Li, S., C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540–543.

Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:RESEARCH0040.

Makova, K. D., and W. H. Li. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 13:1638–1645.

Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–931.

Pal, C., B. Papp, and L. D. Hurst. 2003. Genomic function: rate of evolution and gene dispensability. Nature 421:496–497.

Pennisi, E. 2003. Systems biology: tracing life's circuitry. Science 302:1646–1649.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555.

Rzhetsky, A., and S. M. Gomez. 2001. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17:988–996.

Su, A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

Ueda, H. R., S. Hayshi, S. Matsuyama, T. Yomo, S. Hashimoto, S. A. Kay, J. B. Hogenesch, and M. Iino. 2004. Universality and flexibility in gene expression from bacteria to human. Proc. Natl. Acad. Sci. USA 101:3765–3769.

Wagner, A. 2000. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. USA 97:6579–6584.

Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B Biol. Sci. 270:457–466.

Waterston, R. H., K. Lindblad-Toh, E. Birney, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Yanai, I., D. Graur, and R. Ophir. 2004. Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control. OMICS 8:15–24.

Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

Zhang, L., and W. H. Li. 2004. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 21:236–239.

Peer Bork, Associate Editor

Accepted July 14, 2004
