
RESEARCH Open Access

Refining transcriptional regulatory networks using network evolutionary models and gene histories

Xiuwei Zhang, Bernard ME Moret*

Abstract

Background: Computational inference of transcriptional regulatory networks remains a challenging problem, in part due to the lack of strong network models. In this paper we present evolutionary approaches to improve the inference of regulatory networks for a family of organisms by developing an evolutionary model for these networks and taking advantage of established phylogenetic relationships among these organisms. In previous work, we used a simple evolutionary model and provided extensive simulation results showing that phylogenetic information, combined with such a model, could be used to gain significant improvements on the performance of current inference algorithms.

Results: In this paper, we extend the evolutionary model so as to take into account gene duplications and losses, which are viewed as major drivers in the evolution of regulatory networks. We show how to adapt our evolutionary approach to this new model and provide detailed simulation results, which show significant improvement on the reference network inference algorithms. Different evolutionary histories for gene duplications and losses are studied, showing that our adapted approach is feasible under a broad range of conditions. We also provide results on biological data (cis-regulatory modules for 12 species of Drosophila), confirming our simulation results.

* Correspondence: [email protected]
Laboratory for Computational Biology and Bioinformatics, EPFL (Ecole Polytechnique Fédérale de Lausanne), EPFL-IC-LCBB, INJ230, Station 14, CH-1015 Lausanne, Switzerland

Zhang and Moret Algorithms for Molecular Biology 2010, 5:1
http://www.almob.org/content/5/1/1

© 2010 Zhang and Moret; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Transcriptional regulatory networks are models of the cellular regulatory system that governs transcription. Because establishing the topology of the network from bench experiments is very difficult and time-consuming, regulatory networks are commonly inferred from gene-expression data. Various computational models, such as Boolean networks [1], Bayesian networks [2], dynamic Bayesian networks (DBNs) [3], and differential equations [4,5], have been proposed for this purpose, along with associated inference algorithms. Results, however, have proved mixed: the high noise level in the data, the paucity of well studied networks, and the many simplifications made in the models all combine to make inference difficult.

Bioinformatics has long used comparative and, more generally, evolutionary approaches to improve the accuracy of computational analyses. Work by Babu’s group [6-8] on the evolution of regulatory networks in E. coli and S. cerevisiae has demonstrated the applicability of such approaches to regulatory networks. They posit a simple evolutionary model for regulatory networks, under which network edges are simply added or removed; they study how well such a model accounts for the dynamic evolution of the two most studied regulatory networks; they then investigate the evolution of regulatory networks with gene duplications [8], concluding that gene duplication plays a major role, in agreement with other work [9].

Phylogenetic relationships are well established for many groups of organisms; as the regulatory networks evolved along the same lineages, the phylogenetic relationships informed this evolution and so can be used to improve the inference of regulatory networks. Indeed, Bourque and Sankoff [10] developed an integrated algorithm to infer regulatory networks across a group of species whose phylogenetic relationships are known; they used the phylogeny to reconstruct networks under a simple parsimony criterion. In previous work [11], we presented two refinement algorithms, both based on phylogenetic information and using a likelihood framework, that boost the performance of any chosen network inference method. On simulated data, the receiver-operator characteristic (ROC) curves for our algorithms consistently dominated those of the standard approaches used alone; under comparable conditions, they also dominated the results from Bourque and Sankoff. Both our previous approach and that of Bourque and Sankoff are based on an evolutionary model that considers only edge gains and losses, so that the networks must all have the same number of genes (orthologous across all species). Moreover, the gain or loss of an edge in that model is independent of any other event. However, this process accounts for only a small part of regulatory network evolution; in particular, gene duplication is known to be a crucial source of new genetic function and a mechanism of evolutionary novelty [8,9].

In this paper we present a model of network evolution that takes into account gene duplications and losses and their effect on regulatory network structures. Such a model provides a direct evolutionary mechanism for edge gains and losses, while also enabling broader application and more flexible parameterization. For example, in the networks to be refined, the genes can have different numbers of copies for different organisms. Within this broader framework, the phylogenetic information that we use lies on two levels: the evolution of gene contents of the networks and the regulatory interactions of the networks. The former can be regarded as the basis of the latter, and can be obtained by inferring the history of gene duplications and losses during evolution. We then extend our refinement algorithms [11] to handle this data and use different models of gene duplications and losses to study their effect on the performance of the refinement algorithms.

Our experimental results confirm that our new algorithms provide significant improvements over the base inference algorithms, and support our analysis of the performance of refinement algorithms under different models of gene duplications and losses.

Background

Our refinement algorithms [11] work iteratively in two phases after an initialization step. First, we obtain the regulatory networks for the family of organisms; typically, these networks are inferred from gene-expression data for these organisms, using standard inference methods. We place these networks at the corresponding leaves of the phylogeny of the family of organisms and encode them into binary strings by simply concatenating the rows of their adjacency matrix. We then enter the iterative refinement cycle. In the first phase, we infer ancestral networks for the phylogeny (strings labelling internal nodes), using our own adaptation of the FastML [12] algorithm; in the second phase, these ancestral networks are used to refine the leaf networks. These two phases are then repeated as needed. Our refinement algorithms are formulated within a maximum likelihood (ML) framework and focused solely on refinement: they are algorithmic boosters for one’s preferred network inference method. Our new algorithms retain the same general approach, but include many changes to use the duplication/loss data and handle the new model.

Base network inference methods

We use both DBN and differential equation models as base inference methods in our experiments. When DBNs are used to model regulatory networks, an associated structure-learning algorithm is used to infer the networks from gene-expression data [3,13,14]; so as to avoid overly complex networks, a penalty on graph structure complexity is usually added to the ML score, thereby reducing the number of false positive edges. In [11] we used a coefficient kp to adjust the weight of this penalty and studied different tradeoffs between sensitivity and specificity, yielding the optimization criterion $\log \Pr(D \mid G, \hat{\Theta}_G) - k_p \, \#G \log N$, where D denotes the dataset used in learning, G is the (structure of the) network, $\hat{\Theta}_G$ is the ML estimate of parameters for G, #G is the number of free parameters of G, and N is the number of samples in D.

In models based on differential equations [4,5], a regulatory network is represented by the equation system dx/dt = f(x(t)) - Kx(t), where x(t) = (x1(t), ..., xn(t)) denotes the expression levels of the n genes and K (a matrix) denotes the degradation rates of the genes. The regulatory relationships among genes are then characterized by f(·). To get networks with different levels of sparseness, we applied different thresholds to the connection matrix to get final edges. In our experiments we use Murphy’s Bayesian Network Toolbox [15] for the DBN approach and TRNinfer [5] for the differential equation approach; we refer to them as DBI and DEI, respectively.

Refinement algorithms in our previous work

The principle of our phylogenetic approach is that phylogenetically close organisms are likely to have similarly close regulatory networks; thus independent network inference errors at the leaves get corrected in the ancestral reconstruction process along the phylogeny. In [11], we gave two refinement algorithms, RefineFast and RefineML. Each uses the globally optimized parents of the leaves to refine the leaves, but the first simply picks a new leaf network by sampling from the inferred distribution (given the parent network, the evolutionary model parameters, and the phylogeny), while the second combines the inferred distribution with a prior, the existing leaf network, using a precomputed belief coefficient that indicates our confidence in the current leaf network, and returns the most likely network under these parameters.

Inference of gene duplication and loss history

To infer ancestral networks with the extended network evolution model, we need a full history of gene duplications and losses for the gene families. Reconciliation of gene trees with the species tree [16-18] is one way to infer this history. The species tree is the phylogenetic tree whose leaves correspond to the modern organisms; gene duplications and losses occur along the branches of this tree. A gene tree is a phylogenetic tree whose leaves correspond to genes in orthologous gene families across the organisms of interest; in such a tree, gene duplications and speciations are associated with internal nodes.

When gene duplications and losses occur, the species trees and the gene trees may legitimately differ in topology. Reconciling these superficially conflicting topologies (that is, explaining the differences through a history of gene duplications and losses) is known as lineage sorting or reconciliation and produces a list of gene duplications and losses along each edge in the species tree. While reconciliation is a hard computational problem, algorithms have been devised for it in a Bayesian framework [16] or using a simple parsimony criterion [17]. In our experiments, we use the parsimony-based reconciliation tool Notung [17], but we also investigate the effect of using alternate histories.

The evolutionary model

The new network evolutionary model

Although transcriptional regulatory networks produced from bench experiments are available for only a few model organisms, other types of data have been used to assist in the comparative study of regulatory mechanisms across organisms [19-21]. For example, gene-expression data [21], sequence data like transcriptional factor binding site (TFBS) data [19,20], and cis-regulatory elements [21] have all been used in this context. Moreover, a broad range of model organisms have been studied, including bacteria [7], yeast [19,21], and fruit fly [20]. While these studies offer some insights, they have not to date sufficed to establish a clear model for regulatory networks or their evolution. Our new model remains simple, but can easily be generalized or associated with other models, such as the evolutionary model of TFBSs [22]. In this new model, the networks are represented by binary adjacency matrices.

The evolutionary operations are:

• Gene duplication: a gene is duplicated with probability pd. After a duplication, edges for the newly generated copy can be assigned as follows:

  - Neutral initialization: Create connections between the new copy and other genes randomly according to the proportion π1 of edges in the background network independently of the original copy.
  - Inheritance initialization: Connections of the duplicated copy are reported to correlate with those of the original copy [7-9]. This observation suggests letting the new copy inherit the connections of the original, then lose some of them or gain new ones at some fixed rate [23].
  - Preferential attachment: The new copy gets preferentially connected to genes with high connectivity [23,24].

• Gene loss: a gene is deleted along with all its connections with probability pl.
• Edge gain: an edge between two genes is generated with probability p01.
• Edge loss: an existing edge is deleted with probability p10.
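The four operations above can be sketched in a few lines of Python. This is only a minimal illustration, not the simulation code used in the experiments: the function name `evolve_step`, the order in which the operations are applied (losses, then duplications, then edge changes), and the way an inherited copy is wired to its original are all assumptions made for the example. Preferential attachment is left out.

```python
import random

def evolve_step(adj, pd, pl, p01, p10, pi1, init="neutral", rng=None):
    """Apply one round of the evolutionary operations to a binary matrix.

    adj[i][j] == 1 means gene i regulates gene j. The input is not modified.
    The ordering loss -> duplication -> edge gain/loss is an assumption.
    """
    rng = rng or random.Random()
    n = len(adj)
    # Gene loss: each gene is deleted, with all its edges, with probability pl.
    keep = [g for g in range(n) if rng.random() >= pl]
    cur = [[adj[i][j] for j in keep] for i in keep]
    # Gene duplication: each surviving gene is copied with probability pd.
    # range() is evaluated once, so new copies are not duplicated again.
    for g in range(len(cur)):
        if rng.random() < pd:
            m = len(cur)
            if init == "inheritance":
                # The copy inherits the row and column of the original gene.
                new_row = cur[g][:] + [cur[g][g]]
                new_col = [cur[i][g] for i in range(m)]
            else:
                # Neutral initialization: each edge is present with prob. pi1.
                new_row = [int(rng.random() < pi1) for _ in range(m + 1)]
                new_col = [int(rng.random() < pi1) for _ in range(m)]
            for i in range(m):
                cur[i].append(new_col[i])
            cur.append(new_row)
    # Edge gain (0 -> 1 with prob. p01) and edge loss (1 -> 0 with prob. p10).
    for i in range(len(cur)):
        for j in range(len(cur)):
            if cur[i][j] == 0 and rng.random() < p01:
                cur[i][j] = 1
            elif cur[i][j] == 1 and rng.random() < p10:
                cur[i][j] = 0
    return cur
```

Calling `evolve_step` repeatedly while walking down a tree would give one network per node; passing a seeded `random.Random` makes a run reproducible.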

The model parameters are thus pd, pl, the proportions of 0s and 1s in the networks Π = (π0 π1), the substitution matrix of 0s and 1s,

$$P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix},$$

plus parameters suitable to the initialization model.

Models of gene duplications and losses

While networks evolve according to the network evolutionary model described above, a history of gene duplications and losses is created along the evolution. However, during reconstruction, this history may not be exactly reconstructed. Therefore, we propose other models of gene duplications and losses to approximate the true history:

• The duplication-only model: We assume that different gene contents are due exclusively to gene duplication events.
• The loss-only model: We assume that different gene contents are due exclusively to gene loss events.

We also compare outcomes when the true history is known.

The new refinement methods

We begin by collecting the regulatory networks to be refined. These networks may have already been inferred or they can be inferred from gene-expression data at this point using any of the standard network inference methods. The genes in these networks are not required to be orthologous across all species, as the duplication/loss model allows for gene families of various sizes. Refinement proceeds in the two-phase iterative manner already described, but adding a step for reconstruction of gene duplication and loss history and suitably modified algorithms for ancestral reconstruction and leaf refinement:

1. Reconstruct the history of gene duplications and losses, from which the gene contents for the ancestral regulatory networks (at each internal node of the species tree) can be determined. We present algorithms for history reconstruction with different gene duplication and loss models.
2. Infer the edges in the ancestral networks once we have the genes of these networks. We do this using a revised version of FastML.
3. Refine the leaf networks with new versions of RefineFast and RefineML.
4. Repeat steps 2 and 3 as needed.

Inferring gene duplication and loss history

The duplication-only and loss-only models simplify the inference of the gene duplication and loss history and of the gene contents of the ancestors. For a given internal node of the phylogenetic tree, under the duplication-only assumption, its set of genes is the intersection of the genes of all the leaves in the subtree rooted at this node, while under the loss-only assumption, its set of genes is the union of the genes in all the leaves of the subtree. Gene duplication and loss histories inferred with these methods have a minimum number of gene duplications (respectively, losses); they are optimal under the model.

With both operations allowed, there are different ways of getting such a history. For example, the reconciliation algorithms introduced earlier can be used to reconstruct the history. In addition, the orthology assignment of the gene families across species can be leveraged for better inference of the history. FastML [12], which was designed to infer ancestral sequences given the sequences of a family of modern organisms, can be applied in this case after the following preprocessing. Suppose there are N different genes across all the modern organisms; we then represent the gene content of each organism with a binary sequence of length N, where the value at each position is 1 if the corresponding gene or its ortholog is present, and 0 otherwise. FastML can be used to obtain an estimate of these sequences for the ancestral organisms, with a character set {0, 1} and the substitution matrix:

$$P_h = \begin{pmatrix} 1 - p_d & p_d \\ p_l & 1 - p_l \end{pmatrix}.$$
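As a small illustration of the gene-content step, the sketch below computes the gene set of an internal node under the duplication-only and loss-only assumptions, and builds the presence/absence sequences that would be handed to FastML. The function names and the dictionary-based input format are hypothetical choices for this example; FastML itself is not called.

```python
def ancestral_genes(leaf_gene_sets, model):
    """Gene set of an internal node from the gene sets of the leaves below it."""
    sets = [set(s) for s in leaf_gene_sets]
    if model == "duplication-only":
        # Only duplications occurred, so the ancestor has the shared genes.
        return set.intersection(*sets)
    if model == "loss-only":
        # Only losses occurred, so the ancestor has every observed gene.
        return set.union(*sets)
    raise ValueError("model must be 'duplication-only' or 'loss-only'")

def presence_vectors(leaf_genes):
    """Encode each organism's gene content as a 0/1 string of length N.

    leaf_genes maps an organism name to the set of genes it contains
    (orthologs are assumed to share one identifier).
    """
    all_genes = sorted(set().union(*leaf_genes.values()))
    vectors = {
        org: "".join("1" if g in genes else "0" for g in all_genes)
        for org, genes in leaf_genes.items()
    }
    return all_genes, vectors
```

For two leaves with gene sets {g1, g2} and {g2, g3}, the duplication-only ancestor is {g2}, the loss-only ancestor is {g1, g2, g3}, and the two presence sequences are 110 and 011.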

Note that this approach assumes 1-1 orthologies, whereas orthology is a many-to-many relationship. In biological practice, however, 1-1 orthologies are by far the most common.

Inferring ancestral networks

FastML [12] assumes independence among the entries of the adjacency matrices and reconstructs ancestral characters one site at a time. When the gene content is the same in all networks, FastML can be used nearly unchanged, as in our previous work [11]. In our new model, however, the gene content varies across networks. We solve this problem by embedding all networks into one that includes every gene that appears in any network, taking the union of all gene sets. We then represent a network with a ternary adjacency matrix, where the rows and columns of the missing genes are filled with a special character x. All networks are thus represented with adjacency matrices of the same size. Since the gene contents of ancestral networks are known thanks to reconciliation, the entries with x are already identified in their matrices; other entries are reconstructed by our revised version of FastML, with a new character set S’ = {0, 1, x}. We modify the substitution matrix and take special measures for x during calculation. The substitution matrix P’ for S’ can be derived from the model parameters, without introducing new parameters:

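$$P' = \begin{pmatrix} p'_{00} & p'_{01} & p'_{0x} \\ p'_{10} & p'_{11} & p'_{1x} \\ p'_{x0} & p'_{x1} & p'_{xx} \end{pmatrix} = \begin{pmatrix} (1 - p_l)\,p_{00} & (1 - p_l)\,p_{01} & p_l \\ (1 - p_l)\,p_{10} & (1 - p_l)\,p_{11} & p_l \\ p_d\,\pi_0 & p_d\,\pi_1 & 1 - p_d \end{pmatrix}.$$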
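A quick way to check such a matrix is to build it from the model parameters and confirm that every row sums to 1. The sketch below assumes rows and columns ordered 0, 1, x, entries (1 - pl)·pab for a, b ∈ {0, 1}, pl for a transition into x, pd·πb for a transition out of x, and 1 - pd for x to x. It also assumes that P is row-stochastic, so that p00 = 1 - p01 and p11 = 1 - p10. The function name is hypothetical.

```python
def extended_substitution_matrix(p01, p10, pd, pl, pi1):
    """Build P' over the character set {0, 1, x} as a nested dict P[a][b]."""
    # Assumption: P is row-stochastic, so the diagonal follows from p01, p10.
    p00, p11 = 1.0 - p01, 1.0 - p10
    pi0 = 1.0 - pi1
    return {
        # Gene kept (prob. 1 - pl): the edge evolves by P; gene lost: x.
        "0": {"0": (1 - pl) * p00, "1": (1 - pl) * p01, "x": pl},
        "1": {"0": (1 - pl) * p10, "1": (1 - pl) * p11, "x": pl},
        # Gene gained by duplication (prob. pd): edge drawn from (pi0, pi1).
        "x": {"0": pd * pi0, "1": pd * pi1, "x": 1 - pd},
    }
```

Because each row is a mixture of a gene-level event and an edge-level event, the row sums equal 1 for any valid choice of parameters, which is a useful sanity check before running the reconstruction.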

Given P’, let i, j, k denote tree nodes, and a, b, c ∈ S’ possible values of a character at some node. For each character a at each node i, we maintain two variables:

• Li(a): the likelihood of the best reconstruction of the subtree with root i, given that the parent of i is assigned character a.
• Ci(a): the optimal character for i, given that its parent is assigned character a.

On a binary phylogenetic tree, for each site, our revised FastML then works as follows:

1. If leaf i has character b, then, for each a ∈ S’, set Ci(a) = b and Li(a) = pab.
2. If i is an internal node and not the root, its children are j and k, and it has not yet been processed, then
   • if i has character x, for each a ∈ S’, set Li(a) = pax · Lj(x) · Lk(x) and Ci(a) = x;
   • otherwise, for each a ∈ S’, set Li(a) = max c∈{0,1} pac · Lj(c) · Lk(c) and Ci(a) = argmax c∈{0,1} pac · Lj(c) · Lk(c).
3. If there remain unvisited nonroot nodes, return to Step 2.
4. If i is the root node, with children j and k, assign it the value a ∈ {0,1} that maximizes πa · Lj(a) · Lk(a), if the character of i is not already identified as x.
5. Traverse the tree from the root, assigning to each node its character by Ci(a).
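The five steps above can be written as one bottom-up pass followed by one top-down pass. The following is a minimal sketch for a single site, not the authors' implementation: the tree encoding (a dict from each internal node to its two children), the argument names, and the use of recursion are assumptions of this example. `P` is a nested dict over {0, 1, x} and `pi` holds the proportions π0 and π1.

```python
STATES = ("0", "1", "x")

def reconstruct_site(tree, root, leaf_char, is_x, P, pi):
    """Assign a character to every node of a binary tree for one site.

    tree:      dict, internal node -> (left child, right child)
    leaf_char: dict, leaf -> '0', '1' or 'x'
    is_x:      set of internal nodes already identified as 'x'
    P:         nested dict, P[a][b] = probability of a -> b along a branch
    pi:        dict, {'0': pi0, '1': pi1}
    """
    L, C = {}, {}

    def up(i):
        if i in leaf_char:                      # Step 1: leaves
            b = leaf_char[i]
            L[i] = {a: P[a][b] for a in STATES}
            C[i] = {a: b for a in STATES}
            return
        j, k = tree[i]
        up(j)
        up(k)
        if i == root:
            return
        # Step 2: an 'x' node has one option, others choose between 0 and 1.
        options = ("x",) if i in is_x else ("0", "1")
        L[i], C[i] = {}, {}
        for a in STATES:
            best = max(options, key=lambda c: P[a][c] * L[j][c] * L[k][c])
            C[i][a] = best
            L[i][a] = P[a][best] * L[j][best] * L[k][best]

    up(root)                                    # Steps 2-3: post-order pass
    j, k = tree[root]
    if root in is_x:                            # Step 4: the root
        assign = {root: "x"}
    else:
        assign = {root: max(("0", "1"),
                            key=lambda a: pi[a] * L[j][a] * L[k][a])}

    def down(i):                                # Step 5: top-down assignment
        for child in tree.get(i, ()):
            assign[child] = C[child][assign[i]]
            down(child)

    down(root)
    return assign
```

A full reconstruction would call this function once per entry of the embedded adjacency matrix, with `is_x` derived from the gene contents of the ancestors. Ties are broken in favour of the first option, which is a choice of this sketch.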

Refining leaf networks: RefineFast

RefineFast uses the parent networks inferred by FastML to evolve new sample leaf networks. Because the strategy is just one of sampling, we do not alter the gene contents of the original leaves: duplication and loss are not taken into account in this refinement step. Let Al and Ap be the adjacency matrices of a leaf network and its parent network, respectively, and let A′l stand for the refined network for Al; then the revised RefineFast algorithm carries out the following steps:

1. For each entry (i, j) of each leaf network Al,
   • if Al(i, j) ≠ x and Ap(i, j) ≠ x, evolve Ap(i, j) by P to get A′l(i, j);
   • otherwise, assign A′l(i, j) = Al(i, j).
2. Use the A′l(i, j) to replace Al(i, j).
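A minimal sketch of this sampling step follows, assuming matrices stored as lists of lists over the characters '0', '1', and 'x', and a binary substitution matrix `P` given as a nested dict. The function name and the data layout are illustrative only.

```python
import random

def refine_fast(A_leaf, A_parent, P, rng=None):
    """Return a refined leaf matrix sampled from the parent matrix.

    Entries where either the leaf or the parent holds 'x' are copied from
    the leaf, so the gene content of the leaf is left unchanged.
    """
    rng = rng or random.Random()
    refined = []
    for row_l, row_p in zip(A_leaf, A_parent):
        new_row = []
        for leaf_val, parent_val in zip(row_l, row_p):
            if leaf_val != "x" and parent_val != "x":
                # Evolve the parent's entry by P: draw the child's value
                # from the row of P indexed by the parent's value.
                new_row.append("1" if rng.random() < P[parent_val]["1"] else "0")
            else:
                new_row.append(leaf_val)
        refined.append(new_row)
    return refined
```

With small values of p01 and p10, most sampled entries simply copy the parent, which is how independent inference errors at a single leaf tend to be overwritten.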

In this algorithm, the original leaf networks are used only in the first round of ancestral reconstruction, after which they are replaced with the sample networks drawn from the distribution of possible children of the parents.

Refining leaf networks: RefineML

To make use of the prior information (in the original leaf networks), RefineML uses a belief coefficient kb for each entry of the adjacency matrices of these networks, which represents how much we trust the prediction by the base network inference algorithm. With the extended network evolution model, the value of kb is the combination of two items. One is the weights of the edges given by the inference algorithm, which can be calculated from the conditional probability table (CPT) parameters of the predicted networks in the DBN framework. The other depends on the distribution of the orthologs of corresponding genes over other leaves. Denote the number of leaves by Nl and the distance between leaf i and leaf j in the phylogenetic tree by dij; then the second item of kb of a certain entry for leaf k can be calculated by

[Equation not recoverable from the extracted text: an expression in hi and dik whose sums run over the leaves i = 1, ..., Nl with i ≠ k.]

where hi = 0 if leaf i has the corresponding genes, hi = 1 otherwise.

As in RefineFast, the refinement procedure does not alter the gene contents of the leaves. Using the same notations as for FastML and RefineFast, RefineML aims to find the A′l which maximizes the likelihood of the subtree between Ap and A′l. The revised RefineML algorithm thus works as follows:

1. Learn the CPT parameters for the leaf networks reconstructed by the base inference algorithm and calculate the belief coefficient kb for every site.
2. For each entry (i, j) of each leaf network Al, do:
   • If Al(i, j) ≠ x and Ap(i, j) ≠ x, let a = Ap(i, j), b = Al(i, j),
     (a) let Q(c) = kb if b = c, 1 - kb otherwise;
     (b) calculate the likelihood L(a) = max c∈{0,1} pac · Q(c);
     (c) assign A′l(i, j) = arg max c∈{0,1} pac · Q(c).
   • Otherwise, assign A′l(i, j) = Al(i, j).
3. Use A′l(i, j) to replace Al(i, j).
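Step 2 can be sketched as follows. The belief coefficients are taken as a precomputed input matrix `kb` (how they are derived from CPT parameters and ortholog distribution is not modelled here), and the function name and data layout are assumptions of this example.

```python
def refine_ml(A_leaf, A_parent, P, kb):
    """Return the refined leaf matrix under the belief coefficients kb.

    A_leaf, A_parent: lists of lists over '0', '1', 'x'
    P:  nested dict, P[a][c] = probability of parent a -> child c
    kb: matrix of belief coefficients in [0, 1], one per entry
    """
    refined = []
    for i, (row_l, row_p) in enumerate(zip(A_leaf, A_parent)):
        new_row = []
        for j, (b, a) in enumerate(zip(row_l, row_p)):
            if a != "x" and b != "x":
                def score(c, a=a, b=b, k=kb[i][j]):
                    # Q(c) = kb if c agrees with the current leaf, else 1 - kb.
                    q = k if c == b else 1.0 - k
                    return P[a][c] * q
                new_row.append(max(("0", "1"), key=score))
            else:
                new_row.append(b)
        refined.append(new_row)
    return refined
```

With p00 = p11 = 0.9, a leaf entry of 1 under a parent entry of 0 is overturned when kb = 0.6 (0.9 × 0.4 > 0.1 × 0.6) but kept when kb = 0.95 (0.1 × 0.95 > 0.9 × 0.05), which shows how the belief coefficient arbitrates between the inferred leaf and its parent.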

Experimental design

To test the performance of our approach, we need regulatory networks as the input to our refinement algorithms. In our simulation experiments, we evolve networks along a given tree from a chosen root network to obtain the “true” leaf networks. Then, in order to reduce the correlation between generation and reconstruction of networks, we use the leaf networks to create simulated expression data and use our preferred network inference method to reconstruct networks from the expression data. These inferred networks are the true starting point of our refinement procedure: we use the simulated gene expression data only to achieve better separation between the generation of networks and their refinement, and also to provide a glimpse of a full analysis pipeline for biological data. We then compare the inferred networks after and before refinement against the “true” networks (generated in the first step).

Despite the advantages of such simulation experiments (which allow an exact assessment of the performance of the inference and refinement algorithms), results on biological data are highly desirable, as such data may prove quite different from what was generated in our simulations. TFBS data is used to study regulatory networks, assuming that the regulatory interactions determined by transcription factor (TF) binding share many properties with the real interactions [19,20,25]. Given this close relationship between regulatory networks and TFBSs and given the large amount of available data on TFBSs, we chose to use TFBS data to derive regulatory networks for the organisms as their “true” networks, rather than generate these networks through simulation. In this fashion, we produce datasets for the cis-regulatory modules (CRMs) for 12 species of Drosophila.

With the extended evolutionary model, conducting experiments with real data involves several extra steps besides the refinement step, each of which is a potential source of errors. For example, assuming we have identified gene families of interest, we need to build gene trees or assign orthologies for these genes to be able to reconstruct a history of duplications and losses. Any error in gene tree reconstruction or orthology determination leads to magnified errors in the history of duplications and losses. Assessing the results under such circumstances (no knowledge of the true networks and many complex sources of error) is not possible, so we turned to simulation for this part of the testing. This decision does not prejudice our ability to apply our approach to real data and to infer high-quality networks: it only reflects our inability to compute precise accuracy scores on biological data.

Experiments on biological data with the basic evolutionary model

We use regulatory networks derived from TFBS data as the “true” networks for the organisms rather than generating these networks through simulations. Such data is available for the Drosophila family (whose phylogeny is well studied) with 12 organisms: D. simulans, D. sechellia, D. melanogaster, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. The TFBS data is drawn from the work of Kim et al. [22], where the TFBSs are annotated for all 12 organisms on 51 CRMs. We conduct separate experiments on different CRMs.

For each CRM, we choose orthologous TFBS sequencesof 6 transcription factors (TFs): Dstat, Bicoid, Caudal,Hunchback, Kruppel, and Tailless, across the 12 organ-isms. Then for each organism, we can get a networkwith these 6 TFs and the target genes indicated by theTFBS sequences, where the arcs are determined by theannotation of TFBSs and the weights of arcs are calcu-lated from the binding scores provided in [22]. (In thispaper we do not distinguish TFs and target genes andcall them all “genes.”) These networks are regarded asthe “true” regulatory networks for the organisms.Gene-expression data is then generated from these

“true” regulatory networks; data is generated indepen-dently for each organism, using procedure DBNSim,based on the DBN model [11]. Following [14], DBNSimuses binary gene-expression levels, where 1 and 0 indi-cate that the gene is, respectively, on and off. Denote theexpression level of gene gi by xi, xi ∈ {0,1}; if mi nodeshave arcs directed to gi in the network, let the expres-sion levels of these nodes be denoted by the vector y =y1y2 ... ymi

and the weights of their arcs by the vector w= w1 = w2 ... wmi

. From y and w, we can get the condi-tional probability Pr(xi|y). Once we have the full para-meters of the leaf networks, we generate simulated

time-series gene-expression data. At the initial timepoint, the value of xi is generated by the initial distribu-tion Pr(xi); xi at time t is generated based on y at time t- 1 and the conditional probability Pr(xi|y). We generate100 time points of gene-expression data for each net-work in this manner. With this data we can apply ourapproach. DBI is applied to infer regulatory networksfrom the gene-expression data. The inferred networksare then refined by RefineFast and RefineML. The wholeprocedure is run 10 times to provide smoothing and wereport average performance over these runs.Experiments on simulated data with the extended modelData simulationIn these experiments, the “true” networks for the organ-isms and their gene-expression data are both generated,starting from three pieces of input information: the phy-logenetic tree, the network at the root, and the evolu-tionary model. While simulated data allows us to getabsolute evaluation of our refinement algorithms, speci-fic precautions need to be taken against systematic biasduring data simulation and result analysis. We use awide variety of phylogenetic trees from the literature (ofmodest sizes: between 20 and 60 taxa) and severalchoices of root networks, the latter variations on part ofthe yeast network from the KEGG database [26], as alsoused by Kim et al. [3]; we also explore a wide range ofevolutionary rates, especially different rates of geneduplication and loss. The root network is of modestsize, between 14 and 17 genes, a relatively easy case forinference algorithms and thus also a more challengingcase for a boosting algorithm.We first generate the leaf networks that are used as the

“true” regulatory networks for the chosen organisms.Since we need quantitative relationships in the networksin order to generate gene-expression data from each net-work, in the data generation process, we use adjacencymatrices with signed weights. Weight values are assignedto the root network, yielding a weighted adjacency matrixAp. To get the adjacency matrix for its child Ac, accordingto the extended network evolution model, we follow twosteps: evolve the gene contents and evolve the regulatoryconnections. First, genes are duplicated or lost by pdandpl. If a duplication happens, a row and column for thisnew copy will be added to Ap, the values initialized eitheraccording to the neutral initialization model or theinheritance initialization model. (We conducted experi-ments under both models.) We denote the current adja-cency matrix as Ac . Secondly, edges in Ac are mutatedaccording to p01 and p10 to get Ac. We repeat this processas we traverse down the tree to obtain weighted adja-cency matrices at the leaves, which is standard practicein the study of phylogenetic reconstruction [27,28].To test our refinement algorithms on different kinds

of data, besides DBNSim, we also use Yu’s GeneSim


of 12

[29] to generate continuous gene-expression data from the weighted leaf networks. Denoting the gene-expression levels of the genes at time t by the vector x(t), the values at time t + 1 are calculated according to x(t + 1) = x(t) + (x(t) - z)C + ε, where C is the weighted adjacency matrix of the network, the vector z represents constitutive expression values for each gene, and ε models noise in the data. The values of x(0) and x_i(t) for those genes without parents are chosen uniformly at random from the range [0,100], while the values of z are all set to 50. The term (x(t) - z)C represents the effect of the regulators on the genes; this term needs to be amplified for the use of DBI, because of the required discretization. We multiply the regulation term by a factor k_e (set to 7 in our experiments), yielding the new equation x(t + 1) = x(t) + k_e (x(t) - z)C + ε.

Groups of experiments

With two data generation methods, DBNSim and GeneSim, and two base inference algorithms, DBI and DEI, we can conduct experiments with different combinations of data generation methods and inference algorithms to verify that our boosting algorithms work across these settings. First, we use DBNSim to generate data for DBI. We generate 13 × n time points for a network with n genes, since larger networks generally need more samples to reach an inference accuracy comparable to that of smaller ones. Second, we apply DEI to datasets generated by GeneSim to infer the networks. Since the DEI tool TRNinfer does not accept large datasets (with many time points), here we use smaller datasets than in the previous group of experiments, with at most 75 time points. For each setup, experiments with different rates of gene duplication and loss are conducted.

For each combination of rates of gene duplication and loss, data generation methods, and base network inference methods, we get the networks inferred by DBI or DEI for the family of organisms. We then run refinement algorithms on each set of networks with different gene duplication and loss histories: the duplication-only history, the loss-only history, the history reconstructed by FastML given the true orthology assignment, and that reconstructed by Notung [17] without orthology information as input. In addition, since simulation experiments allow us to record the true gene duplication and loss history during data generation, we can also test the accuracy of the refinement algorithms with the true history, without mixing their performance with that of gene tree reconstruction or reconciliation. Each experiment is run 10 times to obtain average performance.

Measurements

We want to examine the predicted networks at different levels of sensitivity and specificity. For DBI, on each dataset, we apply different penalty coefficients to predict regulatory networks, from 0 to 0.5 in steps of 0.05, which results in 11 discrete coefficients. For each penalty coefficient, we apply RefineFast and RefineML on the predicted networks. For DEI, we also choose 11 thresholds for each predicted weighted connection matrix to get networks at various sparseness levels. For each threshold, we apply RefineFast on the predicted networks. We measure specificity and sensitivity to evaluate the performance of the algorithms and plot the values, as measured on the results for the various penalty coefficients (for DBI) and thresholds (for DEI), to yield ROC curves. In such plots, the larger the area under the curve, the better the results.
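Two steps of this setup can be made concrete in a few lines: the GeneSim update rule given earlier and the sensitivity/specificity measurement used for the ROC curves. The following is a minimal sketch, not the code used in the experiments. The function names are hypothetical, the noise term ε is assumed to be Gaussian with a caller-chosen standard deviation (the text does not specify its distribution), x(t) is treated as a row vector so that C[i][j] is the weight of the edge from gene i to gene j, and all n × n ordered pairs of genes are counted as candidate edges when computing specificity.

```python
import random


def genesim_step(x, C, z, k_e=7.0, noise_sd=1.0, rng=random):
    """One update x(t+1) = x(t) + k_e * (x(t) - z) C + eps, with x as a row vector."""
    n = len(x)
    dev = [x[i] - z[i] for i in range(n)]
    nxt = []
    for j in range(n):
        has_parent = any(C[i][j] != 0 for i in range(n))
        if not has_parent:
            # Genes without regulators are redrawn uniformly from [0, 100].
            nxt.append(rng.uniform(0.0, 100.0))
            continue
        effect = sum(dev[i] * C[i][j] for i in range(n))
        # Assumption: Gaussian noise; the text only says eps models noise.
        nxt.append(x[j] + k_e * effect + rng.gauss(0.0, noise_sd))
    return nxt


def threshold_network(W, threshold):
    """Keep edge i -> j when |W[i][j]| exceeds the threshold (as in a DEI-style sweep)."""
    n = len(W)
    return {(i, j) for i in range(n) for j in range(n) if abs(W[i][j]) > threshold}


def sensitivity_specificity(true_edges, predicted_edges, n_nodes):
    """Edges are (regulator, target) pairs; every ordered pair is a candidate edge."""
    tp = len(true_edges & predicted_edges)
    fp = len(predicted_edges - true_edges)
    fn = len(true_edges - predicted_edges)
    tn = n_nodes * n_nodes - tp - fp - fn
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 3-gene chain: gene 0 activates gene 1, gene 1 represses gene 2.
    C = [[0.0, 0.01, 0.0], [0.0, 0.0, -0.01], [0.0, 0.0, 0.0]]
    z = [50.0] * 3
    x = [rng.uniform(0.0, 100.0) for _ in range(3)]
    for _ in range(5):
        x = genesim_step(x, C, z, rng=rng)
    truth = threshold_network(C, 0.0)
    guess = {(0, 1), (2, 0)}  # one correct edge, one spurious edge
    print(sensitivity_specificity(truth, guess, 3))
```

Sweeping the threshold (or, for DBI, the penalty coefficient) and collecting the resulting (sensitivity, specificity) pairs gives the points of one ROC curve.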

Results and analysis

Results on biological data with the basic evolutionary model

We conducted experiments on different CRMs of the 12 Drosophila species; here we show results on two of them. In both experiments, regulatory networks have 6 TFs and 12 target genes, forming networks with 18 nodes. Average performance for the base inference algorithm (DBI) and for the two refinement algorithms over 10 runs for these two experiments is shown in Fig. 1 using ROC curves. In the two plots, the points on each curve are obtained with different structure complexity penalty coefficients. From Fig. 1 we can see that the improvement of our refinement algorithms over the base algorithm is significant: RefineML significantly improves both sensitivity and specificity, whereas RefineFast loses a little sensitivity but gains specificity for sparse networks. (In both CRMs, the standard deviation on sensitivity is around 0.05 and that on specificity around 0.005.) The dominance of RefineML over RefineFast shows the advantage of reusing the leaf networks inferred by base algorithms, especially when the error rate in these leaf networks is low. Results on other CRMs show similar improvement of our refinement algorithms. Besides the obvious improvement, we can also observe the fluctuation of the curves: in theory, sensitivity can be traded for specificity and vice versa, so that the ROC curves should have “smooth” shapes, which is not the case for Fig. 1. Various factors can account for this: the shortage of gene-expression data to infer the network, the noise inherent in biological data, the special structure of the networks to be inferred, or the relatively small amount of data involved, leading to higher variability. (We have excluded the first possibility by generating larger gene-expression datasets for inference algorithms, where similar fluctuations still occur.)

Figure 1 Performance of refinement algorithms on Drosophila data, with basic network evolution model. (A) Results on CRM abd-A_iab-2_1.7_; (B) Results on CRM Abd-B_IAB5.

We also measure how much the “true” networks of the 12 organisms differ, to obtain a view of the evolutionary rate of regulatory networks in our datasets. For each CRM, we take the union of the edges in all 12 networks, classify these edges by the number of networks in which they are present, and calculate the proportion of each category. The overall proportions on all CRMs are shown in Table 1; half of the edges are present in fewer than half of the organisms (five species or fewer), meaning that the networks are quite diverse. Therefore, the improvement brought by the refinement algorithms is due to the use of phylogenetic information rather than the averaging effect of trees.

Table 1 The proportion of edges shared by different numbers of species

Number of species      1     2     3     4     5     6     7     8     9     10    11    12
Proportion of edges    0.19  0.18  0.03  0.07  0.03  0.09  0.03  0.09  0.02  0.07  0.07  0.13
Cumulative fraction    0.19  0.37  0.40  0.47  0.50  0.59  0.62  0.71  0.73  0.80  0.87  1.00

Results with extended evolutionary model

We used different evolutionary rates to generate the networks for the simulation experiments. In [11] we tested mainly edge gain or loss rates; here we focus on testing different gene duplication and loss rates. We also conducted experiments on various combinations of gene-expression data generation methods and network inference methods. The inferred networks were then refined by refinement algorithms with different models of gene duplications and losses. We do not directly compare the extended model with the basic one, as the two do not lend themselves to a fair comparison: for instance, the basic model requires equal gene contents across all leaves, something that can only be achieved by restricting the data to a common intersection, thereby catastrophically reducing sensitivity.

Since the results of using neutral initialization and inheritance initialization in data generation are very similar, we only show results with the neutral initialization model. We first refine networks with the true gene duplication and loss history to test the pure performance of the refinement algorithms; we then present and discuss the results of refinement algorithms with several other gene evolution histories, which are more suitable for application to real biological data. All results we show below are averages over 10 runs.

Refine with true history of gene duplications and losses

In Fig. 2, we show the results of the experiments with DBNSim used to generate gene-expression data, and DBI as base inference algorithm. All results with DBI inference that we show are on one representative phylogenetic tree with 35 nodes on 7 levels, and the root network has 15 genes. The left plot has a relatively high rate of gene duplication and loss (resulting in 20 duplications and 23 losses along the tree), while the right one has a slightly lower rate (with 19 duplications and 15 losses), again averaged over 10 runs.

Given the size of the tree and the root network, these are high rates of gene duplication and loss, yet, as we can see from Fig. 2, the improvement gained by our refinement algorithms remains clear in both plots, while RefineML further dominates RefineFast in both sensitivity and specificity, thanks to the appropriate reuse of the inferred leaf networks.

Figure 2 Performance with extended evolution model and DBI inference method, and true history of gene duplications and losses. (A) Results with higher gene duplication and loss rates; (B) Results with lower gene duplication and loss rates.

In the experiments with DEI network inference, GeneSim is used to generate continuous gene-expression data. In these experiments, the root network has 14 genes, and the phylogenetic tree has 37 nodes on 7 levels. The average performance of DEI and RefineFast over 10 runs is shown in Fig. 3. We also show results for two different evolutionary rates: Fig. 3(A) has higher gene duplication and loss rates, resulting in 15 duplications and 7 losses, while datasets in Fig. 3(B) have an average of 8 duplications and 3 losses. The DEI tool aims to infer networks with small gene-expression datasets. RefineFast significantly improves the performance of the base algorithm, especially the sensitivity. (Sensitivity for DEI is poor in these experiments, because of the inherent lower sensitivity of TRNinfer, as seen in [11], and also because of the reduced size of the gene-expression datasets.) Since the difference between the gene duplication and loss rates in Fig. 3(A) and Fig. 3(B) is large, we can observe more improvement in Fig. 3(B), which has lower rates. This is because high duplication and loss rates give rise to a large overall gene population, yet many of these genes exist only in a few leaves, so that there is not much phylogenetic information to be used to correct the prediction of the connections for these genes.

Refine with duplication-only and loss-only histories

We have seen from Figs. 2 and 3 that our two refinement algorithms improve the networks inferred by both DBI and DEI. Since DBI is much more accurate than DEI (whose lower accuracy makes refinement harder), and since RefineML does clearly better than RefineFast, hereafter we only show results with DBI inference and RefineFast refinement, which are on the same datasets as used in Fig. 2.

Fig. 4 compares the performance of DBI and RefineFast with, respectively, the true gene duplication and loss history, the duplication-only history, and the loss-only history, assuming correct orthology assignment. The duplication-only and loss-only assumptions are at the opposite (and equally unrealistic) extremes of possible models of gene family evolution; their only positive attribute is that they facilitate the reconstruction of that evolution. Yet we see that RefineFast still improves the base network inference algorithm with both models. The performance of the duplication-only history differs between Fig. 4(A) and Fig. 4(B): in Fig. 4(A), it does worse than the true history and the loss-only history, while in Fig. 4(B), its performance is comparable with the other two. This is because there are more gene losses than gene duplications in the left plot, but more gene duplications than gene losses in the right plot, which the duplication-only history matches better. The performance of the loss-only history appears to be steady and not much affected by different evolutionary rates.

Refine with inferred histories of gene duplications and losses

In Fig. 5, we show the performance of the refinement algorithm with various inferred gene duplication and loss histories, compared to that with the true history. FastML is applied to infer the history with correct orthology information as described earlier. To test the value of having good orthology information, we also assign orthologies at random and then use FastML to infer ancestral gene contents. In each run, the refinement procedure with this history is repeated 20 times to get average results over 20 random orthology assignments. Finally, we use Notung to reconstruct a gene duplication and loss history without orthology input; Notung not only infers the gene contents for ancestral networks, but also alters the gene contents of the leaves.

In both Fig. 5(A) and Fig. 5(B) the history reconstructed by FastML with correct orthology does as well as the true history. In fact, the history is very accurately reconstructed, which explains why the two curves agree so closely. However, with the history reconstructed by FastML under random orthology assignments, the refinement algorithm only improves slightly over the base algorithm. With Notung inference, RefineFast still dominates DBI in Fig. 5(B), but not in Fig. 5(A), which has higher evolutionary rates.

On using histories of gene duplications and losses, and orthology assignments

Our experiments with various evolutionary histories lead to several conclusions:

1. Good orthology assignments are important.
2. When we have good orthology assignments, the refinement algorithms need not rely on the true history of gene duplications and losses. We can use the loss-only history or the history reconstructed by FastML, both of which are easy to build and lead to performance similar to that of the true history.

Figure 3 Performance with extended evolution model and DEI inference method, and true history of gene duplications and losses. (A) Results with higher gene duplication and loss rates; (B) Results with lower gene duplication and loss rates.

Figure 4 Performance with extended evolution model and DBI inference method, with duplication-only and loss-only histories. (A) Results with higher gene duplication and loss rates; (B) Results with lower gene duplication and loss rates.

Conclusions and future work

We presented a model, associated algorithms, and experimental results to test the hypothesis that a more refined model of transcriptional regulatory network evolution would support additional refinements in accuracy. Specifically, we presented a new version of our evolutionary approach to refine the accuracy of transcriptional regulatory networks for phylogenetically related organisms, based on an extended network evolution model, which takes into account gene duplication and loss. As these events are thought to play a crucial role in evolving new functions and interactions [8,9], integrating them into the model both extends the range of applicability of our refinement algorithms and enhances their accuracy. Furthermore, to give a comprehensive analysis of the factors that affect the performance of the refinement algorithms, we conducted experiments with different histories of gene duplications and losses, and different orthology assignments. Results of experiments under various settings show the effectiveness of our refinement algorithms with the new model across a broad range of gene duplications and losses.

We also collected regulatory networks from the TFBS data of 12 Drosophila species and applied our approach (using the basic model), with very encouraging results. These results confirm that phylogenetic relationships carry over to the regulatory networks of a family of organisms and can be used to improve network inference and to help with further analysis of regulatory systems and their dynamics and evolution.

Our positive results with the extended network evolution model show that refined models can be used in inference to good effect. Our current model can itself be refined by using the widely studied evolution of TFBS [20,22].

Acknowledgements

A preliminary version of this paper appeared in the proceedings of the 9th Workshop on Algorithms in Bioinformatics WABI’09 [30].

Authors’ contributions

XZ designed, implemented, and ran all simulation and refinement methods; XZ and BMEM collaborated closely in the elaboration of the project, the analysis of the data, and the writing of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Received: 10 August 2009. Accepted: 4 January 2010. Published: 4 January 2010.

References

1. Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Proc. 4th Pacific Symp. on Biocomputing (PSB’99), World Scientific 1999, 4:17-28.
2. Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian Networks to Analyze Expression Data. J Comput Bio 2000, 7(3-4):601-620.
3. Kim SY, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioinformatics 2003, 4(3):228-235.
4. Chen T, He HL, Church GM: Modeling gene expression with differential equations. Proc. 4th Pacific Symp. on Biocomputing (PSB’99), World Scientific 1999, 29-40.
5. Wang R, Wang Y, Zhang X, Chen L: Inferring Transcriptional Regulatory Networks from High-throughput Data. Bioinformatics 2007, 23(22):3056-3064.
6. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Curr Opinion in Struct Bio 2004, 14(3):283-291.
7. Babu MM, Teichmann SA, Aravind L: Evolutionary Dynamics of Prokaryotic Transcriptional Regulatory Networks. J Mol Bio 2006, 358(2):614-633.
8. Teichmann SA, Babu MM: Gene regulatory network growth by duplication. Nature Genetics 2004, 36(5):492-496.
9. Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA: Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 2007, 308B:58-73.
10. Bourque G, Sankoff D: Improving gene network inference by comparing expression time-series across species, developmental stages or tissues. J Bioinform Comput Biol 2004, 2(4):765-783.
11. Zhang X, Moret BME: Boosting the Performance of Inference Algorithms for Transcriptional Regulatory Networks Using a Phylogenetic Approach. Proc. 8th Int’l Workshop Algs. in Bioinformatics (WABI’08), Lecture Notes in Computer Science, Springer 2008, 5251:245-258.
12. Pupko T, Pe’er I, Shamir R, Graur D: Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid Sequences. Mol Bio Evol 2000, 17(6):890-896.
13. Friedman N, Murphy KP, Russell S: Learning the structure of dynamic probabilistic networks. Proc. 14th Conf. on Uncertainty in Art. Intell. UAI’98 1998, 139-147.
14. Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Proc. 3rd Pacific Symp. on Biocomputing (PSB’98), World Scientific 1998, 3:18-29.
15. Murphy KP: The Bayes Net Toolbox for MATLAB. Computing Sci and Statistics 2001, 33:331-351.
16. Arvestad L, Berglund AC, Lagergren J, Sennblad B: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. Proc. 8th Ann. Int’l Conf. Comput. Mol. Bio. (RECOMB’04), New York, NY, USA: ACM 2004, 326-335.
17. Durand D, Halldórsson BV, Vernot B: A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction. J Comput Bio 2006, 13(2):320-335.
18. Page RDM, Charleston MA: From Gene to Organismal Phylogeny: Reconciled Trees and the Gene Tree/Species Tree Problem. Molecular Phylogenetics and Evolution 1997, 7(2):231-240.
19. Crombach A, Hogeweg P: Evolution of Evolvability in Gene Regulatory Networks. PLoS Comput Biol 2008, 4(7):e1000112.
20. Stark A, Kheradpour P, Roy S, Kellis M: Reliable prediction of regulator targets using 12 Drosophila genomes. Genome Res 2007, 17:1919-1931.
21. Tanay A, Regev A, Shamir R: Conservation and evolvability in regulatory networks: The evolution of ribosomal regulation in yeast. Proc Nat’l Acad Sci USA 2005, 102(20):7203-7208.
22. Kim J, He X, Sinha S: Evolution of Regulatory Sequences in 12 Drosophila Species. PLoS Genet 2009, 5:e1000330.

Figure 5 Performance with extended evolution model and DBI inference method, with inferred mixture histories. (A) Results with higher gene duplication and loss rates; (B) Results with lower gene duplication and loss rates.


















23. Bhan A, Galas DJ, Dewey TG: A duplication growth model of gene expression networks. Bioinformatics 2002, 18(11):1486-1493.
24. Barabási AL, Oltvai ZN: Network biology: understanding the cell’s functional organization. Nat Rev Genet 2004, 5:101-113.
25. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431:99-104.
26. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34:D354-D357.
27. Hillis DM: Approaches for assessing phylogenetic accuracy. Syst Bio 1995, 44:3-16.
28. Moret BME, Warnow T: Reconstructing optimal phylogenetic trees: A challenge in experimental algorithmics. Experimental Algorithmics, Lecture Notes in Computer Science. Edited by Fleischer R, Moret B, Schmidt E. Springer Verlag 2002, 2547:163-180.
29. Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 2004, 20(18):3594-3603.
30. Zhang X, Moret BME: Improving Inference of Transcriptional Regulatory Networks Based on Network Evolutionary Models. Proc. 9th Int’l Workshop Algs. in Bioinformatics (WABI’09), Lecture Notes in Computer Science, Springer 2009, 5724:412-425.

doi:10.1186/1748-7188-5-1
Cite this article as: Zhang and Moret: Refining transcriptional regulatory networks using network evolutionary models and gene histories. Algorithms for Molecular Biology 2010 5:1.
