
Supplementary Information

Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks

Daniel Marbach♣, Sushmita Roy♣, Ferhat Ay, Patrick E. Meyer, Tamer Kahveci, Christopher A. Bristow, and Manolis Kellis

Companion Website

compbio.mit.edu/flynet: Data and source code to reproduce networks and validation results (also included in Supplementary File 9).

Supplementary Figures

S1 Fraction of integrative network edges with and without physical support
S2 Degree distributions of integrative networks
S3 Structural properties of integrative networks
S4 Network motifs of integrative networks
S5 Prediction of novel functional annotations
S6 Expression prediction for eight representative target genes
S7 Squared errors for expression prediction using integrative and physical networks
S8 Correlation of predictable and unpredictable genes across two time courses
S9 Definition of chromatin profiles
S10 Relative importance of input features for the unsupervised integrative network

Supplementary Tables

S1 Performance for gene expression prediction

Supplementary Notes

S1 Network motifs for integrative and mixed regulatory networks


Supplementary Files

S1 integrative regulatory networks.zip: This zip file contains two tab-separated files representing the unsupervised and supervised integrative networks. The files have three columns: the first column is the name of the regulator, the second column is the name of the target gene, and the third column is the score of the edge connecting the regulator and the target. Genes are given by their FBgn numbers.

S2 mixed regulatory networks.zip: The mixed transcriptional and post-transcriptional (TFs and microRNA) regulatory networks used for the network motif analysis described in Supplementary Note S1.

S3 struct properties sup network.zip: Sheet 1: Summary of the number of nodes and edges in the supervised integrative network and the mixed regulatory network. Sheet 2: Transcription factors in the supervised integrative network and their in/out-degrees. Sheet 3: miRNAs in the mixed regulatory network and their in/out-degrees. Sheet 4: In-degrees of target genes and transcription factors of the supervised integrative network.

S4 struct properties unsup network.zip: Sheet 1: Summary of the number of nodes and edges in the unsupervised integrative network and the mixed regulatory network. Sheet 2: Transcription factors in the unsupervised integrative network and their in/out-degrees. Sheet 3: miRNAs in the mixed regulatory network and their in/out-degrees. Sheet 4: In-degrees of target genes and transcription factors of the unsupervised integrative network.

S5 GO enrichment by node degree.xls: GO biological process enrichments for high (> 150) and low (= 1) in-degree gene sets.

S6 triads.xlsx: Sheet 1: List of all possible triads (subnetworks with only 3 nodes) for mixed regulatory networks. Sheet 2: List of all possible triads (subnetworks with only 3 nodes) for the integrative networks.

S7 gene function prediction.zip: This archive contains two tgz files with sub-directories, one for the supervised and one for the unsupervised integrative network. Within each sub-directory is one file per GO process. Each line has four columns: the first column is the name of the process, the second column is the FBgn number of a gene, the third column is the FDR, and the fourth column is a label identifying the type of prediction. The labels take the values NEW (previously unannotated gene), KNOWN PRED (additional new prediction for a previously annotated gene), and KNOWN HIT (previously annotated with the same function, that is, a true positive). Only predictions that satisfy an FDR of 0.6 or lower and that are not in the FALSE set are included.

S8 ImaGO enrichment for predicted GO terms.xls: ImaGO enrichments for predicted GO terms for both unannotated genes and genes with previous annotations.

S9 datasets and source code.zip: Data and source code to reproduce networks and validation results.
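The three-column layout described for the network files in Supplementary File S1 (regulator, target, score, tab-separated) can be read with a few lines of Python. This is a minimal sketch rather than part of the distributed source code: the `Edge` type, the `read_edges` helper, the example identifiers, and the 0.5 score cutoff are all illustrative assumptions, and whether the files carry a header line is not stated, so non-numeric scores are simply skipped.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Edge(NamedTuple):
    regulator: str  # FBgn identifier of the regulator
    target: str     # FBgn identifier of the target gene
    score: float    # score of the regulator-target edge


def read_edges(lines: Iterable[str]) -> List[Edge]:
    """Parse tab-separated 'regulator<TAB>target<TAB>score' lines."""
    edges = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        try:
            edges.append(Edge(row[0], row[1], float(row[2])))
        except ValueError:
            continue  # skip a header line, if there is one (assumption)
    return edges


# Made-up rows: identifiers and scores are placeholders, not real data.
example = io.StringIO(
    "FBgn0000001\tFBgn0000002\t0.91\n"
    "FBgn0000001\tFBgn0000003\t0.15\n"
)
edges = read_edges(example)
top = [e for e in edges if e.score >= 0.5]  # hypothetical score cutoff
```

To read an actual file, pass an open file handle instead of the `io.StringIO` object.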


Supplementary Figures

[Figure S1. Two panels: "Supervised integrative network" and "Unsupervised integrative network". x-axis: Fraction of edges considered (0 to 0.3); y-axis: Fraction REDfly edges recovered (0 to 0.7). Curves: Total, Not physical, Motif only, Motif & ChIP, ChIP only.]

Figure S1: Fraction of integrative network edges with and without physical support. Recovery of known REDfly edges at different thresholds for the two integrative networks. The black curve (total) shows the fraction of the 204 REDfly edges recovered versus the fraction of predicted interactions considered for the integrative networks (cf. Fig. 2B). The area between the colored curves shows the fraction of these recovered edges that are supported by ChIP only, ChIP and conserved motifs, conserved motifs only, and functional datasets only (chromatin and expression). The majority of recovered edges are supported by ChIP and/or conserved motifs, especially the edges at the top of the prediction lists. However, at lower ranks some REDfly edges are also recovered based on functional data alone.

[Figure S2. Two log-log panels: "In-degree distributions" (x-axis: In-degree; y-axis: Frequency) and "Out-degree distributions" (x-axis: Out-degree; y-axis: Frequency). Series: Integrative (supervised), Integrative (unsupervised).]

Figure S2: Degree distributions of integrative networks. Number of targets per TF (out-degree) and number of regulators per target (in-degree) for the supervised and unsupervised integrative networks, plotted on a log-log scale. Both networks show roughly a scale-free distribution for both in-degrees and out-degrees, with a small number of regulators showing many targets, and a small number of genes targeted by many regulators.
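The in-degree and out-degree counts behind a plot like Figure S2 can be derived directly from a list of directed regulator-to-target edges. The following is an illustrative sketch only; the function names and the toy edge list are assumptions, not taken from the study's code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def degree_counts(edges: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Return (out_degree, in_degree) for directed (regulator, target) edges."""
    unique = set(edges)  # count each regulator-target pair once
    out_degree = Counter(reg for reg, _ in unique)  # targets per regulator
    in_degree = Counter(tgt for _, tgt in unique)   # regulators per target
    return out_degree, in_degree


def degree_histogram(degrees: Counter) -> Dict[int, int]:
    """Map each degree value k to the number of nodes having that degree."""
    return dict(sorted(Counter(degrees.values()).items()))


# Toy network: tf1 regulates g1, g2 and tf2; tf2 regulates g1.
toy_edges = [("tf1", "g1"), ("tf1", "g2"), ("tf1", "tf2"), ("tf2", "g1")]
out_deg, in_deg = degree_counts(toy_edges)
out_hist = degree_histogram(out_deg)  # {1: 1, 3: 1}
in_hist = degree_histogram(in_deg)    # {1: 2, 2: 1}
```

Plotting the histogram keys against their values on logarithmic axes gives the kind of view shown in the figure.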

[Figure S3. Panels A and B: log-log plots of local clustering coefficient C(k) versus degree k for the supervised network and the unsupervised network, with separate series for target genes and TFs. Panel C: table of network parameters, reproduced below.]

Parameter                        Unsupervised network   Supervised network
Clustering Coefficient           0.58                   0.48
Network Diameter                 8                      8
Characteristic Path Length       2.82                   2.66
Average Number of Neighbors      46.86                  48.74
Number of Connected Components   4                      1

Figure S3: Structural properties of integrative networks. (A,B) Log-log plot of node degree (k) versus average clustering coefficient (C(k)) for the supervised and unsupervised integrative networks. Target genes show higher clustering coefficients as they are only connected to TFs, which themselves are highly interconnected, leading to high C(k) values for target genes. Target genes with the highest in-degrees (most heavily regulated) show decreased clustering coefficients, as their many regulators are less likely to be interconnected. In contrast, TFs are connected to both TFs and target-only genes (which themselves are not interconnected), resulting in overall lower coefficients. However, the same trend remains, where the most heavily interconnected TFs show decreased clustering coefficients, as a higher proportion of their targets include predominantly target-only genes, which are not interconnected among themselves. (C) Clustering coefficients, network diameters, characteristic path lengths, and number of connected components of the integrative networks.
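To make the quantity C(k) concrete, the sketch below computes the local clustering coefficient of each node and averages it over nodes of equal degree. It assumes the network is treated as undirected and uses the standard definition (fraction of neighbor pairs that are themselves connected); the text does not state which tool or convention produced the figure, so this is an illustration rather than the authors' procedure.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def build_neighbors(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected neighbor sets; edge direction and self-loops are ignored."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


def local_clustering(neighbors: Dict[str, Set[str]], node: str) -> float:
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return 2.0 * links / (k * (k - 1))


def clustering_by_degree(neighbors: Dict[str, Set[str]]) -> Dict[int, float]:
    """Average local clustering coefficient C(k) for each degree k."""
    by_degree = defaultdict(list)
    for node, nbrs in neighbors.items():
        by_degree[len(nbrs)].append(local_clustering(neighbors, node))
    return {k: sum(v) / len(v) for k, v in sorted(by_degree.items())}


# Toy network: two cross-linked TFs share target g1; tf1 also regulates g2.
toy = build_neighbors([("tf1", "tf2"), ("tf1", "g1"), ("tf2", "g1"), ("tf1", "g2")])
c_of_k = clustering_by_degree(toy)
```

In the toy network, g1 is connected only to two TFs that are linked to each other, so its coefficient is 1, while the higher-degree tf1 has a coefficient of 1/3, mirroring the trend described in the caption. The pairwise loop is quadratic in the degree, so a dedicated graph library would be preferable for hubs with thousands of neighbors.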

[Figure S4. Panels A and B: tables of network motifs with description, illustration, fold enrichment, and Z-score. The motif illustrations are not reproduced here. Node legend: miRNA, transcription factor, target gene. Each motif has two rows of values, which the figure legend distinguishes as unsupervised network and supervised network.]

Panel A
Network motif                                                                            Fold enr.   Z-score
M1  Cross-regulating TFs co-targeting a miRNA (2 FFLs)                                   45.24       40.73
                                                                                         67.89       76.95
M2  Cross-regulating TFs with a feedback from a regulated miRNA (1 FBL, 1 FFL)           3.79        8.74
                                                                                         2.24        7.51
M3  Cross-regulating TFs co-targeted by a miRNA (2 FFLs)                                 1.95        5.91
                                                                                         1.94        8.76
M4  Feed-forward loop from a miRNA involving a TF in regulation of a target gene (1 FFL) 1.51        7.27
                                                                                         1.48        11.01
M5  Feedback loop including a miRNA (1 FBL)                                              1.66        2.22
                                                                                         1.45        2.7
M6  Cross-regulating TFs co-targeting a miRNA with one feedback (3 FFLs, 1 FBL)          1.57        4.33
                                                                                         1.41        8.52
M7  Cross-regulating TFs co-targeted by a miRNA with one feedback (3 FFLs, 1 FBL)        1.39        1.47
                                                                                         1.34        1.33

Panel B
Network motif                                                       Fold enr.   Z-score
Cross-regulating TFs co-targeting another TF (Double FFL)           17.919      104.23
                                                                    23.5        238.43
Cross-regulatory clique of TFs (Six FFLs)                           2.891       10.65
                                                                    14.669      13.93
Cross-regulating TFs co-targeting a target gene (Double FFL)        1.594       69.01
                                                                    2.368       125.43
Cross-regulating TFs co-targeted by another TF (Double FFL)         1.989       23.72
                                                                    1.725       38.3
Cross-regulating TFs creating a feed-forward and a feedback loop    1.349       7.52
                                                                    1.439       16.55
Feedback loop between three TFs                                     1.537       3.24
                                                                    1.154       2.62

Figure S4: Network motifs of integrative networks. (A) Most frequent three-node motifs and their respective fold enrichment and Z-score occurring in the mixed (transcriptional and post-transcriptional) networks and (B) in the integrative networks alone.

[Figure S5. Four panels (A to D), arranged in two rows labeled "Unsupervised network" and "Supervised network", with column titles "Unannotated genes" and "Previously annotated genes". x-axis: FDR (0 to 0.5); y-axis: Fraction of genes annotated (log scale). Series: Expression, Network (unsupervised), Combined.]

Figure S5: Prediction of novel functional annotations. The number of newly predicted GO process annotations at different FDR cutoffs for genes that are already annotated with other functions (A,C) and for genes that have no previous annotations (B,D). Predictions based on the unsupervised integrative network are shown in the first row (A,B), predictions based on the supervised integrative network are shown in the second row (C,D).

[Figure S6. Panel A ("Regulators"): for each of the eight target genes (CG8290, Rox8, CG4497, Ranpb21, CG1518, CG9300, AGO1, CG13032), the top regulators and their regression coefficients. Panel B ("Time-course"): true and predicted log expression of each target gene (x-axis: Time, 0 to 30; y-axis: log expression), together with the expression profiles of its regulators.]

Figure S6: Expression prediction for eight representative target genes. (A) The top 5 regulators and the magnitude of the regression coefficients (CG13032 has only three regulators). (B) True and predicted expression profiles for the target genes across the developmental time course. The expression profiles of the 5 regulators are also shown.

[Figure S7. x-axis: Squared Error (0 to 20); y-axis: Proportion of genes (-0.04 to 0.1). Series: Integrative (unsupervised), Integrative (supervised), Conserved motifs, TF binding (ChIP).]

Figure S7: Squared errors for expression prediction using integrative and physical networks. Expression prediction was performed independently using the two integrative networks and the two physical networks based on TF binding (ChIP) and conserved motifs. For each of the four networks, we estimated the squared error between the true and predicted expression profiles, discretized the error into 20 bins, and estimated the proportion of genes with error less than the value corresponding to a particular bin. We did the same for 10 randomized networks, obtained an average proportion from randomized networks, and computed the difference between these two proportions (y-axis). In contrast to the two physical networks (ChIP and conserved motifs), the integrative networks show an excess of genes that are predicted accurately (low squared error) compared to predictions based on randomized networks.
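The procedure in the Figure S7 caption (cumulative proportion of genes below each error bin, averaged over randomized networks, then subtracted) can be sketched as follows. The function names, the equal-width bins, and the upper error limit of 20 are assumptions made for illustration; the caption does not specify how the 20 bins were placed.

```python
from typing import List, Sequence


def cumulative_proportions(errors: Sequence[float],
                           bin_edges: Sequence[float]) -> List[float]:
    """Proportion of genes whose squared error is below each bin edge."""
    n = len(errors)
    return [sum(e < edge for e in errors) / n for edge in bin_edges]


def excess_over_random(real_errors: Sequence[float],
                       random_error_sets: Sequence[Sequence[float]],
                       max_error: float = 20.0,
                       n_bins: int = 20) -> List[float]:
    """Real-network proportion minus the mean proportion over randomized networks."""
    # Equal-width bin edges up to max_error (an assumption, see above).
    bin_edges = [max_error * (i + 1) / n_bins for i in range(n_bins)]
    real = cumulative_proportions(real_errors, bin_edges)
    rand = [cumulative_proportions(errs, bin_edges) for errs in random_error_sets]
    rand_mean = [sum(col) / len(col) for col in zip(*rand)]
    return [r - m for r, m in zip(real, rand_mean)]


# Made-up per-gene squared errors for one real and two randomized networks.
real = [0.5, 0.5, 3.0, 10.0]
randomized = [[0.5, 3.0, 10.0, 15.0], [3.0, 3.0, 10.0, 15.0]]
excess = excess_over_random(real, randomized)
```

Positive values at the low-error bins correspond to the excess of accurately predicted genes described in the caption.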

[Figure S8. Two panels: "Unsupervised integrative network" and "Supervised integrative network". x-axis: Pearson CC; y-axis: Cumulative distribution function (0 to 1). Series: Predictable genes, Unpredictable genes.]

Figure S8: Correlation of predictable and unpredictable genes across two time courses. Distribution of Pearson's correlation of predictable and unpredictable genes across the two developmental time courses. Predictable genes showed a significantly higher correlation between the two datasets compared to the remaining genes (P-value < 1E-7), suggesting that predictable genes are also more reproducible.

[Figure S9. Schematic of a gene divided into five regions (1kb upstream, 5' UTR, coding sequence, 3' UTR, 1kb downstream), with the TSS and TES marked, and two example chromatin marks (Mark A, Mark B). Example binary feature vectors: 1 1 1 0 0 0 0 1 1 1.]

Figure S9: Definition of chromatin profiles. Chromatin profiles were defined by binning each gene into five regions and constructing binary vectors indicating the presence of each mark in these regions. The complete chromatin profile of a gene is obtained by concatenating the profiles of all marks for all samples (time points or cell lines) into a single binary vector.
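The construction in the Figure S9 caption amounts to one five-element binary vector per mark and sample, concatenated in a fixed order. The sketch below illustrates this under stated assumptions: the region labels, the `calls` dictionary, and the sample and mark names are hypothetical, and how the presence of a mark in a region is called is not covered. The region assignments for marks A and B are made up, chosen so that the output matches the ten-element example vector in the figure.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Five gene regions, in a fixed order (labels are illustrative).
REGIONS = ("upstream_1kb", "utr5", "cds", "utr3", "downstream_1kb")


def mark_vector(regions_with_mark: Set[str]) -> List[int]:
    """Binary vector with 1 where the mark is present in a region, else 0."""
    return [1 if region in regions_with_mark else 0 for region in REGIONS]


def chromatin_profile(calls: Dict[Tuple[str, str], Set[str]],
                      samples: Sequence[str],
                      marks: Sequence[str]) -> List[int]:
    """Concatenate the per-mark vectors of all samples into one binary vector."""
    profile: List[int] = []
    for sample in samples:
        for mark in marks:
            profile.extend(mark_vector(calls.get((sample, mark), set())))
    return profile


# Hypothetical calls for one gene in one sample.
calls = {
    ("sample1", "markA"): {"upstream_1kb", "utr5", "cds"},
    ("sample1", "markB"): {"cds", "utr3", "downstream_1kb"},
}
profile = chromatin_profile(calls, samples=["sample1"], marks=["markA", "markB"])
# profile == [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
```

Keeping the sample and mark order fixed across genes is what makes the resulting vectors comparable between genes.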

[Figure S10. Pie chart. Slice labels: TF binding (ChIP), Conserved motifs, Chromatin TC, Chromatin CL, Microarray TC, RNAseq TC. Slice values: 6%, 27%, 9%, 15%, 19%, 23%. The pairing of values with labels is not recoverable from the extracted text.]

Figure S10: Relative importance of input features for the unsupervised integrative network. The pie chart shows the relative importance of the different input features in the unsupervised integrative network, analogous to Fig. 2C for the supervised network.


Supplementary Tables

                                  10-fold cross-validation in time course                                       Hold-out datasets in cell lines
Network               #TFs        #Targets    #Pred. better  #Pred. worse  Enrichm.    Predictable   Unpredictable  Pred. genes   Pred. genes
                      with expr.  with expr.  than random    than random   vs. random  genes         genes          (RNAseq)      (Microarray)
Binding               72          10,248      1,707          1,642         1.04        656 (6%)      1,030 (10%)    296 (45%)
Motifs                101         9,065       1,055          977           1.08        411 (5%)      214 (2%)       215 (52%)
REDfly                74          82          13             3             3.91        12 (15%)      0 (0%)         4 (33%)
Integrative (sup.)    541         8,433       1,988          787           2.53        1,502 (18%)   323 (4%)       992 (66%)     843 (56%)
Integrative (unsup.)  582         10,088      2,446          1,202         2.03        1,711 (17%)   629 (6%)       1,138 (67%)   1,118 (65%)

Table S1: Performance for gene expression prediction. Shown are the number of TFs, targets, and predictable targets in the developmental time course and cell lines for the integrative, physical, and REDfly networks. The first two columns specify the number of TFs and targets that are in the expression datasets and used for predicting expression. The next four columns report cross-validation performance compared to randomized networks in the developmental time course. #Predictions better/worse than random specifies the average number of genes with significantly lower/higher error than randomized networks. The next two columns report the number of predictable genes and unpredictable genes, defined as genes whose expression we can predict better (or worse) than in 60% of the randomized networks. The last four columns report the number and percentage of predictable genes that remain predictable in the two cell-line hold-out datasets.
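One way to read the definition of predictable and unpredictable genes in the Table S1 caption is as a per-gene comparison of the prediction error against the errors obtained with each randomized network. The sketch below follows that reading; interpreting "in 60%" as "in at least 60%", the function name, and the example errors are assumptions.

```python
from typing import Sequence


def classify_gene(real_error: float,
                  random_errors: Sequence[float],
                  fraction: float = 0.6) -> str:
    """Label a gene by how often the real network beats randomized networks."""
    n = len(random_errors)
    better = sum(real_error < e for e in random_errors) / n  # lower error wins
    worse = sum(real_error > e for e in random_errors) / n
    if better >= fraction:
        return "predictable"
    if worse >= fraction:
        return "unpredictable"
    return "neither"


# Made-up errors from ten randomized networks.
random_errors = [0.2] * 7 + [0.05] * 3
label = classify_gene(0.1, random_errors)  # better in 7 of 10 cases
```

With the example values, the real-network error is lower than the randomized error in 7 of 10 cases, so the gene is labeled predictable.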


Supplementary Notes

S1 Network motifs for integrative and mixed regulatory networks

Results

We sought to reveal potentially meaningful network components by studying over-represented subgraphs known as network motifs [1] (see below for methodological details). Given the importance of microRNA (miRNA) regulation in feedback and repression in animal development, we complemented our transcriptional regulatory networks with post-transcriptional information on miRNA regulation. We included regulatory edges from miRNAs to their target genes from a previous study [3], where they were identified using evolutionarily conserved 7-mer miRNA seed motifs in three prime untranslated regions for 108 conserved microRNAs in 79 distinct families. Regulatory edges from TFs to miRNAs were identified based on binding of the TF in experimentally inferred promoter regions for the precursor transcripts of miRNAs [2,3]. The resulting combined pre- and post-transcriptional regulatory network contains 281 TF-to-miRNA edges and 599 miRNA-to-TF edges (out of a total of 6,141 miRNA-to-target edges).

We found 13 motifs in our mixed network that show significant fold enrichments compared to randomized networks (Figure S4), of which seven contain miRNA regulators. Four of the motifs in Figure S4A correspond to miRNA-mediated feedback loops (M2, M5, M6, M7), which may act as a noise buffer and prevent random switching events, as shown for c-Myc, E2F1, and miR-17-20 in human [5]. Three motifs consist only of feed-forward loops (FFLs) (M1, M3, M4). Among these, the FFL in motif M4 is initiated by a miRNA and demonstrates co-operation between a TF and a miRNA to regulate a common target gene; motif M4 was also found to be enriched in human and mouse regulatory networks [4]. Motif M3 corresponds to a double FFL that can provide a memory effect by converting transient signals of a common upstream regulator into states that persist after initial induction, which can be particularly important in developmental processes where decisions of cell fate occur [1,7] (indeed, enriched GO terms for M3 target nodes include developmental processes, p-value < 1E-10). Lastly, M1, which has two TFs cross-regulating each other and both targeting a common miRNA, was also significantly enriched in human regulatory networks [7].

Overall, the inferred regulatory networks of D. melanogaster are enriched in similar network motifs as mammalian regulatory networks, suggestive of robustness, feedback, and co-operation between the transcriptional and post-transcriptional layers.

Methods

In order to identify building blocks of the regulatory networks, we considered all possible directed triads (i.e., connected three-node sub-networks) of these networks. Among these triads, network motifs are the ones that appear significantly more often than in an ensemble of random networks with the same degree distribution as the original network (see Network randomization below). We used two different measures to assess the statistical significance of each triad, namely fold enrichment and z-score. We calculate the fold enrichment as the ratio of the percentage of each triad in the original network to its average percentage in the randomized networks. Formally, for triad i the fold enrichment is:

$$F_i = \frac{N \cdot P_{or}}{\sum_{j=1}^{N} P_j} \qquad (1)$$

where $P_{or}$ is the percentage of triad i in the original network, and $P_1$ to $P_N$ are the percentages of triad i in the N random networks. We compute these percentages and the z-scores for each triad using an existing network motif identification tool, FANMOD [6]. We sort all possible triads according to their fold enrichment values calculated using 100 random networks generated by FANMOD. The top triads of this list are the most important network motifs in the original network. We also report the z-score, the number of appearances and the standard deviation of each network motif to better understand the portion of the original network that is explained by these conserved connectivity patterns (see Fig. S4).
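As a worked illustration of Equation (1), the snippet below computes the fold enrichment of one triad from its percentage in the original network and its percentages in N random networks. The z-score formula is not spelled out in the text; the version here is the conventional one (difference from the random mean divided by the sample standard deviation) and should be read as an assumption, as should the function names and example numbers.

```python
from statistics import mean, stdev
from typing import Sequence


def fold_enrichment(p_original: float, p_random: Sequence[float]) -> float:
    """Equation (1): F_i = N * P_or / sum_j P_j, with N = len(p_random)."""
    return len(p_random) * p_original / sum(p_random)


def z_score(p_original: float, p_random: Sequence[float]) -> float:
    """Conventional z-score against the random ensemble (assumed definition)."""
    return (p_original - mean(p_random)) / stdev(p_random)


# Made-up percentages of one triad: original network and three random networks.
p_or = 4.0
p_rand = [1.0, 2.0, 3.0]
fold = fold_enrichment(p_or, p_rand)  # 3 * 4.0 / 6.0 = 2.0
z = z_score(p_or, p_rand)             # (4.0 - 2.0) / 1.0 = 2.0
```

A triad that never occurs in any random network would make the denominator zero, so a real analysis needs to handle that case explicitly.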

We searched for network motifs in two different settings for each inferred network. First we focused on regulatory motifs in the integrative networks without miRNAs. We represented TFs and target genes with different colors to distinguish connectivity patterns between TFs from those including target genes. We generated 100 random networks using FANMOD in full enumeration mode, with three exchanges per edge and at most three attempts for each edge exchange. Among the 20 possible triads we picked the top six with highest fold enrichment in the integrative regulatory networks (Supplementary File S6-Sheet 2). The set of top six motifs is the same for both the unsupervised and supervised integrative networks, and the motifs have a very similar ordering (Figure S4B).

To identify network motifs that include miRNAs, we colored miRNAs, TFs, and other genes with three different colors. Also, we distinguished between transcriptional and post-transcriptional regulations by assigning different colors to these different types of regulations. We used FANMOD to identify the motifs as described above. There are 66 possible triads with the node/edge coloring described above (Supplementary File S6-Sheet 1). We show the top seven motifs that contain at least one miRNA and that have the highest fold enrichment in the supervised integrative network (Figure S4A).

Network randomization. The random networks are generated from the original network by a series of edge switching operations (e.g., if there is an edge from a to b and an edge from c to d, then an edge exchange operation changes these to an edge from a to d plus an edge from c to b). Edges are only exchanged if their endpoints have the same node type, such as transcription factors or target genes. When randomizing the network, the edges are exchanged one after the other. We set the FANMOD options exchanges per edge and exchange attempts both to three, which means all the edges in the network are traversed three times and each time at most three attempts are made to exchange an edge with another edge.
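The edge switching scheme can be sketched in a few lines. This is not FANMOD's implementation: the function name and signature are hypothetical, the partner edge is drawn uniformly at random, and the rules of rejecting swaps that would create duplicate edges or new self-loops are added assumptions. The sketch also assumes the input contains no duplicate edges. Swapping only between edges whose sources share a type and whose targets share a type keeps every node's in-degree and out-degree unchanged.

```python
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def randomize(edges: List[Edge],
              node_type: Dict[str, str],
              exchanges_per_edge: int = 3,
              attempts: int = 3,
              seed: int = 0) -> List[Edge]:
    """Degree-preserving randomization by repeated edge exchanges."""
    rng = random.Random(seed)
    edges = list(edges)
    present: Set[Edge] = set(edges)
    for _ in range(exchanges_per_edge):      # traverse all edges this many times
        for i in range(len(edges)):
            for _ in range(attempts):        # limited attempts per edge
                j = rng.randrange(len(edges))
                a, b = edges[i]
                c, d = edges[j]
                # Only exchange between endpoints of the same node type.
                if (i == j or node_type[a] != node_type[c]
                        or node_type[b] != node_type[d]):
                    continue
                new1, new2 = (a, d), (c, b)
                # Assumption: reject self-loops and already existing edges.
                if a == d or c == b or new1 in present or new2 in present:
                    continue
                present -= {(a, b), (c, d)}
                present |= {new1, new2}
                edges[i], edges[j] = new1, new2
                break
    return edges


# Toy network with three TFs and four target genes.
types = {"t1": "tf", "t2": "tf", "t3": "tf",
         "g1": "gene", "g2": "gene", "g3": "gene", "g4": "gene"}
original = [("t1", "g1"), ("t1", "g2"), ("t2", "g2"),
            ("t2", "g3"), ("t3", "g4"), ("t3", "g1")]
shuffled = randomize(original, types, seed=1)
```

Calling `randomize` with different seeds yields an ensemble of random networks against which triad percentages can be compared.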


References

[1] U. Alon. Network motifs: theory and experimental approaches. Nat Rev Genet, 8:450–461, 2007.

[2] B. Graveley, A. Brooks, J. Carlson, M. Duff, J. Landolin, et al. The developmental transcriptome of Drosophila melanogaster. Nature, 471:473–479, 2010.

[3] S. Roy, J. Ernst, P. Kharchenko, P. Kheradpour, N. Negre, et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330:1787–1797, 2010.

[4] R. Shalgi, D. Lieber, M. Oren, and Y. Pilpel. Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Comput Biol, 3:e131, 2007.

[5] J. Tsang, J. Zhu, and A. van Oudenaarden. MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Mol Cell, 26:753–767, 2007.

[6] S. Wernicke and F. Rasche. FANMOD: a tool for fast network motif detection. Bioinformatics, 22:1152–1153, 2006.

[7] X. Yu, J. Lin, D. Zack, J. Mendell, and J. Qian. Analysis of regulatory network topology reveals functionally distinct classes of microRNAs. Nucl Acids Res, 36:6494–6503, 2008.
