supplementary information predictive regulatory models in...
TRANSCRIPT
Supplementary Information
Predictive regulatory models in Drosophila melanogaster byintegrative inference of transcriptional networks
Daniel Marbach♣, Sushmita Roy♣, Ferhat Ay, Patrick E. Meyer, Tamer Kahveci,Christopher A. Bristow, and Manolis Kellis
Companion Website
compbio.mit.edu/flynet Data and source code to reproduce networks and validation results(also included in Supplementary File 9).
Supplementary Figures
S1 Fraction of integrative network edges with and without physical support . . . . . 4
S2 Degree distributions of integrative networks . . . . . . . . . . . . . . . . . . . . . 5
S3 Structural properties of integrative networks . . . . . . . . . . . . . . . . . . . . . 6
S4 Network motifs of integrative networks . . . . . . . . . . . . . . . . . . . . . . . . 7
S5 Prediction of novel functional annotations . . . . . . . . . . . . . . . . . . . . . . 8
S6 Expression prediction for eight representative target genes . . . . . . . . . . . . . 9
S7 Squared errors for expression prediction using integrative and physical networks . 10
S8 Correlation of predictable and unpredictable genes across two time courses . . . . 11
S9 Definition of chromatin profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
S10 Relative importance of input features for the unsupervised integrative network . 13
Supplementary Tables
S1 Performance for gene expressions prediction . . . . . . . . . . . . . . . . . . . . . 14
Supplementary Notes
S1 Network motifs for integrative and mixed regulatory networks . . . . . . . . . . . 15
1
Supplementary Files
S1 integrative regulatory networks.zip This zip file contains two tab-separated files
representing the unsupervised and supervised integrative networks. The files have three
columns: the first column is the name of the regulator, the second column is the name of
the target gene, and the third column is the score of the edge connecting the regulator
and the target. Genes are given by their FBgn numbers.
S2 mixed regulatory networks.zip The mixed transcriptional and post-transcriptional
(TFs and microRNA) regulatory networks used for the network motif analysis described
in Supplementary Note S1.
S3 struct properties sup network.zip Sheet 1: Summary of the number of nodes and
edges in the supervised integrative network and the mixed regulatory network. Sheet 2:
Transcription factors in the supervised integrative and their in/out-degrees. Sheet 3:
miRNAs in the mixed regulatory network and their in/out-degrees. Sheet 4: In-degrees
of target genes and transcription factors of the supervised integrative network.
S4 struct properties unsup network.zip Sheet 1: Summary of the number of nodes and
edges in the unsupervised integrative network and the mixed regulatory network. Sheet
2: Transcription factors in the unsupervised integrative network and their
in/out-degrees. Sheet 3: miRNAs in the mixed regulatory network and their
in/out-degrees. Sheet 4: In-degrees of target genes and transcription factors of the
unsupervised integrative network.
S5 GO enrichment by node degree.xls GO biological process enrichments for high
(> 150) and low (= 1) in degree gene sets.
S6 triads.xlsx Sheet 1: List of all possible triads (subnetworks with only 3 nodes) for
mixed regulatory networks. Sheet 2: List of all possible triads (subnetworks with only 3
nodes) for the integrative networks.
S7 gene function prediction.zip This archive has two files: One tgz file and sub-directories
for the supervised and one for the unsupervised integrative network. Within each
sub-directory is a file per GO process. Each line has four columns: first column is the
name of the process, second column is the FBgn number of a gene, third column is the
FDR, and fourth column is a label to identify the type of prediction. Specifically the
labels take the values NEW (previously unannotated gene), KNOWN PRED (additional
new prediction for a previously annotated gene), KNOWN HIT (previously annotated
with the same function, that is, true positive). Only predictions that satisfy an FDR of
2
0.6 or lower and that are not in the FALSE set are included.
S8 ImaGO enrichment for predicted GO terms.xls ImaGO enrichments for predicted
GO terms for both unannotated genes and genes with previous annotations.
S9 datasets and source code.zip Data and source code to reproduce networks and
validation results.
3
Supplementary Figures
0 0.05 0.1 0.15 0.2 0.25 0.3
Supervised integrative network
Fraction of edges considered0 0.05 0.1 0.15 0.2 0.25 0.3
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Unsupervised integrative network
Frac
tion
RED
fly e
dges
reco
vere
d
Fraction of edges considered
TotalNot physical
Motif only
Motif & ChIP
ChIP only
Total
Not physical
Motif only
Motif & ChIP
ChIP only
Figure S1: Fraction of integrative network edges with and without physical supportRecovery of known REDfly edges at different thresholds for the two integrative networks. Theblack curve (total) shows the fraction of the 204 REDfly edges recovered versus the fraction ofpredicted interactions considered for the integrative networks (cf. Fig. 2B). The area betweenthe colored curves shows the fraction of these recovered edges that are supported by ChIP only,ChIP and conserved motifs, conserved motifs only, and functional datasets only (chromatin andexpression). The majority of recovered edges are supported by ChIP and/or conserved motifs,especially the edges at the top of the prediction lists. However, at lower ranks some REDflyedges are also recovered based on functional data alone.
4
1
10
100
1000
10000
1 10 100
Freq
uenc
y
In-degree
In-degree distributions
Freq
uenc
y
Out-degree
Out-degree distributions
Integrative (supervised)Integrative (unsupervised)
Integrative (supervised)Integrative (unsupervised)
0.001
0.01
0.1
1
10
100
1 10 100 1000
Figure S2: Degree distributions of integrative networksNumber of targets per TF (out-degree) and number of regulators per target (in-degree) for thesupervised and unsupervised integrative networks, plotted on a log-log scale. Both networksshow roughly a scale-free distribution for both in-degrees and out-degrees, with a small numberof regulators showing many targets, and a small number of genes targeted by many regulators.
5
A B
Paramater Unsupervised network Supervised networkClustering Coefficient 0.58 0.48Network Diameter 8 8Characteristic Path Length 2.82 2.66Average Number of Neighbors 46.86 48.74Number of Connected Components 4 1
C
0.0001
0.001
0.01
0.1
1
1 10 100 1000 10000
C(k
) (lo
cal c
lust
erin
g co
effic
ient
)
k (degree)
Target GenesTFs
Target GenesTFs
0.0001
0.001
0.01
0.1
1
1 10 100 1000 10000
C(k
) (lo
cal c
lust
erin
g co
effic
ient
)
k (degree)
Supervised network Unsupervised network
Figure S3: Structural properties of integrative networks(A,B) Log-log plot of node degree (k) versus average clustering coefficient (C(k)) for the super-vised and unsupervised integrative networks. Target genes show higher clustering coefficients asthey are only connected to TFs, which themselves are highly interconnected leading to high C(k)values for target genes. Target genes with the highest in-degrees (most heavily regulated) showdecreased clustering coefficients, as their many regulators are less likely to be interconnected.In contrast, TFs are connected to both TFs and target-only genes (which themselves are notinterconnected), resulting in overall lower coefficients. However, the same trend remains, wherethe most heavily-interconnected TFs show decreased clustering coefficients, as a higher pro-portion of their targets include predominantly target-only genes, which are not interconnectedamong themselves. (C) Clustering coefficients, networks diameters, characteristic path lengths,and number of connected components of the integrative networks.
6
A BNetwork Motif Statistical Significance
Description Illustration Fold Enr. Z-score
Cross-regulating TFsco-targeting a miRNA 45.24 40.73(2 FFLs)
67.89 76.95
Cross-regulating TFswith a feedback from 3.79 8.74a regulated miRNA(1 FBL, 1 FFL) 2.24 7.51
Cross-regulating TFsco-targeted by a miRNA 1.95 5.91(2 FFLs)
1.94 8.76
Feed-forward loop from amiRNA involving a TF in 1.51 7.27regulation of a target gene(1 FFL) 1.48 11.01
Feedback loop includinga miRNA 1.66 2.22(1 FBL)
1.45 2.7
Cross-regulating TFsco-targeting a miRNA 1.57 4.33with one feedback(3 FFLs, 1 FBL) 1.41 8.52
Cross-regulating TFsco-targeted by a miRNA 1.39 1.47with one feedback(3 FFLs, 1 FBL) 1.34 1.33
Network Motif Statistical Significance
Description Illustration Fold Enr. Z-score
Cross-regulating TFsco-targeting another TF 17.919 104.23(Double FFL)
23.5 238.43
Cross-regulatoryclique of TFs 2.891 10.65(Six FFLs)
14.669 13.93
Cross-regulating TFsco-targeting a target gene 1.594 69.01(Double FFL)
2.368 125.43
Cross-regulating TFsco-targeted by another TF 1.989 23.72(Double FFL)
1.725 38.3
Cross-regulating TFscreating a feed-forward 1.349 7.52and a feedback loop
1.439 16.55
Feedback loop between 1.537 3.24three TFs
1.154 2.62
Unsupervised networkSupervised network
miRNATranscription factorTarget gene
Figure S4: Network motifs of integrative networks(A) Most frequent three-node motifs and their respective fold enrichment and Z-score occurringin the mixed (transcriptional and post-transcriptional) and (B) in the integrative networksalone.
7
A B
0.001
0.01
0.1
1
0.1 0.2 0.3 0.4 0.50FDR
0.1 0.2 0.3 0.4 0.50FDR
0.1 0.2 0.3 0.4 0.50FDR
0.1 0.2 0.3 0.4 0.50FDR
C D
Frac
tion
of g
enes
ann
otat
ed 0.001
0.01
0.1
1
Frac
tion
of g
enes
ann
otat
ed
0.0001
0.001
0.01
0.1
1
Frac
tion
of g
enes
ann
otat
ed
0.001
0.01
0.1
1
Frac
tion
of g
enes
ann
otat
ed
Unannotated genesPreviously annotated genes
Unannotated genesPreviously annotated genes
Uns
upve
rvis
ed n
etw
ork
ExpressionNetwork (unsupervised)Combined
Supv
ervi
sed
netw
ork
ExpressionNetwork (unsupervised)Combined
ExpressionNetwork (unsupervised)Combined
ExpressionNetwork (unsupervised)Combined
Figure S5: Prediction of novel functional annotationsThe number of newly predicted GO process annotations at different FDR cutoffs for genesthat are already annotated with other functions (A,C) and for genes that have no previousannotations (B,D). Predictions based on the unsupervised integrative network are shown in thefirst row (A,B), predictions based on the supervised integrative network are shown in the secondrow (C,D).
8
A Regulators Time-courseB
−1
0
1
2
log
expr
essi
on
Su(var)2−HP2MedCG18011CG11902chmTruePredicted
−1
0
1
2
log
expr
essi
on
aopCG2116CG8944crpCG17802
aop
CG2116
CG8944
crp
CG17802
TruePredicted
−2
−1
0
1
2
log
expr
essi
on
CG2116mxcCG14965CG17359lid
CG2116
mxc
CG14965
CG17359
lid
TruePredicted
−1
0
1
2
log
expr
essi
on
osaCG1845CG1792CG1620Ssrp
CG4360
pita
CG9705
Meics
row
TruePredicted
CG4360pitaCG9705MeicsrowTruePredicted
2
−1
0
1
log
expr
essi
on
−2
−1
0
1
2
log
expr
essi
on
psqphoEcRexdCG17568
psq
pho
Ecr
exd
CG17568
TruePredicted
−1
0
252015Time
1050 30
1
2
log
expr
essi
on
CG7839CG2875CG10565CG12744Cenp−C
CG7839
CG2875
CG10565
CG12744
Cenp−C
TruePredicted
Su(var)2−HP2
Med
CG18011
CG11902
chm
0.089
0.081
0.078
0.077
0.074
CG8290
Rox8
0.166
0.092
0.086
-0.083
-0.080
-0.485
0.464
-0.454
0.424
0.36
CG4497
0.214
-0.212
0.181
0.172
0.164
Ranpb21
osa
CG1845
CG1792
CG1620
Ssrp
0.259
0.254
-0.253
0.234
0.217
CG1518
0.323
0.126
0.122
0.119
0.111
0.215
0.198
0.176
0.146
-0.145
CG9300
AGO1
−2
−1
0
1
log
expr
essi
on
CG7045BEAF−32Myb
CG7045
BEAF−32
Myb
TruePredicted0.891
-0.312
0.252
CG13032
Figure S6: Expression prediction for eight representative target genes(A) The top 5 regulators and the magnitude of the regression coefficients (CG13032 has onlythree regulators). (B) True and predicted expression profiles for the target genes across thedevelopmental time course. The expression profiles of the 5 regulators are also shown.
9
0 5 10 15 20−0.04
−0.02
0
0.02
0.04
0.06
0.08
0.1
Squared Error
Prop
ortio
n of
gen
es
Integrative (unsupervised)Integrative (supervised)Conserved motifsTF binding (ChIP)
Figure S7: Squared errors for expression prediction using integrative and physicalnetworksExpression prediction was performed independently using the two integrative networks and thetwo physical networks based on TF binding (ChIP) and conserved motifs. For each of the fournetworks, we estimated the squared error between the true and predicted expression profiles,discretized the error into 20 bins, and estimated the proportion of genes with error less than thevalue corresponding to a particular bin. We did the same for 10 randomized networks, obtainedan average proportion from randomized networks, and computed the difference between thesetwo proportions (y-axis). In contrast to the two physical networks (ChIP and conserved motifs),the integrative networks show an excess of genes that are predicted accurately (low squarederror) compared to predictions based on randomized networks.
10
0
0.2
0.4
0.6
0.8
1
Pearson CC Pearson CC
Cum
ulat
ive
dist
ribut
ion
func
tion
Unsupervised integrative network Supervised integrative network
Predictable genesUnpredictable genes
0-0.5 0.5 1
-0.6 −0.2 0.2 0.6 1
Figure S8: Correlation of predictable and unpredictable genes across two timecoursesDistribution of Pearsons correlation of predictable and unpredictable genes across the two devel-opmental time courses. Predictable genes showed a significantly higher correlation between thetwo datasets compared to the remaining genes (P-value < 1E-7), suggesting that predictablegenes are also more reproducible.
11
1kb upstream TSS 5’ UTR Coding sequence 3’ UTR
TSS
1kb downstream TES
1 1 1 0 0 0 0 1 1 1
Binary feature vectors
Mark AMark B
Chromatin mark A Chromatin mark B
Figure S9: Definition of chromatin profilesChromatin profiles were defined by binning each gene into five regions and constructing binaryvectors indicating the presence of each mark in these regions. The complete chromatin profileof a gene is obtained by concatenating the profiles of all marks for samples (time points or celllines) into a single binary vector.
12
6%
27%
9%15%
19%
23%
TF binding (ChIP)
Conserved motifs
Chromatin TC
Chromatin CL
Microarray TC
RNAseq TC
Figure S10: Relative importance of input features for the unsupervised integrativenetworkThe pie chart shows the relative importance of the different input features in the unsupervisedintegrative network, analogous to Fig. 2C for the supervised network.
13
Supplementary Tables
10-fold cross-validation in time course Hold-out datasets in cell lines
Network #TFs with expr.
#Targets with expr.
#Pred. better than random
#Pred. worse than random
Enrichm. vs. random
Predictable genes
Unpredictable genes
Pred. genes(RNAseq)
Pred. genes(Mircroarray)
Binding 72 10,248 1,707 1,642 1.04 656 6% 1,030 10% 296 45% Motifs 101 9,065 1,055 977 1.08 411 5% 214 2% 215 52% REDfly 74 82 13 3 3.91 12 15% 0 0% 4 33% Integrative (sup.) 541 8,433 1,988 787 2.53 1,502 18% 323 4% 992 66%
Integrative (unsup.) 582 10,088 2,446 1,202 2.03 1,711 17% 629 6% 1,138 67%
843 56%
1,118 65%
Table S1: Performance for gene expressions predictionShown are the number of TFs, targets, and predictable targets in the developmental timecourse and cell lines for the integrative, physical, and REDfly networks. The first two columnsspecify the number of TFs and targets that are in the expression datasets and used for predictingexpression. The next four columns report cross-validation performance compared to randomizednetworks in the developmental time course. #Predictions better/worse than random specifiesthe average number of genes with significantly lower/higher error than randomized networks.The next two columns report the number of predictable genes and unpredictable genes, definedby genes that we can predict expression better (or worse) in 60% of the randomized networks.The last four columns report the number and percentage of predictable genes that remainpredictable in the two cell-line hold-out datasets.
14
Supplementary Notes
S1 Network motifs for integrative and mixed regulatory networks
Results
We sought to reveal potentially meaningful network components by studying over-representedsubgraphs known as network motifs1 (see below for methodological details). Given the impor-tance of microRNA (miRNA) regulation in feedback and repression in animal development, wecomplemented our transcriptional regulatory networks with post-transcriptional information onmiRNA regulation. We included regulatory edges from miRNAs to their target genes from aprevious study,3 where they were identified using evolutionary-conserved 7-mer miRNA seedmotifs in three prime untranslated regions for 108 conserved microRNAs in 79 distinct fam-ilies. Regulatory edges from TFs to miRNAs were identified based on binding of the TF inexperimentally-inferred promoter regions for the precursor transcripts of miRNAs.2,3 The re-sulting combined pre- and post-transcriptional regulatory network contains 281 TF-to-miRNAedges and 599 miRNA-to-TF edges (out of a total of 6,141 miRNA-to-target edges).
We found 13 motifs in our mixed network that show significant fold enrichments compared torandomized networks (Figure S4), of which seven contain miRNA regulators. Four of themotifs in Figure S4A correspond to miRNA-mediated feedback loops (M2, M5, M6, M7),which may act as a noise buffer and prevent random switching events, as shown for c-Myc,E2F1, and miR-17-20 in human.5 Three motifs constitute of only feed-forward loops (FFLs)(M1, M3, M4). Among these, the FFL in motif M4 is initiated by a miRNA and demonstratesco-operation between a TF and a miRNA to regulate a common target gene; motif M4 wasalso found to be enriched in human and mouse regulatory networks.4 Motif M3 correspondsto a double-FFL that can provide memory effect by converting transient signals of a commonupstream regulator into states that persist after initial induction, which can be particularlyimportant in developmental processes where decisions of cell fate occur1,7 (indeed, enrichedGO terms for M3 target nodes include developmental processes, p-value < 10−10). Lastly, M1,which has two TFs cross-regulating each other and both targeting a common miRNA, was alsosignificantly enriched in human regulatory networks.7
Overall, the inferred regulatory networks of D. melanogaster are enriched in similar networkmotifs as mammalian regulatory networks, suggestive of robustness, feedback, and co-operationbetween the transcriptional and post-transcriptional layers.
Methods
In order to identify building blocks of the regulatory networks, we considered all possible directedtriads (i.e., connected three node sub-networks) of these networks. Among these triads, networkmotifs are the ones that appear significantly more compared to an ensemble of random networkswith the same degree distribution as the original network (see Network randomization below).We used two different measures to assess the statistical significance of each triad, namely foldenrichment and z-score. We calculate the fold enrichment as the ratio of the percentage of each
15
triad in the original network to its average percentage in the randomized networks. Formally,for triad i the fold enrichment:
Fi =N ∗ Por∑N
j=1 Pj
(1)
where Por is the percentage of triad i in original network, and P1 to PN are the percentagesof triad i in N random networks. We compute these percentages and the z-scores for eachtriad using an existing network motif identification tool, FANMOD.6 We sort all possible triadsaccording to their fold enrichment values calculated using 100 random networks generated byFANMOD. The top triads of this list are the most important network motifs in the originalnetwork. We also report the z-score, the number of appearances and the standard deviation ofeach network motif to better understand the portion of the original network that is explainedby these conserved connectivity patterns (see Fig. S4).
We searched for network motifs in two different settings for each inferred network. First wefocused on regulatory motifs in the integrative networks without miRNAs. We representedTFs and target genes with different colors to distinguish connectivity patterns between TFsfrom those including target genes. We generated 100 random networks using FANMOD in fullenumeration mode, with three exchanges per edge and at most three attempts for each edgeexchange. Among the 20 possible triads we picked the top six with highest fold enrichment inthe integrative regulatory networks (Supplementary File S6-Sheet 2). The set of top six motifsis the same for both the unsupervised and supervised integrative network, and the motifs havea very similar ordering (Figure S4B).
To identify network motifs that include miRNAs, we colored miRNAs, TFs, and other genes withthree different colors. Also, we distinguished between transcriptional and post-transcriptionalregulations by assigning different colors to these different types of regulations. We used FAN-MOD to identify the motifs as described above. There are 66 possible triads with the node/edgecoloring described above (Supplementary File S6-Sheet 1). We show the top seven motifs thatcontain at least one miRNA and that have the highest fold enrichment in the supervised inte-grative network (Figure S4A).
Network randomization. The random networks are generated from the original network bya series of edge switching operations (e.g. if there is and edge from a to b and c to d then anedge exchange operation changes these to an edge from a to d plus an edge from c to b). Edgesare only exchanged if their endpoints have the same node type such as transcription factorsor target genes. When randomizing the network, the edges are exchanged one after the other.We set the FANMOD options exchanges per edge and exchange attempts both to three, whichmeans all the edges in the network are traversed three times and at each time at most threeattempts were made to exchange an edge with another edge.
16
References
[1] U. Alon. Network motifs: theory and experimental approaches. Nat Rev Genet, 8:450–461, 2007.
[2] B. Graveley, A. Brooks, J. Carlson, M. Duff, J. Landolin, and et al. The developmental transcriptomeof drosophila melanogaster. Nature, 471:473–479, 2010.
[3] S. Roy, J. Ernst, P. Kharchenko, P. Kheradpour, N. Negre, and et al. Identification of functionalelements and regulatory circuits by drosophila modencode. Science, 330:1787–97, 2010.
[4] R. Shalgi, D. Lieber, M. Oren, and Y. Pilpel. Global and local architecture of the mammalianmicrorna-transcription factor regulatory network. PLoS Comput Biol, 3:e131, 2007.
[5] J. Tsang, J. Zhu, and A. van Oudenaarden. Microrna-mediated feedback and feedforward loops arerecurrent network motifs in mammals. Mol Cell, 26:753–767, 2007.
[6] S. Wernicke and F. Rasche. Fanmod: a tool for fast network motif detection. Bioinformatics,22:1152–1153, 2006.
[7] X. Yu, J. Lin, D. Zack, J. Mendell, and J. Qian. Analysis of regulatory network topology revealsfunctionally distinct classes of micrornas. Nucl Acids Res, 36:6494–6503, 2008.
17