Large Scale Prediction of Transcription Factor Binding Sites for Gene Regulation using Cloud Computing
Zhengchang Su
Department of Bioinformatics and Genomics
The University of North Carolina at Charlotte
6-2-2011 at Cloud Futures Workshop 2011
Outline
Introduction to the problem
GleClubs: an algorithm for large scale prediction of transcription factor binding sites (TFBSs)
Parallelization of GleClubs on a distributed memory cluster using MPI
Porting GleClubs to the Windows Azure Platform
Structure of a bacterial genome
There are two types of functional sequences in a bacterial genome:
1. Coding sequences/genes: specify the cellular components (proteins and RNAs) of an organism; make up ~85% of the genome sequence.
2. Regulatory sequences: specify when, where, how fast, and how much of a gene's product should be produced; usually located in intergenic sequences, which make up ~15% of the genome sequence.
[Figure: a stretch of DNA showing genes separated by intergenic sequences that contain regulatory sequences]
The functions of regulatory sequences
Regulatory sequences control gene expression by interacting with transcription factors (TFs); thus, they are also called TF binding sites (TFBSs).
Adjacent genes in a bacterial genome are often organized in a
transcription unit called an operon.
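[Figure: structure of an operon. Labels: regulatory sequences (TFBSs) from about -300; promoter with -35 and -10 sites; TSS (+1); 5' UTR; ribosome binding site; 3' UTR; terminator; bound proteins TF1, σ, TF2, α, α, β, β. Transcription yields a polycistronic mRNA; translation yields proteins.]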
Explosion of genome sequence data
Since 1995, and particularly since 2007, the number of sequenced genomes has increased exponentially.
             Archaea  Bacteria  Eukaryota  Total
Complete         109      1486        154   1749
In pipeline      201      6045       2006   8252
As of 5-28-2011, http://www.genomesonline.org
A paramount goal of computational biology
One of the most challenging goals of computational biology is to understand the functions of an organism solely from its genome sequence.
The first step towards this goal is to know the parts list of its cells:
• Genes: encode cellular components for biological functions
• TFBSs: encode the control programs for making cellular components
The gene-finding problem is relatively well solved, so we know the vast majority of genes in almost every sequenced prokaryotic genome (more than 1,486 genomes).
The regulatory sequence-finding problem remains largely unsolved, so we know very little about the TFBSs in almost all sequenced genomes.
Examples of σ70 binding sites in the E. coli genome:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Causes for the difficulty of cis-regulatory sequence prediction
• cis-regulatory elements are short (6-25 bp) and degenerate;
• They are located in intergenic regions that are much longer;
• There is usually no base usage bias, so any sequence segment can potentially be a cis-regulatory element.
[Figure: examples of CRP binding sites in the E. coli genome: 150 known canonical CRP binding sites, 22 known A-rich CRP binding sites, and 26 known T-rich CRP binding sites]
A collection of similar binding sites of a TF is called a motif.
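To make the idea of a motif concrete, the sketch below counts the bases in each column of the six σ70 example sites listed earlier and reads off a consensus. The function names and the tie-breaking rule (ties go to the earlier letter of A, C, G, T) are illustrative assumptions, not part of GleClubs.

```python
from collections import Counter

# The six sigma-70 example sites from the slide.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def position_counts(sites):
    """Return one Counter of base counts per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]

def consensus(counts, alphabet="ACGT"):
    """Most frequent base per column; ties go to the earlier letter in `alphabet`."""
    return "".join(max(alphabet, key=lambda b: col[b]) for col in counts)

counts = position_counts(SITES)
for i, col in enumerate(counts, start=1):
    # Print the count of each base at alignment position i.
    print(i, {b: col[b] for b in "ACGT"})
print(consensus(counts))
```

Columns 2 and 6 are fully conserved in these six sites, while column 4 is split evenly between A and G, which illustrates why such sites are described as degenerate.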
The motif-finding problem
Since there are usually no fixed patterns of cis-regulatory
elements of a TF, a cis-regulatory element can only be
predicted by comparing a set of sequences that are likely to
contain the binding site of the same TF.
The problem of finding cis-regulatory elements in a given set
of sequences is called the motif-finding problem.
Numerous sequence-based motif-finding algorithms and tools
have been developed, and they are all based on the
assumption that binding sites of a TF are more conserved than
the flanking sequences.
Unfortunately, current motif-finding tools often return too many
false positives while still missing true binding sites.
Methods for finding a set of intergenic sequences for motif-finding
One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the regulatory sequences of orthologous genes are often conserved.
[Figure: an operon and a homologous operon from another genome; each shows TFBSs upstream of the genes, with positions -300, -35, -10, and +1 marked]
A typical phylogenetic footprinting procedure
[Figure: ortholog identification → intergenic regions (T.gi, G1.gi, G2.gi, ..., Gn.gi) → motif finding]
This procedure can be applied to each gene in the target
genome to predict TFBSs for all genes.
Similar putative TFBSs are then clustered to form a motif.
However, the prediction sensitivity and specificity of such an
approach are very low due to the low prediction accuracy of
motif-finding tools.
The predictions are biased to the target genome, so the
binding sites in reference genomes cannot be fully predicted.
Prediction of operons
An algorithm for operon prediction: JPOP (joint prediction of operons)
[Figure: gene pairs on the same strand are classified as operon pairs or non-operon pairs. Panel labels: homology mapping across multiple related genomes; conserved gene neighborhoods; a neural network trained on an E. coli data set.]
Features listed in the figure:
1. intergenic distances
2. predicted gene functions (COG)
3. phylogenetic profiles
4. k-mer frequency statistics in the intergenic region
X. Chen, Z. Su, et al. NAR, 2004, 32(7):2147-57
P. Dam, V. Olman, K. Harris, Z. Su, Y. Xu. NAR. 2007;35(1):288-98.
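As an illustration of the first listed feature, the sketch below computes intergenic distances for adjacent gene pairs on the same strand. The `Gene` record, the gene names, and the coordinates are hypothetical, and this is not the JPOP implementation; a real predictor would combine this feature with the others.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

def same_strand_pairs(genes):
    """Yield (left, right, intergenic_distance) for adjacent same-strand genes."""
    ordered = sorted(genes, key=lambda g: g.start)
    for left, right in zip(ordered, ordered[1:]):
        if left.strand == right.strand:
            # Number of bases between the two genes; negative means they overlap.
            yield left.name, right.name, right.start - left.end - 1

# Hypothetical gene coordinates, for illustration only.
genes = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 420, 900, "+"),
    Gene("geneC", 1200, 1500, "-"),
    Gene("geneD", 1490, 2000, "-"),
]
pairs = list(same_strand_pairs(genes))
print(pairs)
```

Pairs on opposite strands (geneB and geneC here) are skipped, since only same-strand neighbors are candidates for sharing an operon.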
Selection of a group of target genomes
[Figure: TF conservation tree for 139 γ-proteobacterial genomes, divided into Group A, Group B, Group C, Group D, and Group E]
Zhang S, Xu M, Li S, Su Z. Nucleic Acids Res. 2009 Jun;37(10):e72.
Extraction of inter-operonic sequences associated with each orthologous operon group
[Figure: genomes G1, G2, and G3 with genes g1 to g8 grouped into operons; labels a to g; inter-operonic sequences I1, I2, and I3]
Using a group of 32 γ-proteobacterial genomes (Group D)
including E. coli K12, we identified 4,103 orthologous operon
groups and the same number of inter-operonic sequence sets.
Identifying dense subgraphs as possible motifs
[Figure: the input motifs form the nodes of a motif similarity graph with edge weights w_{i,j}]
4,103 inter-operonic sequence sets × 40 motifs ≈ 160 × 10^3 nodes
(1.6 × 10^5 × 1.6 × 10^5) / 2 ≈ 13 × 10^9 possible edges
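The graph-size estimate can be checked with a few lines of arithmetic. The slide rounds the node count to 1.6 × 10^5 and uses n²/2; the sketch below uses the exact pair count n(n - 1)/2, so the result differs slightly but is of the same order (about 10^10).

```python
sequence_sets = 4103   # inter-operonic sequence sets (from the slide)
motifs_per_set = 40    # input motifs per sequence set (from the slide)

nodes = sequence_sets * motifs_per_set
# Exact number of unordered node pairs, i.e. possible undirected edges.
possible_edges = nodes * (nodes - 1) // 2

print(f"nodes = {nodes:,}")
print(f"possible edges = {possible_edges:.2e}")
```

A graph of this size is why the later steps cut the similarity graph into subgraphs before searching for dense regions.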
Two assumptions:
1. A true motif is more likely to be identified by different tools in
the same inter-operonic sequence set;
2. A true motif is more likely to be identified in different
inter-operonic sequence sets.
Flowchart of the motif clustering algorithm
[Figure: flowchart. Labels: input motifs from sequence sets s1, s2, ..., sn; G1 with a higher density; G2 with a lower density; MCL; clusters c1, c2, c3, c4, ..., ct; mapping the clusters to G1; inducing subgraphs in G1; G3; MCL]
Zhang S, Li S, Pham PT, Su Z.
BMC Bioinformatics. 2010;11:397
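The two graph operations named in the flowchart, building graphs of different density and inducing subgraphs in G1 from clusters, can be sketched as follows. This assumes that G1 and G2 are obtained from the same similarity scores with a lower and a higher cutoff, which the slides do not state; the motif names, scores, and cutoffs are made up, and the MCL clustering step itself is not implemented.

```python
def threshold_graph(weighted_edges, cutoff):
    """Keep only edges whose similarity weight is at least `cutoff`."""
    return {frozenset((u, v)) for u, v, w in weighted_edges if w >= cutoff}

def induced_subgraph(edges, nodes):
    """Return the edges that have both endpoints in `nodes`."""
    nodes = set(nodes)
    return {e for e in edges if e <= nodes}

# Hypothetical motif similarity scores w(i, j).
weighted = [("m1", "m2", 0.9), ("m2", "m3", 0.8), ("m1", "m3", 0.4),
            ("m3", "m4", 0.3), ("m4", "m5", 0.85)]

g1 = threshold_graph(weighted, cutoff=0.3)  # lower cutoff, higher density
g2 = threshold_graph(weighted, cutoff=0.7)  # higher cutoff, lower density

# Suppose a clustering step on G2 (MCL in the slides) returned this cluster.
cluster = {"m1", "m2", "m3"}
# Map the cluster back to G1 by inducing a subgraph on its nodes.
sub = induced_subgraph(g1, cluster)
print(len(g1), len(g2), sorted(sorted(e) for e in sub))
```

Under this assumption, the induced subgraph recovers the weaker m1 to m3 edge that the sparser graph G2 dropped.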
Flowchart of the motif clustering algorithm
[Figure: flowchart, second part. Labels: sequence sets s1, s2, ..., sn; motif-finding; find cliques associated with each node; find the quasi-cliques; Motif 1, Motif 2, Motif 3, Motif 4, Motif 5, ..., Motif m; rank motifs; sort by genomes; Genome 1, Genome 2, Genome 3, ..., Genome N]
[Equation: ClusterScore, expressed in terms of N, n, L, I, P_i, exp, and log]
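The "find the quasi-cliques" step in the flowchart can be illustrated with a density check. The sketch assumes the common density-based definition, in which a node set is a quasi-clique when at least a fraction `gamma` of all possible edges are present; the slides do not say which definition or threshold GleClubs uses, so `gamma` and the example graph are hypothetical.

```python
from itertools import combinations

def edge_density(nodes, edges):
    """Fraction of possible node pairs that are connected by an edge."""
    nodes = sorted(set(nodes))
    n = len(nodes)
    if n < 2:
        return 0.0
    present = sum(1 for u, v in combinations(nodes, 2)
                  if frozenset((u, v)) in edges)
    return present / (n * (n - 1) / 2)

def is_quasi_clique(nodes, edges, gamma=0.8):
    """True if the node set reaches the assumed density threshold `gamma`."""
    return edge_density(nodes, edges) >= gamma

# Hypothetical graph: four nodes, five of the six possible edges.
edges = {frozenset(p) for p in
         [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]}

print(edge_density({"a", "b", "c"}, edges))          # a full clique
print(is_quasi_clique({"a", "b", "c", "d"}, edges))  # 5 of 6 edges present
```

Relaxing a strict clique to a density threshold tolerates a few missing edges, which matters when similarity scores between true instances of the same motif are noisy.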
Parallelization of GleClubs on a cluster
Predict operons for each genome in each group: embarrassingly parallel computation
Divide genomes into multiple target groups: embarrassingly parallel computation
Predict orthologous operon groups and extract intergenic sequences: embarrassingly parallel computation
Predict multiple over-represented k-mer motifs in each intergenic sequence set: embarrassingly parallel computation
Compute similarity scores of motifs: to avoid the I/O bottleneck, each process reads an equal portion of the files and shares its data with all the other processes.
Construct the motif similarity graph and cut it into subgraphs: serial computation
Find and merge quasi-cliques: to balance workloads across processes, subgraphs are assigned to processes, and larger subgraphs are further split among processes using a master/slave model.
Predict TFBSs in each genome: embarrassingly parallel computation
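The two scheduling ideas above, giving each process an equal share of the input files and balancing subgraphs across processes, can be sketched without MPI. The round-robin split and the greedy largest-first assignment below are stand-in techniques chosen for illustration, not the authors' code; the file names and subgraph sizes are hypothetical, and the splitting of large subgraphs under a master/slave scheme is not shown.

```python
import heapq

def partition_files(files, n_procs):
    """Round-robin split so each process reads a near-equal share of files."""
    return [files[rank::n_procs] for rank in range(n_procs)]

def assign_subgraphs(sizes, n_procs):
    """Greedy largest-first: give each subgraph to the least-loaded process."""
    heap = [(0, rank) for rank in range(n_procs)]  # (current load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(n_procs)}
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)
        heapq.heappush(heap, (load + size, rank))
    return assignment

files = [f"motifs_{i}.txt" for i in range(7)]  # hypothetical file names
shares = partition_files(files, 3)

sizes = {"sg1": 90, "sg2": 40, "sg3": 35, "sg4": 20, "sg5": 10}  # hypothetical
plan = assign_subgraphs(sizes, 2)
print(shares)
print(plan)
```

In an MPI program, the share at index `rank` would be the portion read by that process before the data are exchanged among all processes.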
Porting GleClubs to the Windows Azure Platform
Predict operons for each genome in each group: implemented as a SOA application
Divide genomes into multiple groups: implemented as a SOA application
Predict orthologous operon groups and extract intergenic sequences: implemented as a SOA application
Predict over-represented k-mer motifs in each intergenic sequence set: implemented as a SOA application
Compute similarity scores of motifs: MPI, Hadoop, MapReduce, or others?
Construct the motif similarity graph and cut it into subgraphs: serial computation
Find and merge quasi-cliques: MPI, Hadoop, MapReduce, or others?
Predict TFBSs in each genome: implemented as a SOA application
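To show why the k-mer step in the pipeline above splits cleanly into independent tasks, the sketch below expresses k-mer counting as a map step per sequence and a reduce step that merges the counts. It is plain Python, not Hadoop or Azure code; the sequences are hypothetical, and raw counts stand in for a real over-representation statistic, which would also need a background model.

```python
from collections import Counter
from functools import reduce

def count_kmers(sequence, k):
    """Map step: count every overlapping k-mer in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def merge_counts(a, b):
    """Reduce step: add two k-mer count tables."""
    return a + b

# Hypothetical intergenic sequences from one sequence set.
sequences = ["TTATAATGC", "GCTATAATA", "CCGTATAAT"]

# Each count_kmers call is independent, so the map step could run anywhere.
total = reduce(merge_counts, (count_kmers(s, 6) for s in sequences), Counter())
top = total.most_common(1)[0]
print(top)
```

Because the map calls share no state, each one could run as a separate worker task, with only the small count tables sent back for merging.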
Acknowledgments
• Dr. Shaoqiang Zhang
• Dr. Xia Dong
• Peter Pham
• Shan Li
• Meng Niu
• Minli Xu
• Chen Xu
• Matthew Smith
Funding sources:
• Dr. Barry Wilkinson
• Ridhi Dua
UNC Charlotte
• Dr. Srinivas Akella
• Youjie Zhou
Thank you!
Questions?