Large Scale Prediction of Transcription Factor Binding Sites for Gene Regulation using Cloud Computing

Zhengchang Su

Department of Bioinformatics and Genomics

The University of North Carolina at Charlotte

6-2-2011 at Cloud Futures Workshop 2011

Outline

• Introduction to the problem

• GleClubs: an algorithm for large scale prediction of transcription factor binding sites (TFBSs)

• Parallelization of GleClubs on a distributed memory cluster using MPI

• Porting GleClubs to the Windows Azure Platform

Structure of a bacterial genome

There are two types of functional sequences in a bacterial genome:

1. Coding sequences/genes: specify the cellular components (proteins and RNAs) of an organism; they make up ~85% of the genome sequence.

2. Regulatory sequences: specify when, where, how fast, and how much of a gene's product should be produced; they are usually located in intergenic sequences, which make up ~15% of the genome sequence.

[Figure: DNA with genes separated by intergenic sequences that contain the regulatory sequences.]

The functions of regulatory sequences

Regulatory sequences control gene expression by interacting with a transcription factor (TF); they are therefore also called TF binding sites (TFBSs).

Adjacent genes in a bacterial genome are often organized in a transcription unit called an operon.

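[Figure: layout of an operon. Labels: regulatory sequences (TFBSs, positions down to -300) bound by TF1 and TF2; promoter (-35, -10) bound by σ; TSS (+1); 5' UTR; ribosome binding site; genes a and b; 3' UTR; terminator. Transcription produces a polycistronic mRNA; translation produces the proteins.]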

Explosion of genome sequence data

Since 1995, and particularly since 2007, the number of sequenced genomes has increased exponentially.

              Archaea  Bacteria  Eukaryota  Total
Complete          109      1486        154   1749
In pipeline       201      6045       2006   8252

As of 5-28-2011, http://www.genomesonline.org

A paramount goal of computational biology

One of the most challenging goals of computational biology is to understand the functions of an organism solely from its genome sequence.

The first step towards this goal is to know the parts list of its cells:

• Genes: encode the cellular components that carry out biological functions

• TFBSs: encode the control programs for making the cellular components

The gene-finding problem is relatively well solved, so we know the vast majority of genes in almost every sequenced prokaryotic genome (more than 1,486 genomes).

The regulatory sequence finding problem remains largely unsolved, so we know very little about the TFBSs in almost all sequenced genomes.

Examples of σ70 binding sites in the E. coli genome:

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Causes for the difficulty of cis-regulatory sequence prediction

• cis-regulatory elements are short (6-25 bases) and degenerate;

• They are located in intergenic regions that are much longer;

• There is usually no base usage bias, so any sequence segment can potentially be a cis-regulatory element.

[Figure: examples of CRP binding sites in the E. coli genome: 150 known canonical CRP binding sites, 22 known A-rich CRP binding sites, and 26 known T-rich CRP binding sites.]

A collection of similar binding sites of a TF is called a motif.
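As a minimal sketch of what "a collection of similar binding sites" looks like in code, the six σ70 example sites shown earlier can be aligned and summarized as per-position base counts and a consensus string. The function names `position_counts` and `consensus` are illustrative assumptions, and the tie-breaking rule is arbitrary; this is not how GleClubs represents motifs.

```python
from collections import Counter

# The six example sites listed on the sigma-70 slide (already aligned, equal length).
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def position_counts(sites):
    """Count each base at each aligned position (a position count matrix)."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]


def consensus(sites):
    """Most common base per position; ties go to the alphabetically first base."""
    letters = []
    for column in position_counts(sites):
        # sorted() fixes the iteration order so that max() breaks ties predictably.
        letters.append(max(sorted(column), key=lambda base: column[base]))
    return "".join(letters)


if __name__ == "__main__":
    for i, column in enumerate(position_counts(SITES), start=1):
        print(i, dict(column))
    print("consensus:", consensus(SITES))
```

The per-position counts make the degeneracy visible: some positions are shared by every site, while others are split between two or three bases.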

The motif-finding problem

Since the cis-regulatory elements of a TF usually have no fixed pattern, a cis-regulatory element can only be predicted by comparing a set of sequences that are likely to contain binding sites of the same TF.

The problem of finding cis-regulatory elements in a given set of sequences is called the motif-finding problem.

Numerous sequence-based motif-finding algorithms and tools have been developed, and they are all based on the assumption that the binding sites of a TF are more conserved than the flanking sequences.

Unfortunately, current motif-finding tools often return too many false positives while still missing true binding sites.
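To make the motif-finding problem concrete, here is a deliberately naive sketch: it counts, for every k-mer, how many input sequences contain it, and reports the k-mers shared by the most sequences. The toy sequences and the function names are assumptions for illustration only. Real motif finders must tolerate mismatches because sites are degenerate; this exact-match version does not, which is one reason such simple counting misses true sites.

```python
from collections import defaultdict


def kmer_support(sequences, k):
    """Map each k-mer to the number of input sequences that contain it."""
    support = defaultdict(int)
    for seq in sequences:
        # Use a set so a k-mer repeated within one sequence is counted once.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in seen:
            support[kmer] += 1
    return dict(support)


def most_shared_kmers(sequences, k):
    """Return (sorted k-mers with the highest support, that support value)."""
    support = kmer_support(sequences, k)
    top = max(support.values())
    return sorted(kmer for kmer, n in support.items() if n == top), top


if __name__ == "__main__":
    # Hypothetical intergenic fragments, each with one planted TATAAT site.
    toy = ["GGCTATAATCC", "ATTATAATGCG", "CCGTATAATAA"]
    print(most_shared_kmers(toy, 6))
```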

Methods for finding a set of intergenic sequences for motif-finding

One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the regulatory sequences of orthologous genes are often conserved.

[Figure: an operon and a homologous operon from another genome, each showing its genes and upstream TFBSs at positions -300, -35, and -10 relative to +1.]

A typical phylogenetic footprinting procedure

[Figure: orthologs identification, intergenic regions (T.gi, G1.gi, G2.gi, ..., Gn.gi), motif finding.]

This procedure can be applied to each gene in the target genome to predict TFBSs for all genes.

Similar putative TFBSs are then clustered to form a motif.

However, the prediction sensitivity and specificity of such an approach are very low because of the low prediction accuracy of motif-finding tools.

The predictions are biased toward the target genome, so the binding sites in the reference genomes cannot be fully predicted.

Prediction of operons

An algorithm for operon prediction: JPOP (joint prediction of operons)

[Figure: gene pairs on the same strand are labeled as operon pairs or non-operon pairs. Other labels: homology mapping across multiple related genomes; conserved gene neighborhoods; a neural network trained on an E. coli data set.]

Features listed:

1. intergenic distances

2. predicted gene functions (COG)

3. phylogenetic profiles

4. k-mer frequency statistics in the intergenic region

X. Chen, Z. Su, et al. NAR. 2004;32(7):2147-57.

P. Dam, V. Olman, K. Harris, Z. Su, Y. Xu. NAR. 2007;35(1):288-98.
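The first listed feature, the intergenic distance between adjacent genes on the same strand, is simple to compute. The sketch below is an illustration under stated assumptions: the `Gene` record, its 0-based half-open coordinates, and the function name are hypothetical, and nothing here reproduces the JPOP code or its classifier.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # 0-based, exclusive (assumed convention)
    strand: str  # "+" or "-"


def same_strand_pairs(genes: List[Gene]) -> List[Tuple[str, str, int]]:
    """List adjacent gene pairs on the same strand with their intergenic distance.

    A negative distance means the two genes overlap.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        if left.strand == right.strand:
            pairs.append((left.name, right.name, right.start - left.end))
    return pairs


if __name__ == "__main__":
    genes = [
        Gene("a", 0, 100, "+"),
        Gene("b", 110, 200, "+"),
        Gene("c", 400, 500, "-"),
        Gene("d", 520, 600, "-"),
    ]
    print(same_strand_pairs(genes))
```

Pairs such as b and c, which lie on opposite strands, are skipped because they cannot belong to the same operon.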

Selection of a group of target genomes

[Figure: TF conservation tree for 139 γ-proteobacterial genomes, divided into Group A, Group B, Group C, Group D, and Group E.]

Zhang S, Xu M, Li S, Su Z. Nucleic Acids Res. 2009 Jun;37(10):e72.

Extraction of inter-operonic sequences associated with each orthologous operon group

[Figure: genomes G1, G2, and G3 with genes g1 to g8 arranged in operons; the inter-operonic sequences I1, I2, and I3 lie upstream of orthologous operons; panels a to g.]

Using a group of 32 γ-proteobacterial genomes (Group D) including E. coli K12, we identified 4,103 orthologous operon groups and the same number of inter-operonic sequence sets.
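A rough sketch of the extraction step, assuming operons are given as 0-based half-open `(start, end)` intervals on a linear genome string. These conventions and the function name are assumptions; the sketch ignores strand orientation and genome circularity, both of which a real pipeline would need to handle.

```python
def inter_operonic_sequences(genome, operons):
    """Return, for each operon in coordinate order, the sequence between it
    and the previous operon (or the genome start for the first operon).

    operons: iterable of (start, end) pairs, 0-based, end exclusive.
    An empty string is returned when two operons touch or overlap.
    """
    regions = []
    previous_end = 0
    for start, end in sorted(operons):
        if start > previous_end:
            regions.append(genome[previous_end:start])
        else:
            regions.append("")
        previous_end = max(previous_end, end)
    return regions


if __name__ == "__main__":
    toy_genome = "AAAACCCCGGGGTTTTAAAA"
    toy_operons = [(4, 8), (12, 16)]
    print(inter_operonic_sequences(toy_genome, toy_operons))
```

Grouping the extracted regions by orthologous operon group across genomes would then give one inter-operonic sequence set per group.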

Identifying dense subgraphs as possible motifs

[Figure: the input motifs are the nodes of a motif similarity graph; edge weights are w_{i,j}.]

4,103 inter-operonic sequence sets × 40 motifs ≈ 160 × 10^3 nodes

(1.6 × 10^5 × 1.6 × 10^5) / 2 ≈ 13 × 10^9 possible edges
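The size estimate can be checked with exact integer arithmetic. The sketch below uses the exact pair count n(n - 1)/2 rather than the rounded n²/2; the function name is an assumption.

```python
def graph_size(n_sequence_sets, motifs_per_set):
    """Return (number of nodes, number of possible undirected edges)."""
    nodes = n_sequence_sets * motifs_per_set
    possible_edges = nodes * (nodes - 1) // 2
    return nodes, possible_edges


if __name__ == "__main__":
    nodes, edges = graph_size(4103, 40)
    print(f"{nodes:,} nodes, {edges:,} possible edges")
```

The exact values, 164,120 nodes and about 1.35 × 10^10 node pairs, agree with the rounded figures of 160 × 10^3 nodes and 13 × 10^9 edges.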

Two assumptions:

1. A true motif is more likely to be identified by different tools in

the same inter-operonic sequence set;

2. A true motif is more likely to be identified in different inter-operonic sequence sets.
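A sketch of how a weighted motif similarity graph could be assembled. The slides do not state which similarity measure defines w_{i,j}, so the measure below (average per-column overlap of base frequencies), the threshold, and all names are assumptions chosen only to make the example runnable.

```python
from itertools import combinations

BASES = "ACGT"


def column_overlap(col_a, col_b):
    """Shared probability mass of two columns (1.0 if identical, 0.0 if disjoint)."""
    return sum(min(col_a.get(b, 0.0), col_b.get(b, 0.0)) for b in BASES)


def motif_similarity(motif_a, motif_b):
    """Average column overlap of two equal-length motifs (hypothetical measure)."""
    if len(motif_a) != len(motif_b):
        raise ValueError("this sketch only compares motifs of equal length")
    total = sum(column_overlap(a, b) for a, b in zip(motif_a, motif_b))
    return total / len(motif_a)


def similarity_graph(motifs, threshold):
    """Return {(i, j): weight} for every motif pair scoring at least threshold."""
    edges = {}
    for i, j in combinations(range(len(motifs)), 2):
        weight = motif_similarity(motifs[i], motifs[j])
        if weight >= threshold:
            edges[(i, j)] = weight
    return edges


if __name__ == "__main__":
    # Each motif is a list of columns; each column maps base -> frequency.
    m_a = [{"A": 1.0}, {"T": 1.0}]
    m_b = [{"A": 1.0}, {"T": 0.5, "C": 0.5}]
    m_c = [{"G": 1.0}, {"G": 1.0}]
    print(similarity_graph([m_a, m_b, m_c], threshold=0.7))
```

At the scale quoted earlier, with roughly 13 × 10^9 candidate pairs, this all-pairs loop is the step that motivates parallel execution.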

Flowchart of the motif clustering algorithm

[Figure, panels a and b: input motifs from sequence sets s1, s2, ..., sn form two graphs, G1 with a higher density and G2 with a lower density. MCL is run on G2 to give clusters c1, c2, c3, c4, ..., ct; the clusters are mapped to G1, inducing subgraphs in G1, which gives G3; MCL is run again.]

Zhang S, Li S, Pham PT, Su Z. BMC Bioinformatics. 2010;11:397.
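The flowchart names MCL (Markov clustering) as the clustering step. As a rough, self-contained illustration of the general MCL idea (alternating expansion and inflation on a column-stochastic matrix), here is a pure-Python sketch. It is a simplified teaching version and not the MCL program or the GleClubs code: the fixed iteration count, the inflation value of 2.0, and the cluster read-out are assumptions.

```python
def normalize_columns(matrix):
    """Scale each column in place so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def mcl(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes of a small graph given as a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, a common MCL preprocessing step.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two steps of a random walk).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize the columns.
        m = [[value ** inflation for value in row] for row in m]
        normalize_columns(m)
    # Rows that keep non-zero mass act as attractors; their non-zero columns
    # are the cluster members.
    clusters = []
    for row in m:
        members = frozenset(j for j, value in enumerate(row) if value > eps)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(cluster) for cluster in clusters)


if __name__ == "__main__":
    # Two disconnected triangles: nodes 0-2 and nodes 3-5.
    triangles = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    print(mcl(triangles))
```

The dense matrix product costs O(n³) per iteration, so this form is only usable on toy graphs; graphs with about 10^5 nodes need sparse representations.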

Flowchart of the motif clustering algorithm

[Figure: flowchart with the steps: motif-finding on sequence sets s1, s2, ..., sn; find cliques associated with each node; find the quasi-cliques; rank motifs (Motif 1, Motif 2, ..., Motif m); sort by genomes (Genome 1, Genome 2, Genome 3, ..., Genome N).]

[Equation: ClusterScore, defined with log(·) and exp(·) terms over the symbols n, N, L, i, P, and I.]
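A quasi-clique relaxes the clique requirement: instead of every pair of nodes being connected, the fraction of connected pairs must reach a threshold γ. The greedy sketch below grows a quasi-clique from a seed node. It illustrates the concept only; the greedy rule, the threshold, and the function names are assumptions, not the procedure used in GleClubs.

```python
def edge_density(adj, nodes):
    """Fraction of node pairs in `nodes` that are joined by an edge."""
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        return 1.0
    edges = sum(1 for i, u in enumerate(nodes) for v in nodes[i + 1:] if v in adj[u])
    return edges / (n * (n - 1) / 2)


def grow_quasi_clique(adj, seed, gamma):
    """Greedily add the best-connected neighbor while density stays >= gamma.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    """
    members = {seed}
    while True:
        candidates = set().union(*(adj[u] for u in members)) - members
        if not candidates:
            return members
        # Prefer the candidate with the most links into the current set;
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda v: len(adj[v] & members))
        if edge_density(adj, members | {best}) < gamma:
            return members
        members.add(best)


if __name__ == "__main__":
    # Nodes 0-3 form a clique missing the edge 0-3; node 4 hangs off node 3.
    adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(sorted(grow_quasi_clique(adj, seed=0, gamma=0.8)))
```

In the toy graph, nodes 0 to 3 have 5 of 6 possible edges (density about 0.83), so they pass γ = 0.8, while adding node 4 would drop the density to 0.6.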

Parallelization of GleClubs on a cluster

• Predict operons for each genome in each group

• Divide genomes into multiple target groups

• Predict orthologous operon groups and extract intergenic sequences

• Predict multiple over-represented k-mer motifs in each intergenic sequence set

• Compute similarity scores of motifs

Embarrassingly parallel computation

To avoid the I/O bottleneck, each process reads an equal portion of the files and shares this data with all the other processes.

• Construct the motif similarity graph and cut it into subgraphs: serial computation

• Find and merge quasi-cliques: to balance the workloads across processes, subgraphs are assigned to processes, and larger subgraphs are further split among processes using a master/slave model

• Predict TFBSs in each genome: embarrassingly parallel computation
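Two of the scheduling ideas above can be sketched without any MPI dependency. `block_ranges` shows one way to give each process an equal share of the input files; in an MPI program the subsequent sharing step could be a collective such as MPI_Allgather, which is an assumption here because the slides do not name the call. `assign_largest_first` is a static largest-first heuristic for spreading subgraphs over processes; it is a swapped-in simplification, not the dynamic master/slave scheme the slide describes. All names are illustrative.

```python
import heapq


def block_ranges(n_items, n_procs):
    """Split range(n_items) into n_procs contiguous, near-equal (start, stop) blocks."""
    base, extra = divmod(n_items, n_procs)
    ranges, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        ranges.append((start, start + size))
        start += size
    return ranges


def assign_largest_first(sizes, n_procs):
    """Give each task (largest first) to the currently least-loaded process.

    Returns one list of task indices per process.
    """
    heap = [(0, rank) for rank in range(n_procs)]  # (current load, process rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_procs)]
    for index in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(index)
        heapq.heappush(heap, (load + sizes[index], rank))
    return assignment


if __name__ == "__main__":
    print(block_ranges(10, 3))
    print(assign_largest_first([9, 7, 6, 5, 4], 2))
```

A static assignment like this needs good size estimates up front. A master/slave scheme avoids that requirement by handing out work on demand, which is presumably why it is used for the larger subgraphs.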

Port GleClubs to the Windows Azure Platform

• Predict operons for each genome in each group

• Divide genomes into multiple groups

• Predict orthologous operon groups and extract intergenic sequences

• Predict over-represented k-mer motifs in each intergenic sequence set

• Compute similarity scores of motifs

Implemented as an SOA application

MPI, Hadoop, MapReduce or others?

• Construct the motif similarity graph and cut it into subgraphs: serial computation

• Find and merge quasi-cliques: MPI, Hadoop, MapReduce or others?

• Predict TFBSs in each genome: implemented as an SOA application

Acknowledgments

• Dr. Shaoqiang Zhang

• Dr. Xia Dong

• Peter Pham

• Shan Li

• Meng Niu

• Minli Xu

• Chen Xu

• Matthew Smith

Funding sources:

• Dr. Barry Wilkinson

• Ridhi Dua

UNC Charlotte

• Dr. Srinivas Akella

• Youjie Zhou

Thank You !

Questions ?
