cis/TF discovery for Arabidopsis, Aristotelis Tsirigos, email: [email protected], NYU Computer...
Post on 20-Dec-2015
2
Outline
• Input data
• The proposed model
• Results on yeast
• Results on arabidopsis
• Unsupervised pattern discovery
3
Input data
4
Input data
[Figure: expression matrix, ~23,000 genes × 25 points; sequence data, 1,500bp upstream (gctaagc...)]
5
Normalization
[Figure: expression matrix, ~23,000 genes × 25 points; sequence data, 1,500bp upstream (gctaagc...)]
normalize columns (mean=0)
6
Filtering
[Figure: expression matrix, ~23,000 genes × 25 points; sequence data, 1,500bp upstream]
normalize columns (mean=0, stdev=1)
filter out low-variance
[Figure: filtered matrix, ~5,000 genes × 25 points; gctaagc... motif bitmap 001011…]
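The normalization and filtering steps can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names, the toy matrix, and the `min_variance` cutoff are all assumptions, since the slides do not say how "low-variance" is defined.

```python
import statistics

def normalize_columns(matrix):
    """Scale every column (condition) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has stdev 0; fall back to 1.0 to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in matrix
    ]

def filter_low_variance(genes, matrix, min_variance):
    """Keep only genes (rows) whose variance across conditions reaches the cutoff."""
    kept = [
        (gene, row)
        for gene, row in zip(genes, matrix)
        if statistics.pvariance(row) >= min_variance
    ]
    return [gene for gene, _ in kept], [row for _, row in kept]

# Toy data: 3 hypothetical genes measured at 3 points.
genes = ["g1", "g2", "g3"]
expression = [
    [1.0, 5.0, 2.0],
    [2.0, 5.5, 2.0],
    [3.0, 0.0, 8.0],
]
normalized = normalize_columns(expression)
kept_genes, kept_rows = filter_low_variance(genes, normalized, min_variance=0.5)
```

In this sketch the variance filter is applied after column normalization, which is one possible reading of the slide order.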
7
The proposed model
8
Assumption 1
A single TF binds to a single cis element (motif)
Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)
9
Assumption 2
TFs regulate genes sharing a motif only on a subset of conditions
[Figure: two panels, "TF & regulated genes (group #1)" and "TF & regulated genes (group #2)"; x-axis: condition, y-axis: expression]
10
Assumption 2 (cont’d)
TFs regulate genes sharing a motif only on a subset of conditions
[Figure: two panels, "Expression pattern #1" and "Expression pattern #2"; x-axis: condition, y-axis: expression]
11
Assumption 3
The TF expression correlates with the sum of the partially correlating expression patterns
[Figure: "sum of genes"; x-axis: condition, y-axis: expression]
12
Objective
• For each cis element (motif):
– discover groups of co-regulated genes
– compute aggregate motif expression
• For each TF:
– find best correlating motifs
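The last objective, finding the best correlating motifs for a TF, might look like the ranking below. This is only a sketch: it assumes each motif already has an aggregate expression profile, that Pearson correlation is the score, and the profile values are toy numbers (only the motif strings are borrowed from the yeast tables).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def rank_motifs(tf_profile, motif_profiles):
    """Return (motif, correlation) pairs, best correlating motif first."""
    scored = [
        (motif, pearson(tf_profile, profile))
        for motif, profile in motif_profiles.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy TF profile and toy motif aggregate profiles over 4 conditions.
tf_profile = [0.1, 0.9, 0.4, -0.3]
motif_profiles = {
    "gagtca": [0.2, 1.0, 0.5, -0.2],
    "ctgact": [0.5, -0.4, 0.1, 0.6],
}
ranking = rank_motifs(tf_profile, motif_profiles)
```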
13
The algorithm – step 1
[Figure: expression matrix, ~5,000 genes × 25 points]
step 1: clustering
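The slides do not name the clustering method used in step 1. As a stand-in, the sketch below uses plain k-means with a deterministic farthest-first initialization; both choices, and all function names, are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(rows, k):
    """Pick k starting centroids: the first row, then repeatedly the row farthest from those chosen."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows,
            key=lambda r: min(squared_distance(r, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids

def kmeans(rows, k, max_iterations=100):
    """Return one cluster label per row."""
    centroids = farthest_first(rows, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: two obvious groups of profiles.
profiles = [
    [1.0, 1.0, 0.0],
    [0.9, 1.1, 0.1],
    [-1.0, -1.0, 0.0],
    [-1.1, -0.9, 0.1],
]
labels = kmeans(profiles, k=2)
```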
14
The algorithm – step 2
[Figure: expression matrix, ~5,000 genes × 25 points]
step 1: clustering
step 2: for any motif, compute its gene set
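Step 2 can be read as a scan of each gene's upstream sequence for the motif. The sketch below assumes lowercase sequences and, as an option, also counts the reverse complement; the yeast tables list reverse-complement pairs with identical occurrence counts, which hints at this, but the slides do not state it. Gene names and sequences are toy values.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(motif):
    """Reverse complement of a lowercase DNA string."""
    return motif.translate(COMPLEMENT)[::-1]

def motif_gene_set(motif, upstream, both_strands=True):
    """Genes whose upstream sequence contains the motif.

    With both_strands=True (an assumption), the reverse complement counts too.
    """
    variants = {motif}
    if both_strands:
        variants.add(reverse_complement(motif))
    return {
        gene
        for gene, sequence in upstream.items()
        if any(variant in sequence for variant in variants)
    }

# Toy upstream sequences for three hypothetical genes.
upstream = {
    "geneA": "ttgactcgaa",
    "geneB": "aacgagtctt",
    "geneC": "acacacacac",
}
both = motif_gene_set("gactcg", upstream)
forward_only = motif_gene_set("gactcg", upstream, both_strands=False)
```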
15
The algorithm – step 3
[Figure: expression matrix, ~5,000 genes × 25 points]
step 1: clustering
step 2: for any motif, compute its gene set
step 3: compute the distribution of its genes into the clusters
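Step 3 amounts to counting, per cluster, how many genes of the motif's gene set fall into it. A minimal sketch, with hypothetical names and toy assignments:

```python
from collections import Counter

def cluster_distribution(gene_set, cluster_of, n_clusters):
    """Number of the motif's genes in each cluster, indexed by cluster id."""
    counts = Counter(cluster_of[gene] for gene in gene_set if gene in cluster_of)
    return [counts.get(cluster, 0) for cluster in range(n_clusters)]

# Toy cluster assignment; "gX" was filtered out earlier and has no cluster.
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 2}
distribution = cluster_distribution({"g1", "g2", "g4", "gX"}, cluster_of, n_clusters=3)
```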
16
The algorithm – step 4
[Figure: expression matrix, ~5,000 genes × 25 points]
step 1: clustering
step 2: for any motif, compute its gene set
step 3: compute the distribution of its genes into the clusters
step 4: determine overrepresented clusters using t-test
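The slides say only that a t-test is used in step 4, not how it is set up. One possible reading is a one-sample t statistic per cluster: treat "gene is in cluster c" as a 0/1 indicator over the motif's genes and compare its mean with the background fraction of all genes in that cluster. The formulation, the function names, and the cutoff of 2.0 are assumptions.

```python
import math

def overrepresentation_t(count, n_motif_genes, cluster_size, n_genes):
    """One-sample t statistic: observed cluster fraction vs. background fraction."""
    if n_motif_genes < 2:
        return 0.0  # not enough genes to estimate a variance
    background = cluster_size / n_genes
    mean = count / n_motif_genes
    # Sample variance of a 0/1 indicator with the given mean.
    variance = mean * (1 - mean) * n_motif_genes / (n_motif_genes - 1)
    if variance == 0:
        if mean == background:
            return 0.0
        return math.inf if mean > background else -math.inf
    return (mean - background) / math.sqrt(variance / n_motif_genes)

def overrepresented_clusters(distribution, cluster_sizes, threshold=2.0):
    """Cluster ids whose t statistic exceeds the (assumed) threshold."""
    n_motif_genes = sum(distribution)
    n_genes = sum(cluster_sizes)
    return [
        cluster
        for cluster, (count, size) in enumerate(zip(distribution, cluster_sizes))
        if overrepresentation_t(count, n_motif_genes, size, n_genes) > threshold
    ]

# Toy example: 20 motif genes, 12 of them in a cluster holding 10% of all genes.
selected = overrepresented_clusters([12, 3, 5], [100, 400, 500])
t_first = overrepresentation_t(12, 20, 100, 1000)
```

Turning the statistic into a p-value would need the t distribution, which is left out here to keep the sketch dependency-free.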
17
The algorithm – final step
[Figure: expression matrix, ~5,000 genes × 25 points; aggregate profile, 25 points]
final step: compute motif aggregate expression
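For the final step, a plausible sketch is to average the profiles of the motif's genes that fall into the overrepresented clusters. Assumption 3 speaks of a sum; the mean used here differs only by a constant factor, which leaves Pearson correlation unchanged. Names and data are hypothetical.

```python
def aggregate_expression(gene_set, cluster_of, selected_clusters, expression):
    """Mean profile of the motif's genes that lie in the selected clusters."""
    members = [
        gene
        for gene in sorted(gene_set)
        if cluster_of.get(gene) in selected_clusters
    ]
    if not members:
        return []
    n_points = len(expression[members[0]])
    return [
        sum(expression[gene][point] for gene in members) / len(members)
        for point in range(n_points)
    ]

# Toy data: g1 and g2 sit in the selected cluster 0, g3 does not.
expression = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [3.0, 2.0, 1.0],
    "g3": [10.0, 10.0, 10.0],
}
cluster_of = {"g1": 0, "g2": 0, "g3": 1}
profile = aggregate_expression({"g1", "g2", "g3"}, cluster_of, {0}, expression)
```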
18
Yeast
19
Example TF: BAS1
Using cis/TF version 1:
RANK  MOTIF   OCCUR  corr    score
1     gactcg  46     0.6446  66
2     cgagtc  46     0.6446  16
3     gactaa  163    0.6381  66
4     ttagtc  163    0.6381  33
5     tcggct  87     0.6374  33
...
12    gctagt  110    0.6268  33
13    agtcac  137    0.6262  83     p-value=0.079
...
27    gagtca  136    0.6192  100    p-value=0.004
20
Example TF: BAS1
Using cis/TF version 2:
RANK  MOTIF   OCCUR  signf  corr  score
1     ctgact  122    0.62   0.66  33
2     agtcag  122    0.62   0.66  83
3     ggttta  187    0.62   0.63  50
4     taaacc  187    0.62   0.63  33
5     gagtca  136    0.68   0.63  100    p-value=0.002
6     tgactc  136    0.68   0.63  33
7     atttga  378    0.64   0.63  33
8     tcaaat  378    0.64   0.63  50
9     agtggc  126    0.66   0.61  50
10    gccact  126    0.66   0.61  50
21
Cluster #1: correlation = 0.02
[Figure: profiles of BAS1 and cluster #1; y-axis: -5 to 5]
22
Cluster #2: correlation = -0.05
[Figure: profiles of BAS1 and cluster #2; y-axis: -5 to 5]
23
Cluster #0: correlation = 0.18
[Figure: profiles of BAS1 and cluster #0; y-axis: -5 to 5]
24
Cluster #4: correlation = -0.35
[Figure: profiles of BAS1 and cluster #4; y-axis: -5 to 5]
25
Cluster #3: correlation = 0.63
[Figure: profiles of BAS1 and cluster #3; y-axis: -5 to 5]
26
Conclusions
Advantages of version 2:
• gives the ability to focus on the gene cluster that correlates best with a given TF
• thus increases overall correlation and motif rank
• offers a measure of motif significance
• can be extended to pairs of TFs/motifs
27
Arabidopsis
28
Procedure
• Permute gene cluster assignment
• Compile list of putative motifs
• Compute significance score of known motifs
• Repeat 1000 times
• Compute p-value of the score
[Figure: histogram of ranking score (x-axis: ranking score, 1 to 26; y-axis: # of experiments, 0 to 140); p-val = 0.006]
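The permutation procedure can be sketched as an empirical p-value: shuffle the cluster labels, recompute the score, and report the fraction of shuffles that score at least as high as the real assignment. Here `score_fn` is a hypothetical callback standing in for "compile putative motifs and score the known ones", which the slides do not spell out; the toy score below is purely illustrative.

```python
import random

def permutation_p_value(observed_score, cluster_labels, score_fn,
                        n_permutations=1000, seed=0):
    """Fraction of label permutations scoring at least as high as observed."""
    rng = random.Random(seed)
    labels = list(cluster_labels)  # work on a copy
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if score_fn(labels) >= observed_score:
            hits += 1
    return hits / n_permutations

# Toy setup: genes 0..4 carry a known motif; the score counts how many of
# them land in cluster 0. In the real assignment all five do.
real_labels = [0] * 5 + [1] * 45

def toy_score(labels):
    return sum(1 for label in labels[:5] if label == 0)

observed = toy_score(real_labels)
p_value = permutation_p_value(observed, real_labels, toy_score)
```

With 1000 permutations the smallest non-zero p-value this estimator can report is 0.001.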
30
TF discovery?
Need data for training!
(TFs and their associated binding sites)
Parameters to be estimated:
• number of clusters
• motif size & degeneracy
31
Pattern discovery
32
TF-driven pattern discovery
• Unsupervised pattern discovery
• Find groups of genes partially correlating with TF
• Apply statistical filter
• Look for over-represented motifs in genes’ upstream regions
• Data for validation?
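The slides do not define "partially correlating". One way to picture it, purely as an assumption, is that a gene qualifies if its Pearson correlation with the TF reaches a threshold on at least one contiguous window of conditions. The window length, threshold, and gene names below are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def best_window_correlation(tf, gene, window):
    """Highest correlation over all contiguous windows of the given length."""
    return max(
        pearson(tf[start:start + window], gene[start:start + window])
        for start in range(len(tf) - window + 1)
    )

def partially_correlating(tf, expression, window, threshold):
    """Genes that track the TF on at least one window of conditions."""
    return sorted(
        gene
        for gene, profile in expression.items()
        if best_window_correlation(tf, profile, window) >= threshold
    )

# Toy data: geneA follows the TF on the first four conditions only,
# geneB is anti-correlated with the TF everywhere.
tf = [0.0, 1.0, 2.0, 3.0, 1.0, 2.0, 0.0, 1.0]
expression = {
    "geneA": [0.0, 1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 0.0],
    "geneB": [-value for value in tf],
}
matches = partially_correlating(tf, expression, window=4, threshold=0.9)
```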
33
[Figure: expression profiles of AT1G73230 (TF), AT1G53290 and AT5G59880; x-axis: 1 to 22, y-axis: -0.4 to 1.2]
34
Pattern discovery example
[Figure: "TF & regulated genes (group #2)"; x-axis: condition, y-axis: expression]
35
“Predicting Gene Expression from Sequence”, Beer & Tavazoie, Cell 2004
• Group genes in 49 clusters
• Predict gene cluster using motifs discovered in its upstream region
36
[Figure: frequency of correlation values (x-axis: correlation, -1 to 1; y-axis: frequency, 0.00 to 0.20); series: ALL, PAC, RRPE, PAC&RRPE; annotation: 2,500 genes]
37
Conclusions
38
Conclusions
Two options:
• Supervised training:
– uses background knowledge to construct model
– needs more training data
• Unsupervised pattern discovery:
– minimal model bias (no prior knowledge)
– needs more ‘expert’ help to filter results