cis/tf discovery for arabidopsis aristotelis tsirigos email: [email protected] nyu computer...

Cis/TF discovery for Arabidopsis

Aristotelis Tsirigos

email: [email protected]

NYU Computer Science


Outline

• Input data

• The proposed model

• Results on yeast

• Results on Arabidopsis

• Unsupervised pattern discovery


Input data

[Figure: expression matrix of ~23,000 genes × 25 points; for each gene, the 1,500 bp upstream sequence (gctaagc...)]

Normalization

[Figure: ~23,000 genes × 25 points, with 1,500 bp upstream sequences (gctaagc...); normalize columns (mean=0)]

Filtering

[Figure: ~23,000 genes × 25 points, with 1,500 bp upstream sequences (gctaagc...); normalize columns (mean=0, stdev=1); filter out low-variance genes, leaving ~5,000 genes × 25 points; motif bitmap (001011…)]
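The normalization and filtering steps above can be sketched as follows. This is a minimal illustration, not the presenter's code: the variance cutoff `min_variance`, the toy matrix, and the gene labels are assumptions made for the example.

```python
from statistics import mean, pstdev, pvariance

def normalize_columns(matrix):
    """Scale every column (condition) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    mus = [mean(col) for col in columns]
    # Guard against constant columns to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in matrix]

def filter_low_variance(matrix, gene_ids, min_variance):
    """Keep only genes (rows) whose variance across conditions reaches the cutoff."""
    kept = [(g, row) for g, row in zip(gene_ids, matrix) if pvariance(row) >= min_variance]
    return [g for g, _ in kept], [row for _, row in kept]

# Toy data: 3 genes x 3 conditions (hypothetical values).
expression = [
    [1.0, 5.0, 9.0],   # gene A: varies strongly across conditions
    [2.0, 2.1, 2.0],   # gene B: nearly flat
    [9.0, 4.0, 1.0],   # gene C: varies strongly
]
normalized = normalize_columns(expression)
kept_ids, kept_rows = filter_low_variance(normalized, ["A", "B", "C"], min_variance=0.5)
```

In this toy case the nearly flat gene B is dropped, which mirrors the reduction from ~23,000 to ~5,000 genes on the slide.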


The proposed model


Assumption 1

A single TF binds to a single cis element (motif)

Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)


Assumption 2

TFs regulate genes sharing a motif only on a subset of conditions

[Figure: two panels, "TF & regulated genes (group #1)" and "TF & regulated genes (group #2)"; x-axis: condition, y-axis: expression (-0.8 to 1)]

Assumption 2 (cont’d)

TFs regulate genes sharing a motif only on a subset of conditions

[Figure: two panels, "Expression pattern #1" and "Expression pattern #2"; x-axis: condition, y-axis: expression (-0.8 to 1)]

Assumption 3

The TF expression correlates with the sum of the partially correlating expression patterns

[Figure: "sum of genes"; x-axis: condition, y-axis: expression (-0.8 to 1)]
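A toy illustration of Assumption 3, with made-up numbers: two hypothetical gene groups each follow the TF on a different half of the conditions. Each group alone correlates only partially with the TF, while their sum tracks it closely. The profiles and the half-and-half split are assumptions chosen to make the effect visible.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical TF profile over 8 conditions.
tf = [0.9, 0.4, -0.3, -0.7, 0.2, 0.8, -0.5, 0.6]
# Group 1 follows the TF only on the first four conditions, group 2 only on the last four.
group1 = tf[:4] + [0.0] * 4
group2 = [0.0] * 4 + tf[4:]
total = [a + b for a, b in zip(group1, group2)]

r1, r2, r_sum = pearson(tf, group1), pearson(tf, group2), pearson(tf, total)
```

Here `r_sum` exceeds both `r1` and `r2`, which is the behaviour the assumption relies on.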


Objective

• For each cis element (motif):

– discover groups of co-regulated genes

– compute aggregate motif expression

• For each TF:

– find best correlating motifs

The algorithm – step 1

[Figure: ~5,000 genes × 25 points]

step 1: clustering
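The slides do not name the clustering method used in step 1. As one possible sketch, plain k-means over the expression profiles is shown here; the choice of k-means, the deterministic initialisation, and the toy profiles are all assumptions.

```python
def kmeans(rows, k, iterations=50):
    """Plain k-means on expression profiles; returns one cluster label per row."""
    n = len(rows)
    # Deterministic initialisation: k evenly spaced rows serve as starting centroids.
    centroids = [list(rows[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each gene to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move again
        labels = new_labels
        # Update step: move each centroid to the mean profile of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy profiles: three rising genes followed by three falling genes.
profiles = [
    [1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 1.9, 3.1],
    [3.0, 2.0, 1.0], [3.1, 1.9, 1.0], [2.9, 2.1, 0.9],
]
labels = kmeans(profiles, k=2)
```

The number of clusters is listed later in the deck as a parameter to be estimated, so `k` is left as an argument.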

[Figure: ~5,000 genes × 25 points]

step 1: clustering

step 2: for any motif, compute its gene set
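Step 2 can be sketched as a substring search over the upstream regions. Two assumptions are made here: a motif counts as present if it occurs on either strand (the result tables list reverse-complement pairs with identical occurrence counts, which suggests this), and the motif bitmap is read as one bit per gene. The gene names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(motif):
    """Reverse complement of a lowercase DNA string."""
    return motif.translate(COMPLEMENT)[::-1]

def gene_set(motif, upstream):
    """Genes whose upstream region contains the motif on either strand."""
    variants = (motif, reverse_complement(motif))
    return {gene for gene, seq in upstream.items() if any(v in seq for v in variants)}

def motif_bitmap(motif, upstream, gene_order):
    """One bit per gene: '1' if the motif occurs upstream of it, '0' otherwise."""
    genes = gene_set(motif, upstream)
    return "".join("1" if g in genes else "0" for g in gene_order)

# Hypothetical upstream sequences (real ones would be 1,500 bp long).
upstream = {
    "g1": "ttgactcgaa",   # contains gactcg
    "g2": "aacgagtctt",   # contains cgagtc, the reverse complement
    "g3": "acacacacac",   # contains neither
}
genes = gene_set("gactcg", upstream)
bitmap = motif_bitmap("gactcg", upstream, ["g1", "g2", "g3"])
```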

[Figure: ~5,000 genes × 25 points]

step 1: clustering

step 3: compute the distribution of its genes into the clusters
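Step 3 amounts to counting how many of the motif's genes fall into each cluster. A minimal sketch, with hypothetical gene names and cluster assignments:

```python
from collections import Counter

def cluster_distribution(genes, cluster_of, n_clusters):
    """Count how many of the given genes fall into each cluster."""
    counts = Counter(cluster_of[g] for g in genes)
    return [counts.get(c, 0) for c in range(n_clusters)]

# Hypothetical cluster assignment from step 1 and gene set from step 2.
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 2}
distribution = cluster_distribution({"g1", "g2", "g4"}, cluster_of, n_clusters=3)
```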

[Figure: ~5,000 genes × 25 points]

step 1: clustering

step 3: compute the distribution of its genes into the clusters

step 4: determine overrepresented clusters using t-test
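The slides say only that a t-test is used in step 4, not how it is set up. One plausible reading, shown here as an assumption, is a one-sample t statistic on 0/1 membership indicators: for each cluster, the fraction of the motif's genes in that cluster is compared with the fraction of all genes in it. The cutoff `t_cutoff=2.0` and the counts are hypothetical.

```python
from math import sqrt, inf

def overrepresented_clusters(motif_counts, background_counts, t_cutoff=2.0):
    """Return clusters holding more of the motif's genes than the background predicts."""
    n = sum(motif_counts)            # genes carrying the motif (needs n >= 2)
    total = sum(background_counts)   # all genes that were clustered
    selected = []
    for cluster, (k, b) in enumerate(zip(motif_counts, background_counts)):
        observed = k / n             # fraction of the motif's genes in this cluster
        expected = b / total         # fraction of all genes in this cluster
        # Sample variance of the 0/1 membership indicators.
        variance = observed * (1 - observed) * n / (n - 1)
        if variance > 0:
            t = (observed - expected) / sqrt(variance / n)
        else:
            # All or none of the motif's genes are in this cluster.
            t = inf if observed > expected else 0.0
        if t > t_cutoff:
            selected.append(cluster)
    return selected

# Hypothetical counts: 40 motif genes over 3 equally sized clusters.
enriched = overrepresented_clusters([30, 5, 5], [1000, 1000, 1000])
```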

The algorithm – final step

[Figure: ~5,000 genes × 25 points; motif aggregate expression over 25 points]

final step: compute motif aggregate expression
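A sketch of the final step together with the per-TF ranking named in the objective. Following Assumption 3, the aggregate is taken to be the condition-wise sum of the selected genes' profiles, and motifs are ranked by Pearson correlation with the TF; both choices, and the motif labels and data below, are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx * sy > 0 else 0.0

def aggregate_expression(genes, expression):
    """Sum the expression profiles of the given genes, condition by condition."""
    return [sum(col) for col in zip(*(expression[g] for g in genes))]

def rank_motifs(tf_profile, motif_genes, expression):
    """Sort motifs by how well their aggregate expression correlates with the TF."""
    scored = [
        (pearson(tf_profile, aggregate_expression(genes, expression)), motif)
        for motif, genes in motif_genes.items()
    ]
    return sorted(scored, reverse=True)

# Hypothetical profiles over 4 conditions.
expression = {
    "g1": [1, 2, 3, 4], "g2": [2, 3, 4, 5.5],
    "g3": [4, 3, 2, 1], "g4": [1, 3, 1, 3],
}
tf_profile = [0, 1, 2, 3]
motif_genes = {"motifA": ["g1", "g2"], "motifB": ["g3", "g4"]}
ranking = rank_motifs(tf_profile, motif_genes, expression)
```

In the full procedure the gene list for each motif would be restricted to the overrepresented clusters from step 4.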


Yeast

Example TF: BAS1

Using cis/TF version 1:

RANK  MOTIF   OCCUR  corr    score
1     gactcg  46     0.6446  66
2     cgagtc  46     0.6446  16
3     gactaa  163    0.6381  66
4     ttagtc  163    0.6381  33
5     tcggct  87     0.6374  33
...
12    gctagt  110    0.6268  33
13    agtcac  137    0.6262  83     p-value=0.079
...
27    gagtca  136    0.6192  100    p-value=0.004

Example TF: BAS1

Using cis/TF version 2:

RANK  MOTIF   OCCUR  signf  corr  score
1     ctgact  122    0.62   0.66  33
2     agtcag  122    0.62   0.66  83
3     ggttta  187    0.62   0.63  50
4     taaacc  187    0.62   0.63  33
5     gagtca  136    0.68   0.63  100    p-value=0.002
6     tgactc  136    0.68   0.63  33
7     atttga  378    0.64   0.63  33
8     tcaaat  378    0.64   0.63  50
9     agtggc  126    0.66   0.61  50
10    gccact  126    0.66   0.61  50

Cluster #1: correlation = 0.02

[Figure: BAS1 and cluster #1; axis from -5 to 5]

Cluster #2: correlation = -0.05

[Figure: BAS1 and cluster #2; axis from -5 to 5]

[Figure: BAS1 and cluster #0; axis from -5 to 5]

Cluster #4: correlation = -0.35

[Figure: BAS1 and cluster #4; axis from -5 to 5]

[Figure: BAS1 and cluster #3; axis from -5 to 5]


Conclusions

Advantages of version 2:

gives ability to focus on gene cluster that correlates best with a given TF

thus, increases overall correlation and motif rank

offers a measure of motif significance

can be extended to pairs of TFs/motifs


Arabidopsis

Procedure

• Permute gene cluster assignment

• Compile list of putative motifs

• Compute significance score of known motifs

• Repeat 1000 times

• Compute p-value of the score

[Figure: histogram; x-axis: ranking score (1 to 26), y-axis: # of experiments (0 to 140); p-val = 0.006]
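The permutation procedure can be sketched generically. The scoring of known motifs is abstracted into a caller-supplied function, `score_fn`, which is a hypothetical placeholder; the add-one correction in the p-value is also an assumption, since the slides do not give the exact formula.

```python
import random

def permutation_p_value(observed, score_fn, cluster_labels, n_permutations=1000, seed=0):
    """Empirical p-value: how often a permuted assignment scores at least as high."""
    rng = random.Random(seed)
    labels = list(cluster_labels)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)                 # permute the gene-to-cluster assignment
        if score_fn(labels) >= observed:    # re-score under the permuted assignment
            at_least += 1
    # Add-one correction keeps the estimate away from an impossible p-value of 0.
    return (at_least + 1) / (n_permutations + 1)

# Toy score: how many of the first five genes sit in cluster 0.
labels = [0] * 5 + [1] * 5

def toy_score(labs):
    return sum(1 for lab in labs[:5] if lab == 0)

p_value = permutation_p_value(toy_score(labels), toy_score, labels)
```

With 1000 permutations the smallest reportable value under this correction is 1/1001.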


TF discovery?

Need data for training!

(TFs and their associated binding sites)

Parameters to be estimated:

• number of clusters

• motif size & degeneracy


Pattern discovery


TF-driven pattern discovery

• Unsupervised pattern discovery

• Find groups of genes partially correlating with TF

• Apply statistical filter

• Look for over-represented motifs in genes’ upstream regions

• Data for validation?
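The slides do not say how the subset of conditions is chosen when looking for genes that partially correlate with a TF. As one simple sketch, the search below is restricted to contiguous windows of conditions and returns the window where a gene tracks the TF best. The contiguity restriction, the window width, and the profiles are all assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx * sy > 0 else 0.0

def best_window(tf_profile, gene_profile, width):
    """Contiguous block of conditions where the gene tracks the TF most closely."""
    start = max(
        range(len(tf_profile) - width + 1),
        key=lambda s: pearson(tf_profile[s:s + width], gene_profile[s:s + width]),
    )
    return start, pearson(tf_profile[start:start + width], gene_profile[start:start + width])

# Hypothetical profiles: the gene follows the TF on the first four conditions only.
tf = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
gene = [0.1, 0.9, 0.3, 0.8, 0.8, 0.3, 0.6, 0.4]
start, r_window = best_window(tf, gene, width=4)
r_full = pearson(tf, gene)
```

Genes whose best-window correlation passes a statistical filter would then be grouped, and their upstream regions searched for over-represented motifs.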

[Figure: profiles of AT1G73230 (TF), AT1G53290, and AT5G59880; x-axis: 1 to 22, y-axis: -0.4 to 1.2]


Pattern discovery example

[Figure: "TF & regulated genes (group #2)"; x-axis: condition, y-axis: expression (-0.8 to 1)]

“Predicting Gene Expression from Sequence”, Beer & Tavazoie, Cell 2004

• Group genes in 49 clusters

• Predict gene cluster using motifs discovered in its upstream region

[Figure: frequency (y-axis, 0.00 to 0.20) of correlation (x-axis, -1 to 1) for ALL, PAC, RRPE, and PAC&RRPE; 2,500 genes]


Conclusions

Two options:

• Supervised training:

– uses background knowledge to construct model

– needs more training data

• Unsupervised pattern discovery:

– minimal model bias (no prior knowledge)

– needs more ‘expert’ help to filter results
