
Context-Specific Bayesian Clustering for Gene Expression Data

Yoseph Barash, Nir Friedman

School of Computer Science & Engineering, Hebrew University

Introduction

- New experimental methods produce an abundance of data
  - Gene expression
  - Genomic sequences
  - Protein levels
  - …
- Data analysis methods are crucial for understanding such data
- Clustering serves as a tool for organizing the data and finding patterns in it

This Talk

- New method for clustering
  - Combines different types of data
  - Emphasis on learning a context-specific description of the clusters
- Application to gene expression data
  - Combines expression data with genomic information

The Data

- Microarray data: a Genes × Experiments matrix, where entry (i, j) is the mRNA level of gene i in experiment j
- Genomic data: a Genes × Binding Sites matrix with k columns, where entry (i, j) is the number of binding sites of TF j in the promoter region of gene i

Goal: understand interactions between TFs and expression levels
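As a rough illustration of how one entry of the binding-site matrix could be filled in, the sketch below counts exact occurrences of a motif in a promoter string. The gene names and promoter sequences are made up for illustration, the motifs are the ones listed on the promoter analysis slides, and exact matching on a single strand is a simplifying assumption rather than the procedure used in the talk.

```python
def count_sites(promoter: str, motif: str) -> int:
    """Count (possibly overlapping) exact occurrences of motif in promoter."""
    promoter, motif = promoter.upper(), motif.upper()
    last_start = len(promoter) - len(motif)
    # startswith(motif, i) tests for a match beginning at position i
    return sum(promoter.startswith(motif, i) for i in range(last_start + 1))


# Hypothetical promoter sequences (not real yeast promoters)
promoters = {
    "geneA": "ATTGACTCATTTGACTCAGG",
    "geneB": "CCCCGCACCCCGC",
}
# Motifs taken from the promoter analysis slides
motifs = {"GCN4": "TGACTCA", "MIG1": "CCCCGC"}

# counts[gene][tf] plays the role of "# of binding sites of TF j in gene i"
counts = {
    gene: {tf: count_sites(seq, motif) for tf, motif in motifs.items()}
    for gene, seq in promoters.items()
}
print(counts)
```

A fuller treatment would also scan the reverse complement and allow degenerate positions; both are left out to keep the sketch short.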

Simple Clustering Model

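- Attributes are independent given the cluster
- Simple model, computationally cheap
- Genes are clustered according to both expression levels and binding sites

[Figure: Bayesian network in which the Cluster node is the only parent of the expression attributes A1, …, An and the binding-site attributes TF1, …, TFk]

$$P(A_1, \ldots, A_n, T_1, \ldots, T_k, C) = P(C)\, P(A_1 \mid C) \cdots P(A_n \mid C)\, P(T_1 \mid C) \cdots P(T_k \mid C)$$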
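The factorization can be turned into a cluster posterior for a single gene by summing log terms and normalizing. The sketch below assumes Gaussian expression attributes parameterized by (mean, standard deviation) and multinomial binding-site counts; all numbers and names (`prior`, `gauss`, `multi`) are illustrative and not taken from the talk.

```python
import math


def log_gaussian(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_joint(c, expr, sites, prior, gauss, multi):
    """log P(A_1..A_n, T_1..T_k, C=c) under the independence assumption."""
    lp = math.log(prior[c])
    for i, x in enumerate(expr):        # expression attributes A_i
        mean, std = gauss[i][c]
        lp += log_gaussian(x, mean, std)
    for j, d in enumerate(sites):       # binding-site counts T_j
        lp += math.log(multi[j][c][d])
    return lp


def posterior(expr, sites, prior, gauss, multi):
    """P(C | attributes), normalized in a numerically stable way."""
    logs = [log_joint(c, expr, sites, prior, gauss, multi) for c in range(len(prior))]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Toy model: 2 clusters, 1 expression attribute, 1 TF with counts 0..3
prior = [0.5, 0.5]
gauss = [[(0.0, 1.0), (2.0, 1.0)]]            # gauss[i][c] = (mean, std)
multi = [[[0.60, 0.30, 0.05, 0.05],           # multi[j][c][d] = P(T_j = d | C = c)
          [0.10, 0.10, 0.30, 0.50]]]

post = posterior(expr=[2.1], sites=[3], prior=prior, gauss=gauss, multi=multi)
print(post)
```

Working in log space avoids underflow when n and k are large, which is the typical case for expression data.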

Local Probability Models

[Figure: Cluster node with children A1, A2, TF1, TF2]

Multinomial: P(TF2 | C)

C   D=0   D=1   D=2   D=3
1   0.30  0.20  0.20  0.30
2   0.10  0.10  0.30  0.50
3   0.30  0.20  0.20  0.30
4   0.60  0.30  0.05  0.05
5   0.31  0.20  0.19  0.33
6   0.29  0.20  0.21  0.27

Gaussian: P(A1 | C), two parameters per cluster

C
1    0.10  0.20
2    0.10  0.50
3   -0.30  0.06
4    1.20  0.30
5    0.76  0.20
6   -0.23  0.20

Structure in Local Probability Models

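[Figure: Cluster node with children A1, A2, TF1, TF2]

P(TF2 | C), full table:

C   D=0   D=1   D=2   D=3
1   0.30  0.20  0.20  0.30
2   0.10  0.10  0.30  0.50
3   0.30  0.20  0.20  0.30
4   0.60  0.30  0.05  0.05
5   0.31  0.20  0.19  0.33
6   0.29  0.20  0.21  0.27

P(TF2 | C), compact table in which all clusters other than 2 and 4 share the default row *:

C   D=0   D=1   D=2   D=3
2   0.10  0.10  0.30  0.50
4   0.60  0.30  0.05  0.05
*   0.30  0.20  0.20  0.30

[Figure: Cluster node with children E1, E2, TF1, TF2]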

Context Specific Independence

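Benefits:
- Identifies what features characterize each cluster
- Reduces bias during learning
- A compact and efficient representation

P(TF2 | C):

C   D=0   D=1   D=2   D=3
2   0.10  0.10  0.30  0.50
4   0.60  0.30  0.05  0.05
*   0.30  0.20  0.20  0.30

[Figure: each edge from the Cluster node is annotated with the set of clusters on which the attribute depends: {2,4}, {}, {1,2,4}, {1,2,3,4,5}]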
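One way to picture the compact representation is as a lookup table with explicit rows for clusters 2 and 4 and a shared default row for every other cluster. The sketch below uses the P(TF2 | C) values from the slide; the dictionary layout and the function names are illustrative assumptions, and six clusters are assumed for the full table.

```python
# Explicit rows for clusters 2 and 4; "*" is the default row shared by the rest
csi_table = {
    2: [0.10, 0.10, 0.30, 0.50],
    4: [0.60, 0.30, 0.05, 0.05],
    "*": [0.30, 0.20, 0.20, 0.30],
}


def p_tf2(d, cluster, table=csi_table):
    """P(TF2 = d | C = cluster); clusters without their own row use '*'."""
    row = table.get(cluster, table["*"])
    return row[d]


def free_parameters(n_rows, n_values=4):
    """Each multinomial row has n_values - 1 free parameters."""
    return n_rows * (n_values - 1)


print(p_tf2(3, 2))                      # cluster 2 has its own row
print(p_tf2(3, 5))                      # cluster 5 falls back to the default row
print(free_parameters(6), "vs", free_parameters(len(csi_table)))
```

With six clusters, the full table needs 18 free parameters while the compact one needs 9, which is the sense in which the representation is more compact.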

Scoring CSI Cluster Models

- Represent conditional probabilities with different parametric families: Gaussian, Multinomial, Poisson, …
- Choose parameter priors from appropriate conjugate prior families
- Score:

$$P(M \mid D) \propto P(D \mid M)\, P(M)$$

where $P(D \mid M)$ is the marginal likelihood, $P(M)$ is the prior, and

$$P(D \mid M) = \int P(D \mid M, \theta)\, P(\theta \mid M)\, d\theta$$
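With a conjugate prior, the integral in the marginal likelihood has a closed form. The sketch below shows the standard Dirichlet-multinomial case for a single row of counts and uses it to compare giving two clusters separate rows against pooling them into one shared row. The counts, the uniform Dirichlet hyperparameters, and the omission of the structure prior P(M) are all assumptions made for illustration.

```python
from math import lgamma


def log_marginal_multinomial(counts, alpha):
    """log of the integral of P(data | theta) * Dirichlet(theta | alpha) over theta,
    for one multinomial row with the given observed counts."""
    a0, n = sum(alpha), sum(counts)
    return (lgamma(a0) - lgamma(a0 + n)
            + sum(lgamma(a + c) - lgamma(a) for a, c in zip(alpha, counts)))


alpha = [1.0, 1.0]                       # uniform Dirichlet prior (assumed)
counts_a, counts_b = [20, 0], [0, 20]    # hypothetical counts for two clusters

# Model 1: each cluster gets its own parameter row
separate = (log_marginal_multinomial(counts_a, alpha)
            + log_marginal_multinomial(counts_b, alpha))

# Model 2: both clusters share one row, so their counts are pooled
pooled_counts = [a + b for a, b in zip(counts_a, counts_b)]
pooled = log_marginal_multinomial(pooled_counts, alpha)

print(separate, pooled)
```

When the two clusters behave very differently, as in this toy example, the separate rows score higher; when their counts are similar, pooling tends to win because fewer parameters are integrated over.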

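Learning Structure – Naive Approach

- A hard problem: the number of structures is on the order of $2^{C \cdot N}$, where $C$ is the number of clusters and $N$ is the number of nodes
- “Standard” approach:
  - Learn model parameters using EM
  - Try “nearby” structures and learn parameters for each one using EM
  - Choose the best structure
- Basic problem: efficiency

P(D | C):

C   D=0   D=1   D=2
2   0.10  0.10  0.30
4   0.60  0.30  0.05
*   0.30  0.20  0.20

[Figure: the current structure (cluster node C with children E1, E2, TF1, TF2, edges annotated with the cluster sets {2,4}, {1,2,3}, {}, {2}) and two candidate “nearby” structures, each marked “?”, with edge annotations {2,4}, {1,2,3}, {3}, {} and {}, {1,2,3}, {3}, {2}]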

Learning Structure – Structural EM

- Learn model parameters using EM
  - Soft assignment for genes
  - Compute expected sufficient statistics

Gene  C1   C2   C3
1     0.5  0.3  0.2
2     0.6  0.1  0.3
3     0.1  0.2  0.7
4     0.1  0.2  0.7
5     0.2  0.6  0.2

- Use the “completed” data to evaluate each edge separately to find the best model
  - Given complete data, each edge’s parameters can be evaluated separately
  - For MAP, EM is computed only once per iteration
- Guaranteed to converge to a local optimum

[Figure: the current structure (cluster node C with children E1, E2, TF1, TF2, edges annotated with the cluster sets {2,4}, {1,2,3}, {}, {2}) and two candidate structures, each marked “?”, with edge annotations {2,4}, {3}, {}, {1,2,3} and {}, {3}, {2}, {1,2,3}]
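The expected sufficient statistics for a discrete attribute are fractional counts: each gene contributes its soft-assignment weight to the cell of its observed value in every cluster. The sketch below uses the soft assignments from the table on this slide; the binding-site counts in `sites` are hypothetical, as is the function name.

```python
# Soft assignments P(C | gene) for genes 1..5, taken from the slide
soft = [
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
]
# Hypothetical observed number of binding sites (0..3) for each gene
sites = [0, 1, 3, 3, 1]


def expected_counts(soft, values, n_values):
    """counts[c][v] = sum over genes with value v of P(C = c | gene)."""
    n_clusters = len(soft[0])
    counts = [[0.0] * n_values for _ in range(n_clusters)]
    for weights, v in zip(soft, values):
        for c, w in enumerate(weights):
            counts[c][v] += w
    return counts


counts = expected_counts(soft, sites, n_values=4)
for c, row in enumerate(counts, start=1):
    print(f"C{c}:", [round(x, 2) for x in row])
```

Because these fractional counts play the role of observed counts, the score of each attribute's local structure can be evaluated from them without rerunning EM for every candidate.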

Results on Synthetic Data

Basic approach:
- Generate data from a known structure
- Evaluate learned structures for different sample sizes (200–800)
- Add “noise” of unrelated samples to the training set to simulate genes that do not fall into “nice” functional categories (10–30%)
- Test the learned model for structure, as well as for correlation between its tagging and the one given by the original model

Main results:
- Cluster number: models with fewer clusters were sharply penalized. Models with 1–2 additional clusters often got a similar score, with “degenerate” clusters to which none of the real samples were classified
- Structure accuracy: very few false negative edges, 10–20% false positive edges (score dependent)
- Mutual information ratio: maximal for 800 samples, 95–100% for 500 and ~90% for 200 samples
- Learned clusters were very informative
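The slides do not define the mutual information ratio. One plausible reading, assumed here, is the mutual information between the original and learned cluster labels divided by the entropy of the original labels, so that a perfect recovery (up to renaming of clusters) scores 1. The labelings below are toy data.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_xy / n) * log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in count_ab.items()
    )


def information_ratio(true_labels, learned_labels):
    """Assumed definition: I(true; learned) / H(true)."""
    return mutual_information(true_labels, learned_labels) / entropy(true_labels)


true_labels = [0, 0, 1, 1]
print(information_ratio(true_labels, [1, 1, 0, 0]))   # same partition, renamed
print(information_ratio(true_labels, [0, 1, 0, 1]))   # unrelated partition
```

The ratio is undefined when the original labeling has a single cluster, since its entropy is zero.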

Yeast Stress Data (Gasch et al 2001)

- Examines the response of yeast to stress situations
- 93 arrays in total
- We selected ~900 genes that changed in a selective manner

Treatment steps:
- Initial clustering
- Found putative binding sites based on clusters
- Re-clustered with these sites

Stress Data -- CSI Clusters

[Figure: CSI clusters. y-axis: mean expression level (-2 to 4); x-axis: experimental conditions (HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP Steady)]

Promoters Analysis

Cluster 3
- MIG1: CCCCGC, CGGACC, ACCCCG
- GAL4: CGGGCC
- Others: CCAATCA

[Figure: mean expression level (-2 to 4) of cluster 3 across the experimental conditions (HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP Steady)]

Promoters Analysis

Cluster 7
- GCN4: TGACTCA
- Others: CGGAAAA, ACTGTGG

[Figure: mean expression level (-2 to 4) of cluster 7 across the experimental conditions (HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP Steady)]

Discussion

Goals:
- Identify binding sites/transcription factors
- Understand interactions among transcription factors
  - “Combinatorial effects” on expression
- Predict role/function of the genes

Methods:
- Integration of a model of statistical patterns of binding sites (see Holmes & Bruno, ISMB’00)
- Additional dependencies among attributes
  - Tree augmented Naive Bayes
  - Probabilistic Relational Models (see poster)
