Context-Specific Bayesian Clustering for Gene Expression Data
Yoseph Barash, Nir Friedman
School of Computer Science & Engineering
Hebrew University
Introduction
New experimental methods produce an abundance of data
  Gene expression
  Genomic sequences
  Protein levels
  …
Data analysis methods are crucial for understanding such data
Clustering serves as a tool for organizing the data and finding patterns in it
This Talk
New method for clustering
  Combines different types of data
  Emphasis on learning a context-specific description of the clusters
Application to gene expression data
  Combines expression data with genomic information
The Data
Microarray data (genes × experiments): the mRNA level of gene i in experiment j
Genomic data (genes × binding sites): the number of binding sites of TF j in the promoter region of gene i
Goal: understand interactions between TFs and expression levels
Simple Clustering Model
Attributes are independent given the cluster
Simple model, computationally cheap
Genes are clustered according to both expression levels and binding sites
[Figure: network with root node Cluster and children A1, A2, A3, …, An, TF1, TF2, TF3, …, TFk]
$$P(A_1,\ldots,A_n,T_1,\ldots,T_k,C) = P(C)\,P(A_1\mid C)\cdots P(A_n\mid C)\,P(T_1\mid C)\cdots P(T_k\mid C)$$
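As a rough illustration of the factorization on this slide, the sketch below evaluates the joint probability in log space and normalizes it over clusters to get P(C | attributes). It assumes Gaussian expression attributes and multinomial binding-site counts; the function names and all toy numbers are hypothetical and not taken from the talk.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def joint_log_prob(c, expression, sites, prior, gauss_params, site_tables):
    """log P(A_1..A_n, T_1..T_k, C=c): log prior plus one term per attribute."""
    logp = math.log(prior[c])
    for x, params in zip(expression, gauss_params):
        mean, std = params[c]              # Gaussian parameters of this attribute in cluster c
        logp += gaussian_logpdf(x, mean, std)
    for d, table in zip(sites, site_tables):
        logp += math.log(table[c][d])      # multinomial P(T = d | C = c)
    return logp

def posterior(expression, sites, prior, gauss_params, site_tables):
    """P(C | attributes), obtained by normalizing the joint over clusters."""
    logs = [joint_log_prob(c, expression, sites, prior, gauss_params, site_tables)
            for c in range(len(prior))]
    top = max(logs)                        # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy model: 2 clusters, 1 expression attribute, 1 TF with 0..2 sites.
prior = [0.5, 0.5]
gauss_params = [[(0.0, 1.0), (2.0, 1.0)]]               # (mean, std) per cluster
site_tables = [[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]]      # P(T | C) per cluster
post = posterior([2.0], [2], prior, gauss_params, site_tables)
```

Because every attribute contributes an independent term, adding another attribute type only requires another per-attribute log-probability in the sum.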
Local Probability Models
[Figure: Cluster node with children A1, A2, TF1, TF2]
Multinomial: P(TF2 | C)
C   D=0   D=1   D=2   D=3
1   0.30  0.20  0.20  0.30
2   0.10  0.10  0.30  0.50
3   0.30  0.20  0.20  0.30
4   0.60  0.30  0.05  0.05
5   0.31  0.20  0.19  0.33
6   0.29  0.20  0.21  0.27
Gaussian: P(A1 | C)
C
1    0.10  0.20
2    0.10  0.50
3   -0.30  0.06
4    1.20  0.30
5    0.76  0.20
6   -0.23  0.20
Structure in Local Probability Models
[Figure: Cluster node with children A1, A2, TF1, TF2]
Full table, P(TF2 | C):
C   D=0   D=1   D=2   D=3
1   0.30  0.20  0.20  0.30
2   0.10  0.10  0.30  0.50
3   0.30  0.20  0.20  0.30
4   0.60  0.30  0.05  0.05
5   0.31  0.20  0.19  0.33
6   0.29  0.20  0.21  0.27
Compact table with a default row (* = all remaining clusters), P(TF2 | C):
C   D=0   D=1   D=2   D=3
2   0.10  0.10  0.30  0.50
4   0.60  0.30  0.05  0.05
*   0.30  0.20  0.20  0.30
[Figure: Cluster node with children E1, E2, TF1, TF2]
Context Specific Independence
Benefits:
  Identifies what features characterize each cluster
  Reduces bias during learning
  A compact and efficient representation
P(TF2 | C):
C   D=0   D=1   D=2   D=3
2   0.10  0.10  0.30  0.50
4   0.60  0.30  0.05  0.05
*   0.30  0.20  0.20  0.30
[Figure: edges from the Cluster node to the attributes, labeled with sets of cluster values: {2,4}, {}, {1,2,4}, {1,2,3,4,5}]
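One possible way to hold such a table in code is sketched below: clusters in the context set keep their own row and every other cluster falls back to the default (`*`) row. The class name `CSITable` and its methods are hypothetical; the numbers are the P(TF2 | C) values from the slide.

```python
class CSITable:
    """P(D | C) where only the clusters in `specific` have their own row."""

    def __init__(self, specific, default):
        self.specific = dict(specific)   # cluster -> row of probabilities
        self.default = list(default)     # shared row for all other clusters

    def prob(self, c, d):
        """Look up P(D = d | C = c), falling back to the default row."""
        return self.specific.get(c, self.default)[d]

    def context_set(self):
        """The edge label: clusters for which the attribute is informative."""
        return set(self.specific)

    def num_free_parameters(self):
        """Rows times (values - 1); assumes at least one cluster uses the default row."""
        rows = len(self.specific) + 1
        return rows * (len(self.default) - 1)

tf2 = CSITable(
    specific={2: [0.10, 0.10, 0.30, 0.50], 4: [0.60, 0.30, 0.05, 0.05]},
    default=[0.30, 0.20, 0.20, 0.30],
)
```

With six clusters and four values, a full table would need 6 × 3 = 18 free parameters, while this compact form needs 3 × 3 = 9.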
Scoring CSI Cluster Models
Represent conditional probabilities with different parametric families
  Gaussian, Multinomial, Poisson, …
Choose parameter priors from appropriate conjugate prior families
Score:
$$P(M \mid D) \propto P(D \mid M)\,P(M)$$
where
$$P(D \mid M) = \int P(D \mid M, \theta)\,P(\theta \mid M)\,d\theta$$
P(D | M) is the marginal likelihood and P(M) is the prior
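For a multinomial attribute with a Dirichlet prior, the integral above has a closed form. The sketch below is a minimal illustration under assumptions that are not from the talk: a symmetric Dirichlet prior with hyperparameter `alpha`, fully observed counts, and a uniform structure prior P(M) that is therefore omitted. The counts and function names are hypothetical.

```python
from math import lgamma

def log_marginal_multinomial(counts, alpha=1.0):
    """log of the integral of P(counts | theta) * Dirichlet(theta | alpha, ..., alpha)."""
    k = len(counts)
    n = sum(counts)
    result = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        result += lgamma(alpha + c) - lgamma(alpha)
    return result

def csi_local_score(counts_by_cluster, specific, alpha=1.0):
    """Log marginal likelihood of one attribute under a CSI table:
    one row per cluster in `specific`, one pooled default row for the rest."""
    n_values = len(next(iter(counts_by_cluster.values())))
    pooled = [0.0] * n_values
    score = 0.0
    for cluster, counts in counts_by_cluster.items():
        if cluster in specific:
            score += log_marginal_multinomial(counts, alpha)
        else:
            pooled = [p + c for p, c in zip(pooled, counts)]
    return score + log_marginal_multinomial(pooled, alpha)

# Hypothetical counts of D = 0..3 per cluster; clusters 1 and 3 behave alike.
counts = {
    1: [30, 20, 20, 30],
    2: [10, 10, 30, 50],
    3: [30, 20, 20, 30],
    4: [60, 30, 5, 5],
}
score_csi = csi_local_score(counts, {2, 4})          # rows for 2 and 4, default for 1 and 3
score_none = csi_local_score(counts, set())          # attribute independent of the cluster
score_full = csi_local_score(counts, {1, 2, 3, 4})   # a separate row for every cluster
```

On these counts the {2,4} structure scores higher than both alternatives, because the marginal likelihood rewards pooling clusters that share a distribution and penalizes pooling clusters that do not.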
Learning Structure – Naive Approach
A hard problem:
$$\#\text{structures} \sim 2^{NC}$$
where N = # nodes and C = # clusters
“Standard” approach:
  Learn model parameters using EM
  Try “nearby” structures and learn parameters for each one using EM
  Choose the best structure
Basic problem: efficiency
[Figure: current structure (Cluster node C with children E1, E2, TF1, TF2; edge labels {2,4}, {1,2,3}, {}, {2}) and two candidate structures marked “?” with edge labels {2,4}, {1,2,3}, {3}, {} and {}, {1,2,3}, {3}, {2}]
P(D | C):
C   D=0   D=1   D=2
2   0.10  0.10  0.30
4   0.60  0.30  0.05
*   0.30  0.20  0.20
Learning Structure – Structural EM
Given complete data, we can evaluate each edge's MAP parameters separately
We compute EM only once for each iteration
Guaranteed to converge to a local optimum
Steps:
  Learn model parameters using EM
  Soft assignment for genes: compute expected sufficient statistics
  Use the “completed” data to evaluate each edge separately to find the best model
[Figure: current structure (Cluster node C with children E1, E2, TF1, TF2; edge labels {2,4}, {1,2,3}, {}, {2}) and two candidate structures marked “?” with edge labels {2,4}, {3}, {}, {1,2,3} and {}, {3}, {2}, {1,2,3}]
Soft assignment of genes to clusters:
Gene  C1   C2   C3
1     0.5  0.3  0.2
2     0.6  0.1  0.3
3     0.1  0.2  0.7
4     0.1  0.2  0.7
5     0.2  0.6  0.2
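To make the "completed data" step concrete, the sketch below turns soft assignments into expected sufficient statistics: fractional counts for a discrete attribute, and weighted sums for a Gaussian one. The responsibilities are the values from the table on this slide; the binding-site counts, expression values, and function names are hypothetical.

```python
def expected_counts(responsibilities, values, n_values):
    """Expected count of each attribute value per cluster:
    N[c][d] = sum over genes g of P(C = c | gene g) * [value_g == d]."""
    n_clusters = len(responsibilities[0])
    counts = [[0.0] * n_values for _ in range(n_clusters)]
    for resp, d in zip(responsibilities, values):
        for c, r in enumerate(resp):
            counts[c][d] += r
    return counts

def expected_gaussian_stats(responsibilities, xs):
    """Per cluster: total weight, weighted sum of x, weighted sum of x squared."""
    n_clusters = len(responsibilities[0])
    stats = [[0.0, 0.0, 0.0] for _ in range(n_clusters)]
    for resp, x in zip(responsibilities, xs):
        for c, r in enumerate(resp):
            stats[c][0] += r
            stats[c][1] += r * x
            stats[c][2] += r * x * x
    return stats

# Soft assignments of genes 1..5 to clusters C1..C3 (from the slide).
resp = [
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
]
tf_sites = [0, 0, 2, 2, 1]                 # hypothetical binding-site counts per gene
expression = [1.2, 0.9, -0.4, -0.6, 0.1]   # hypothetical expression levels per gene
counts = expected_counts(resp, tf_sites, n_values=3)
stats = expected_gaussian_stats(resp, expression)
```

Once these statistics are available, each attribute's local model can be scored on its own, without rerunning EM for every candidate structure.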
Results on Synthetic Data
Basic approach:
  Generate data from a known structure
  Evaluate learned structures for different sample sizes (200–800)
  Add “noise” of unrelated samples (10–30%) to the training set to simulate genes that do not fall into “nice” functional categories
  Test the learned model for structure, as well as for correlation between its tagging and the one given by the original model
Main results:
  Cluster number: models with fewer clusters were sharply penalized. Models with 1–2 additional clusters often got similar scores, with “degenerate” clusters to which none of the real samples were classified.
  Structure accuracy: very few false negative edges, 10–20% false positive edges (score dependent)
  Mutual information ratio: max for 800 samples, 95–100% for 500 samples, and ~90% for 200 samples
  Learned clusters were very informative
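The slides do not define the mutual information ratio. One plausible reading, used in the sketch below purely as an assumption, is the mutual information between the original and the learned cluster tags divided by the mutual information of the original tags with themselves (their entropy), so that a perfect recovery up to relabeling gives 1.0. The function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two labelings."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi

def information_ratio(true_labels, learned_labels):
    """MI(true, learned) / MI(true, true); assumes at least two true clusters."""
    return (mutual_information(true_labels, learned_labels)
            / mutual_information(true_labels, true_labels))

true_tags = [0, 0, 1, 1, 2, 2]
relabeled = [2, 2, 0, 0, 1, 1]   # same partition, different cluster names
merged = [0, 0, 0, 0, 1, 1]      # two true clusters collapsed into one
ratio_relabeled = information_ratio(true_tags, relabeled)
ratio_merged = information_ratio(true_tags, merged)
```

The ratio is invariant to how the clusters are numbered, which matters here because a learned model has no reason to reuse the original cluster indices.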
Yeast Stress Data (Gasch et al. 2001)
Examines the response of yeast to stress situations
Total of 93 arrays
We selected ~900 genes that changed in a selective manner
Treatment steps:
  Initial clustering
  Found putative binding sites based on clusters
  Re-clustered with these sites
Stress Data -- CSI Clusters
[Figure: CSI Clusters. Mean expression level (axis from -2 to 4) across conditions: HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP Steady]
Promoters Analysis
Cluster 3
  MIG1: CCCCGC, CGGACC, ACCCCG
  GAL4: CGGGCC
  Others: CCAATCA
[Figure: mean expression level (axis from -2 to 4) across conditions: HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP Steady]
Promoters Analysis
Cluster 7
  GCN4: TGACTCA
  Others: CGGAAAA, ACTGTGG
[Figure: mean expression level (axis from -2 to 4) across conditions: HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP Steady]
Discussion
Goals:
  Identify binding sites/transcription factors
  Understand interactions among transcription factors: “combinatorial effects” on expression
  Predict role/function of the genes
Methods:
  Integration of a model of statistical patterns of binding sites (see Holmes & Bruno, ISMB’00)
  Additional dependencies among attributes
  Tree augmented Naive Bayes
  Probabilistic Relational Models (see poster)