ming-chih kao, phd university of michigan medical school [email protected]

Integrating Cross-Platform Microarray Data by Second-order Analysis: Functional Annotation and Network Reconstruction

Ming-Chih Kao, PhD

University of Michigan Medical School

[email protected]

Xianghong Jasmine Zhou

Assistant Professor of Biological Sciences

USC

Wing Hung Wong

Professor of Statistics and of Health Research and Policy

Stanford University

2nd-Order Analysis: Current Challenges in Microarray Data Analysis

1. How to effectively combine the expression data sets generated with different technology/laboratory platforms?

2. How to identify functionally related genes that show no co-expression pattern?

3. How to identify transcription cascades?

Microarray Platforms

2nd-Order Analysis: Multiple Microarray Technology Platforms

2nd-Order Analysis: Public Microarray Data Sources

Organism         Experiments   Datasets
S. cerevisiae            788         61
C. elegans               348         15
A. thaliana              736         44
M. musculus            1,553         20
H. sapiens             4,135         90

[Figure: Transcription Factor 1 and gene1 to gene7; label: amplification of signal.]

[Figure: first-order correlation vs. second-order correlation for genes G1 to G4. Axes: expression vs. experiments (Cell Cycle, Stress, Osmotic, Starvation, Copper, Zinc); expression correlation vs. experimental groups.]

[Figure: Regulation of Cell Cycle: POG1-MPT5 and SDA1-CDC5. Panels: expression of POG1-MPT5; expression of SDA1-CDC5; expression correlation of POG1-MPT5 and SDA1-CDC5. Horizontal axis: experimental groups (Chromatin Silencing, Amino acid Starvation, Gamma Radiation, Protein Metabolism, DNA Damage, Heat Steady).]

2nd-Order Analysis: An Example

Group functionally related genes that may not exhibit similar expression patterns?

Data: Stanford Microarray Database (cDNA array); NCBI GEO Database (Affymetrix array); Rosetta Compendium (cDNA array).

39 experimental groups subjected to different types of perturbation, such as cell cycle, heat shock, osmotic pressure, starvation, zinc, nitrogen depletion, etc.
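The idea of correlating expression correlations across experimental groups can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that the first-order measure is the Pearson correlation of two genes within one data set, and that the second-order measure is the Pearson correlation of two such per-data-set profiles. The function names and the toy data are hypothetical.

```python
import numpy as np

def first_order_profile(datasets, gene_a, gene_b):
    """Pearson correlation of two genes inside each data set.

    datasets: list of 2-D arrays (genes x experiments). Row indices are
    assumed to refer to the same genes in every data set.
    Returns one correlation value per data set.
    """
    return np.array([np.corrcoef(d[gene_a], d[gene_b])[0, 1] for d in datasets])

def second_order_correlation(datasets, pair_1, pair_2):
    """Pearson correlation between the first-order profiles of two gene pairs."""
    profile_1 = first_order_profile(datasets, *pair_1)
    profile_2 = first_order_profile(datasets, *pair_2)
    return np.corrcoef(profile_1, profile_2)[0, 1]

# Toy data (hypothetical): x and v are uncorrelated expression vectors.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
v = np.array([1.0, -1.0, 0.0, 0.0, -1.0, 1.0])

# "on" data sets: genes 0,1 are co-expressed and genes 2,3 are co-expressed.
# "off" data sets: neither pair is co-expressed.
on = np.array([x, x, v, v])
off = np.array([x, v, v, x])
datasets = [on, off, on, off]

pair_profile = first_order_profile(datasets, 0, 1)     # switches on and off
cross_profile = first_order_profile(datasets, 0, 2)    # genes from different pairs
second_order = second_order_correlation(datasets, (0, 1), (2, 3))
```

In this toy setup genes 0 and 2 are never co-expressed in any data set, yet the pairs (0, 1) and (2, 3) turn their co-expression on and off in the same data sets, so the two pairs have a high second-order correlation. Because each correlation is computed inside a single data set, values from different platforms are never compared directly.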

2nd-Order Analysis: Validation

43 functional classes

2,429 genes

5,142 doublets

278,799 quadruplets

Homogeneous quadruplets: 84%

Heterogeneous quadruplets: 16%
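The homogeneous/heterogeneous split can be illustrated with a short Python sketch. The slide does not define the terms, so the sketch assumes that a quadruplet is a pair of doublets and that it counts as homogeneous when all four genes share at least one functional class. Gene names, class labels, and function names are hypothetical.

```python
def is_homogeneous(quadruplet, annotations):
    """True if all four genes share at least one functional class.

    quadruplet: ((gene_a, gene_b), (gene_c, gene_d))
    annotations: dict mapping gene name -> set of functional classes
    (This definition of "homogeneous" is an assumption.)
    """
    (a, b), (c, d) = quadruplet
    shared = annotations[a] & annotations[b] & annotations[c] & annotations[d]
    return bool(shared)

def homogeneous_fraction(quadruplets, annotations):
    """Fraction of quadruplets that are homogeneous."""
    hits = sum(is_homogeneous(q, annotations) for q in quadruplets)
    return hits / len(quadruplets)

# Hypothetical annotations and quadruplets for illustration only.
annotations = {
    "gA": {"F1"}, "gB": {"F1", "F2"}, "gC": {"F1"},
    "gD": {"F1"}, "gE": {"F2"}, "gF": {"F2"},
}
quadruplets = [
    (("gA", "gB"), ("gC", "gD")),  # all four genes share F1
    (("gA", "gB"), ("gE", "gF")),  # no class common to all four
]
fraction = homogeneous_fraction(quadruplets, annotations)
```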

2nd-Order Analysis: Validation: Scheme

2nd-Order Analysis: Validation: Comparison

2nd-Order Analysis: Validation: Results

2nd-order analysis groups functionally related genes.

The derived quadruplets give rise to a set of 2,597 distinct and novel gene pairs.

97% of the 2,597 pairs are missed by the standard methods.

Reasons for the poor performance of the 1st-order method:

Inter-dataset variations.

Cross-doublet gene pairs need not show high expression correlation.

Sensitivity to gene pairs that are co-expressed in only a subset of the data sets.

[Figure: networks over genes a to f, one per data set (Cell Cycle, Heat shock, Starvation, Nitrogen Depletion, Radiation, Osmotic pressure).]

2nd-Order Analysis: Interaction Modules

2nd-Order Analysis: Interaction Modules: Leave-one-out Cross Validation

For each gene that occurred in the 100 tightest and most stable clusters of known genes, we masked its function, made a prediction based on our 2-step procedure, and checked the predicted function against its true function.

We made predictions for 179 doublets, of which 163 were correct: a 91% success ratio.
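The masking procedure can be sketched as follows. The slides do not spell out the 2-step prediction procedure, so this sketch swaps in a simple majority vote among the other annotated members of the same cluster. All names and the toy clusters are hypothetical.

```python
from collections import Counter

def predict_function(gene, cluster, known):
    """Predict a gene's function by majority vote of the other cluster members.

    Stand-in for the 2-step procedure, which is not described in the slides.
    Returns None when no other member of the cluster is annotated.
    """
    votes = Counter(known[g] for g in cluster if g != gene and g in known)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

def leave_one_out(clusters, known):
    """Mask each annotated gene in turn and count correct predictions."""
    correct = total = 0
    for cluster in clusters:
        for gene in cluster:
            if gene not in known:
                continue
            predicted = predict_function(gene, cluster, known)
            if predicted is None:
                continue
            total += 1
            correct += int(predicted == known[gene])
    return correct, total

# Hypothetical clusters and annotations for illustration only.
clusters = [["g1", "g2", "g3"], ["g4", "g5", "g6", "g7"]]
known = {"g1": "F1", "g2": "F1", "g3": "F1",
         "g4": "F2", "g5": "F2", "g6": "F2", "g7": "F1"}
correct, total = leave_one_out(clusters, known)

# The ratio reported on the slide: 163 correct out of 179 predictions.
slide_ratio_percent = round(100 * 163 / 179)
```

In the toy data only g7 is mispredicted, because its cluster is dominated by a different function; this is the kind of error the success ratio measures.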

2nd-Order Analysis: Interaction Modules: Functional Prediction

79 functions predicted for 69 unknown yeast genes involved in diverse biological processes.

Experimental studies in the literature and in our laboratory:

YLR183C in “mitosis”: regulation of G1/S transition.

YLL051C in “cation transport”: ferric-chelate reductase activity and iron-regulated expression.

2nd-Order Analysis: Frequently Occurring Tight Clusters

Transcription Factors

2nd-Order Analysis: Frequently Occurring TCs with 2nd-Order Correlation

Transcription Factor Set 1

Transcription Factor Set 2

Cooperativity

3 types of transcription cascades

2nd-Order Analysis: ChIP-Chip

2nd-Order Analysis: Transcription Module Results

60 transcription modules identified.

34 pairs showed high 2nd-order correlation.

29% (P < 10^-5) of those module pairs are participants in transcription cascades:

2 pairs in Type I cascades

8 pairs in Type II cascades

3 pairs in Type III cascades

These transcription cascades interconnect into a partial cellular regulatory network.

[Figure: Leu3 module vs. Met4 module. Panels: average expression; average expression correlation (vertical axis from -1.0 to 1.0).]

2nd-Order Analysis: Leu3 and Met4 Transcription Cascade

2nd-Order Analysis: Hierarchical clustering of transcriptional modules

2nd-Order Analysis: Assigning transcription factors to pathways

For an unknown transcription factor in a module cluster, we can annotate its function by integrating 2 types of evidence:

1. the functions of known genes in its target module

2. the functions of known transcription factors regulating other modules in the same cluster
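One way to combine the two kinds of evidence is a weighted vote, sketched below. The slides do not say how the evidence is integrated, so the scoring rule, the equal default weights, and every name here are assumptions made for illustration.

```python
from collections import Counter

def annotate_tf(target_gene_functions, cluster_tf_functions,
                w_targets=1.0, w_tfs=1.0):
    """Rank candidate functions for an unknown transcription factor.

    target_gene_functions: functions of known genes in its target module
    cluster_tf_functions: functions of known transcription factors that
        regulate other modules in the same cluster
    Each evidence source contributes the fraction of its labels that name
    a function, scaled by a hypothetical weight.
    """
    scores = Counter()
    for source, weight in ((target_gene_functions, w_targets),
                           (cluster_tf_functions, w_tfs)):
        if not source:
            continue  # no evidence of this type available
        for function in source:
            scores[function] += weight / len(source)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical evidence for illustration only.
ranked = annotate_tf(["F1", "F1", "F2"], ["F1", "F3"])
best_function = ranked[0][0]
```

Normalizing each source by its size keeps a large target module from drowning out the few transcription factors in the cluster; other weighting schemes would be equally defensible.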

2nd-Order Analysis: Summary

We present a framework to integrate many microarray data sets in a platform-independent way, and investigate its properties and applications.

Group together functionally related genes without direct expression similarity

Cluster the functional interactions into modules and functionally annotate unknown genes

Reveal the cooperativity in the regulatory network and reconstruct transcription cascades
