Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays

William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright

Department of Biostatistics, University of North Carolina, Chapel Hill
Division of Human Cancer Genetics, Ohio State University

Post on 21-Dec-2015


Page 1

Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays

Department of Biostatistics, University of North Carolina, Chapel Hill

Division of Human Cancer Genetics, Ohio State University

William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright

Page 2

Measuring gene expression with the Affymetrix GeneChip

Perfect Match (PM): 25 bases complementary to a region of the gene

Mismatch (MM): the middle base is different

[Figure: PM/MM probe pairs along the coding portion of gene X, ending at the polyA tail]

•cRNA from sample mRNA is put on the chip

•intensity of binding reflects gene expression

Page 3

Reproducibility of Probe Sensitivities

Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.

Page 4

The Li-Wong Model

Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.

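Li-Wong Full (LWF)

$$PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + e, \qquad MM_{ij} = \nu_j + \theta_i \alpha_j + e, \qquad e \sim N(0, \sigma^2)$$

Li-Wong Reduced (LWR)

$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon, \qquad \varepsilon \sim N(0, 2\sigma^2)$$

Identifiability constraint

$$\sum_j \phi_j^2 = J$$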
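As a rough illustration of the two models (a minimal sketch, not the authors' code), the following Python simulates PM and MM intensities under the full model, read here as PM_ij = nu_j + theta_i*alpha_j + theta_i*phi_j + e and MM_ij = nu_j + theta_i*alpha_j + e, and then forms the reduced-model differences y_ij = PM_ij - MM_ij. The function names and every parameter value are hypothetical.

```python
import math
import random

def normalize_phi(phi):
    """Rescale sensitivities so that sum_j phi_j^2 = J (identifiability constraint)."""
    J = len(phi)
    scale = math.sqrt(J / sum(p * p for p in phi))
    return [p * scale for p in phi]

def simulate_lwf(theta, nu, alpha, phi, sigma, rng):
    """Simulate PM[i][j] and MM[i][j] for arrays i and probe pairs j (full model)."""
    pm, mm = [], []
    for th in theta:
        pm.append([n + th * a + th * p + rng.gauss(0.0, sigma)
                   for n, a, p in zip(nu, alpha, phi)])
        mm.append([n + th * a + rng.gauss(0.0, sigma)
                   for n, a in zip(nu, alpha)])
    return pm, mm

def differences(pm, mm):
    """Reduced-model data: y[i][j] = PM[i][j] - MM[i][j]."""
    return [[p - m for p, m in zip(prow, mrow)] for prow, mrow in zip(pm, mm)]

# Hypothetical values: 3 arrays, 4 probe pairs.
rng = random.Random(0)
phi = normalize_phi([0.5, 1.0, 1.5, 2.0])
nu = [20.0, 30.0, 25.0, 35.0]
alpha = [0.2, 0.4, 0.1, 0.3]
theta = [100.0, 200.0, 400.0]
pm, mm = simulate_lwf(theta, nu, alpha, phi, sigma=5.0, rng=rng)
y = differences(pm, mm)
```

Setting `sigma=0.0` makes every y_ij equal theta_i*phi_j exactly, which is a quick way to check that the differencing removes the nu_j and alpha_j terms.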

Page 5

The Li-Wong Model

Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.

Li-Wong Full (LWF) and Li-Wong Reduced (LWR) models as on the previous slide, with the indices annotated:

i: ith array

j: jth probe pair

J: total no. of probe pairs

Page 6

The Li-Wong Model

Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.

Li-Wong Full (LWF) and Li-Wong Reduced (LWR) models as on the previous slides, with indices and parameters annotated:

i: ith array

j: jth probe pair

J: total no. of probe pairs

$\theta_i$: expression

$\phi_j$: sensitivities

Page 7

How to compare gene expression indexes?

•We get maximum likelihood estimates for $\theta$ using either full data (LWF) or reduced data (LWR)

•The Affymetrix software computes:

Average Difference (AD): $AD = \sum_j y_j / J$

Log-Average (LA): $LA = 10 \sum_j \log(PM_j/MM_j) / J$

•The log-average might perform particularly poorly. Note that if the $\nu$ terms are small and the error variance is small,

$$PM_j/MM_j \approx (\theta\alpha_j + \theta\phi_j)/(\theta\alpha_j) = (\alpha_j + \phi_j)/\alpha_j,$$

which does not depend on $\theta$.
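The point about the log-average can be checked numerically. The sketch below computes AD and LA for a single probe set in the limiting case with nu_j = 0 and no error, at two expression levels. It assumes the logarithm is base 10 (the slide writes only "log"), and the alpha and phi values are hypothetical.

```python
import math

def average_difference(pm, mm):
    """AD = sum_j (PM_j - MM_j) / J for one probe set on one array."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def log_average(pm, mm):
    """LA = 10 * sum_j log10(PM_j / MM_j) / J (base 10 is an assumption)."""
    return 10.0 * sum(math.log10(p / m) for p, m in zip(pm, mm)) / len(pm)

def noise_free_probes(theta, alpha, phi):
    """PM and MM with nu_j = 0 and no error: the limiting case discussed above."""
    pm = [theta * (a + p) for a, p in zip(alpha, phi)]
    mm = [theta * a for a in alpha]
    return pm, mm

alpha = [0.2, 0.4, 0.1, 0.3]   # hypothetical values
phi = [0.5, 1.0, 1.5, 0.8]     # hypothetical values
for theta in (100.0, 1000.0):
    pm, mm = noise_free_probes(theta, alpha, phi)
    print(theta, average_difference(pm, mm), log_average(pm, mm))
```

In this limiting case AD scales with theta while LA is the same at both expression levels, which is why LA carries little information about theta here.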

Page 8

•We gain insight by assuming Li-Wong model is true. Then what are the consequences?

•For large sample sizes, the probe-specific parameters will be well-estimated

Page 9

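Compare LW estimators directly:

$$RE(\text{full}, \text{reduced}) = \frac{var(\hat\theta_{reduced})}{var(\hat\theta_{full})} = \frac{2\left[\sum_j (\alpha_j + \phi_j)^2 + \sum_j \alpha_j^2\right]}{J} \ge 2.0$$

Comparing to AD is tricky, but with a correction factor AD is also an unbiased estimate of $\theta$:

$$\hat\theta_{AD} = \frac{J}{\sum_j \phi_j}\, AD$$

$$RE(\text{reduced}, AD) = \frac{var(\hat\theta_{AD})}{var(\hat\theta_{reduced})} = \frac{1}{1 - var(\phi)} \ge 1.0$$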
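The two relative efficiencies on this slide can be evaluated for any set of probe parameters. This is a minimal sketch under stated assumptions: the probe-specific parameters are treated as known, alpha_j and phi_j are nonnegative, phi is scaled so that sum_j phi_j^2 = J, and var(phi) is the population variance of the phi_j. The function names and values are hypothetical.

```python
import math

def normalize_phi(phi):
    """Rescale so that sum_j phi_j^2 = J (identifiability constraint)."""
    J = len(phi)
    scale = math.sqrt(J / sum(p * p for p in phi))
    return [p * scale for p in phi]

def re_full_vs_reduced(alpha, phi):
    """2 * [sum_j (alpha_j + phi_j)^2 + sum_j alpha_j^2] / J."""
    J = len(phi)
    total = sum((a + p) ** 2 for a, p in zip(alpha, phi)) + sum(a * a for a in alpha)
    return 2.0 * total / J

def re_reduced_vs_ad(phi):
    """1 / (1 - var(phi)), with var(phi) the population variance of the phi_j."""
    J = len(phi)
    mean_phi = sum(phi) / J
    var_phi = sum((p - mean_phi) ** 2 for p in phi) / J
    return 1.0 / (1.0 - var_phi)

alpha = [0.2, 0.4, 0.1, 0.3]                  # hypothetical values
phi = normalize_phi([0.5, 1.0, 1.5, 2.0])     # hypothetical values
print(re_full_vs_reduced(alpha, phi), re_reduced_vs_ad(phi))
```

Because sum_j phi_j^2 = J, the population variance of phi equals 1 minus the squared mean of phi, so the second quantity grows as the sensitivities become more unequal.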

Page 10

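•This also gives insight into “perfect match only” analyses:

$$RE(\text{full}, \text{PM-only}) = \frac{var(\hat\theta_{PM})}{var(\hat\theta_{full})} = 1 + \frac{\sum_j \alpha_j^2}{\sum_j (\alpha_j + \phi_j)^2}$$

and $1 \le RE \le 2$.

Furthermore, PM-only is always at least twice as efficient as LWR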
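A short sketch of the PM-only comparison follows. The first function is the expression on this slide. The second is not on the slide: it is derived here under the assumptions that the probe parameters are known, that each PM and MM value has error variance sigma^2 (so each difference has variance 2*sigma^2), and that alpha_j and phi_j are nonnegative. Names and values are hypothetical.

```python
def re_full_vs_pm_only(alpha, phi):
    """1 + sum_j alpha_j^2 / sum_j (alpha_j + phi_j)^2."""
    num = sum(a * a for a in alpha)
    den = sum((a + p) ** 2 for a, p in zip(alpha, phi))
    return 1.0 + num / den

def re_pm_only_vs_reduced(alpha, phi):
    """Derived under the stated assumptions:
    var(reduced) / var(PM-only) = 2 * sum_j (alpha_j + phi_j)^2 / sum_j phi_j^2.
    The denominator equals J when the identifiability constraint holds."""
    num = sum((a + p) ** 2 for a, p in zip(alpha, phi))
    den = sum(p * p for p in phi)
    return 2.0 * num / den

alpha = [0.2, 0.4, 0.1, 0.3]   # hypothetical values
phi = [0.6, 0.9, 1.2, 1.1]     # hypothetical values
print(re_full_vs_pm_only(alpha, phi), re_pm_only_vs_reduced(alpha, phi))
```

With nonnegative alpha_j and phi_j, each (alpha_j + phi_j)^2 is at least phi_j^2 and at least alpha_j^2, which gives the two bounds stated on the slide.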

Page 11

Empirical Comparisons

•We propose that an expression index is “good” if it has a high correlation with the underlying true expression (which is usually unknown).

•this correlation can be estimated using a specially designed mixing experiment

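•if r is the correlation coefficient between the measured index and true expression, the “relative efficiency” of two indexes $\hat\theta$ and $\tilde\theta$ can be estimated as

$$\frac{r_{\hat\theta}^2/(1 - r_{\hat\theta}^2)}{r_{\tilde\theta}^2/(1 - r_{\tilde\theta}^2)}$$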

Page 12

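Suppose the true underlying gene expression for a given gene is $\theta$. Consider two indices of gene expression

$$\hat\theta = \beta_0 + \beta_1\theta + e, \qquad e \sim N(0, \sigma^2),$$

$$\tilde\theta = \gamma_0 + \gamma_1\theta + e, \qquad e \sim N(0, \tau^2).$$

$(\hat\theta - \beta_0)/\beta_1$ is an unbiased estimate of $\theta$, with

$$var\left((\hat\theta - \beta_0)/\beta_1\right) = \sigma^2/\beta_1^2$$

And we have

$$RE(\hat\theta, \tilde\theta) = \frac{var\left((\tilde\theta - \gamma_0)/\gamma_1\right)}{var\left((\hat\theta - \beta_0)/\beta_1\right)} = \frac{\tau^2/\gamma_1^2}{\sigma^2/\beta_1^2}$$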

Page 13

Can we estimate this relative efficiency?

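•Suppose we could do a regression of $\hat\theta$ on $\theta$.

•the ratio of explained to residual variance in the model can be shown to be

$$\beta_1^2\, var(\theta)/\sigma^2 = \frac{r_{\hat\theta}^2}{1 - r_{\hat\theta}^2}$$

and similarly for $\tilde\theta$, so

$$\frac{r_{\hat\theta}^2/(1 - r_{\hat\theta}^2)}{r_{\tilde\theta}^2/(1 - r_{\tilde\theta}^2)} = RE(\hat\theta, \tilde\theta)$$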
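The estimate above is simple to compute once a correlation is available for each index. The sketch below turns a correlation r into the ratio r^2/(1 - r^2), takes the ratio of two such quantities as the estimated relative efficiency, and also summarizes a set of per-gene correlations by the median ratio. The function names and the example correlations are hypothetical.

```python
import statistics

def explained_to_residual(r):
    """Ratio of explained to residual variance, r^2 / (1 - r^2), for |r| < 1."""
    return r * r / (1.0 - r * r)

def estimated_re(r_first, r_second):
    """Estimated relative efficiency of the first index relative to the second."""
    return explained_to_residual(r_first) / explained_to_residual(r_second)

def median_ratio(correlations):
    """Median over genes of r^2 / (1 - r^2)."""
    return statistics.median(explained_to_residual(r) for r in correlations)

# Hypothetical correlations for one gene under two indexes.
print(estimated_re(0.9, 0.6))
# Hypothetical per-gene correlations for one index.
print(median_ratio([0.5, 0.8, 0.9]))
```

A value above 1 from `estimated_re` favors the first index, since its adjusted estimate of the true expression has the smaller variance.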

Page 14

Can we estimate r without ever knowing the true expressions $\theta$?

•Yes, with a specially designed mixing experiment

•we seek two contrasting conditions in which many genes will be differentially expressed

Page 15

Experimental Design

[Flowchart; recoverable labels:]

•Cells: Human Fibroblasts (GM 08330)

•Cell culture: 20% FBS, 5 passages

•Serum starvation: 0.1% FBS

•Serum stimulation: 20% FBS

•Durations shown: 48h, 24h

•RNA extraction: harvest total RNA (Stimulated, Starved)

•Add Bacterial Control Genes: Lys, Phe, Dap, Thr

•Produce 50:50 group

•Produce duplicates each day for 3d (6 replicates for each condition)

•Synthesize cDNA, cRNA; fragment

•Add Hybridization Control Genes: BioB, BioC, BioD, Cre

•Hybridize HuGeneFL

•Data Reduction: Gene Expression Indexes

Page 16

BIN1 expression

[Figure: full-model (LWF) estimates of BIN1 expression for the Stim, 50:50 and Starved arrays.]

True expression = average of Stim, Starved

Page 17

BIN1 expression

[Figure: full-model (LWF) estimates of BIN1 expression, with Stim, 50:50 and Starved coded as 1, 2, 3 on the horizontal axis.]

Page 18

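Note that

$$r_{\hat\theta,\theta} = r_{\hat\theta,X} \quad \text{or} \quad r_{\hat\theta,\theta} = -r_{\hat\theta,X}$$

where X = 1, 2, 3 (say) for Stim, 50:50, Starved, respectively. Because the true expression in the 50:50 group is the average of Stim and Starved, it is linear in X, so $r^2$ can be computed from X without knowing $\theta$.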
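The claim can be checked with a small numerical example. The sketch below computes the Pearson correlation of a measured index with the group code X and with a true expression that is linear in X. The 18 index values and the two true expression levels are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# 18 arrays: 6 Stim (X=1), 6 50:50 (X=2), 6 Starved (X=3).
X = [1] * 6 + [2] * 6 + [3] * 6

# Hypothetical true expression; the 50:50 value is the average of the other two.
stim, starved = 900.0, 300.0
theta_by_group = {1: stim, 2: (stim + starved) / 2.0, 3: starved}
theta = [theta_by_group[x] for x in X]

# Hypothetical measured index values for the same 18 arrays.
index = [880, 930, 905, 870, 950, 910,
         580, 620, 640, 590, 610, 575,
         320, 290, 310, 280, 335, 305]

r_x = pearson(index, X)
r_theta = pearson(index, theta)
print(r_x, r_theta)
```

Here the true expression decreases with X, so the two correlations differ only in sign and their squares agree.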

Page 19

Mean probe intensity per array

Stim 50:50 Starved

Overall intensity higher in Stimulated

Page 20

Coefficients of variation for assay (individual probes) and gene expression indexes

[Figure: three histograms of CV (horizontal axis: CV, 0.0 to 2.0). Panels and annotated values: Assay Stim (# probes), 0.121; LWF Stim (# genes), 0.149; Affymetrix AD Stim (# genes), 0.293.]

Page 21

[Figure: four panels (LWF, AD, LWR, LA); rows and columns of each panel are grouped as Stim, 50:50, Starved.]

Correlation matrix of 18 arrays as a colorized image for each expression index.

Page 22

Comparing Models: Cluster Analysis

[Figure: cluster analysis of the 18 arrays (Stim 1-6, 50:50 1-6, Strv 1-6), one panel per index: Affymetrix Log Ave, Affymetrix Ave Diff, Full Model, Reduced Model.]

Page 23

Relative Efficiency

[Figure: bar charts of Median($r^2/(1-r^2)$) for LWF, LWR, AD and LA; two panels, Unscaled and Scaled; vertical axis 0.0 to 1.5.]

Page 24

Correlation of duplicate measurements of 149 genes

LWF median r=.74

LWR median r=.43

AD median r=.08

LA median r=.17

Page 25

Number of unexpressed genes

•Only 0.2% of the LW estimates are negative

•50:50 group has fewest negative estimates

•could this indicate very few unexpressed genes?

Stim 50:50 Starved

Page 26

A conservative approach to estimating number of unexpressed genes

•Let U denote number of unexpressed genes

•genes are ranked according to expression index

$$U \le 2 \times \text{median}(\text{rank of } U \text{ genes among all genes})$$

•This is useful if we can get a random sample of unexpressed genes

Unexpressed population

Gene expression index

Page 27

•We use the spiked-out bacterial control genes as a sample of “unexpressed” genes

•the 4 genes are represented 3 times each (different portions of mRNA), for a total of 12 probe sets

•Based on this reasoning, we estimate that greater than 88% of the genes are expressed, even in the Starved samples
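The ranking argument can be sketched in a few lines. The code below ranks all probe sets by expression index (rank 1 is the lowest), takes the median rank of a set of control probe sets assumed to be unexpressed, and doubles it to get a conservative bound on U. The data are synthetic and the function names are hypothetical; this is not the authors' procedure in detail.

```python
import statistics

def ranks_ascending(values):
    """Rank 1 = lowest expression index; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks

def unexpressed_upper_bound(index_values, control_positions):
    """2 * median rank of the control (assumed unexpressed) probe sets."""
    ranks = ranks_ascending(index_values)
    control_ranks = [ranks[k] for k in control_positions]
    return 2 * statistics.median(control_ranks)

# Synthetic example: 50 probe sets whose index values are a permutation of 1..50.
index_values = [float((k * 7) % 50 + 1) for k in range(50)]
controls = [0, 3, 8, 15]   # hypothetical positions of the control probe sets

bound = unexpressed_upper_bound(index_values, controls)
fraction_expressed = 1.0 - bound / len(index_values)
print(bound, fraction_expressed)
```

The bound is only as good as the assumption that the controls behave like a random sample of the unexpressed genes.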

Page 28

Rank of expression index variance across the 6 Stimulated arrays versus rank of index mean

[Figure: two scatter plots, AD and LWF, of Rank(var) against Rank(mean), both axes 0 to 6000; the control genes Dap, Thr, Phe and Lys are marked, with the label "Truly absent in stim group".]

Very low estimated expression for truly absent genes when using LWF

Page 29

Present/absent calls

•We use the statistic $z = \hat\theta / SE(\hat\theta)$ to declare genes present/absent (absolute call)

•we find the vast majority of genes on the array appear to be present

•for the spiked in/out genes, we find vastly improved present/absent calling using LW estimates
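A present/absent call based on z can be sketched as follows. The slides do not say how the standard error is computed or which cutoff is used, so this example makes its own assumptions: the reduced model with known sensitivities phi_j, a least-squares estimate of theta for one array, a residual-based standard error, and a hypothetical threshold of 2.0. All data values are invented.

```python
import math

def theta_hat_and_se(y, phi):
    """Least-squares estimate of theta in y_j = theta * phi_j + error, with its SE."""
    J = len(y)
    sum_phi2 = sum(p * p for p in phi)
    theta_hat = sum(yj * p for yj, p in zip(y, phi)) / sum_phi2
    residuals = [yj - theta_hat * p for yj, p in zip(y, phi)]
    s2 = sum(r * r for r in residuals) / (J - 1)   # residual variance estimate
    return theta_hat, math.sqrt(s2 / sum_phi2)

def z_statistic(y, phi):
    """z = theta_hat / SE(theta_hat)."""
    theta_hat, se = theta_hat_and_se(y, phi)
    return theta_hat / se

def call_present(y, phi, threshold=2.0):
    """Declare the gene present when z exceeds the (hypothetical) threshold."""
    return z_statistic(y, phi) > threshold

phi = [1.2, 0.8, 1.0, 0.9, 1.1]                    # hypothetical sensitivities
y_present = [121.0, 79.0, 101.0, 88.0, 112.0]      # differences for an expressed gene
y_absent = [3.0, -2.0, 1.0, -4.0, 2.0]             # differences for an unexpressed gene
print(call_present(y_present, phi), call_present(y_absent, phi))
```

Sweeping the threshold over a range of values and recording the calls for known spiked-in and spiked-out genes is one way to trace out a ROC curve for such a rule.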

Page 30

ROC curve - spiked in/out genes

[Figure: horizontal axis False Positive Rate (1 - Specificity), vertical axis 1 - False Negative Rate (Sensitivity), both 0.0 to 1.0; curves: LWF-Z, LWR-Z, Untrimmed AD, Untrimmed LA, LA, AD, Absolute Call.]

Page 31

Variability in estimates

[Figure: two panels, Full Model and Reduced Model, plotting log(variance) against log(mean) for the Stim, 50:50 and Starved groups.]

Page 32

Conclusions

• Model-based estimators are superior to simple averaging

• Full model superior to reduced

• this does not necessarily mean that the mismatch probes are a good idea - but if they are present we should use them

• we have demonstrated this using both analytic considerations and experimental data

• a carefully designed experiment can be used to address many issues

• Many more genes may be expressed than previously thought

Page 33

Other issues/ future work

•Spiking genes might be used to calibrate and normalize arrays

•relationship between variance and mean of expression indexes may be useful in planning experiments

•our data may be useful for future work, especially in producing indexes that are resistant to probe saturation

•all primary data, this PowerPoint presentation and a preprint are available at http://thinker.med.ohio-state.edu