Post on 17-Jan-2016
Normalization and Analysis of DNA Microarray Data by Self-Consistency and Local Regression
Tom Kepler
Santa Fe Institute
kepler@santafe.edu
Rat mesothelioma cells: control
Rat mesothelioma cells: treated with KBrO2
Normalization
Method to be improved:
1. Assume that some genes will not change under the treatment under investigation.
2. Identify these core genes in advance of the experiment.
3. Normalize all genes against these genes, assuming they do not change.
Normalization
New Method:
1. Assume that some genes will not change under the treatment under investigation.
2. Choose these core genes arbitrarily.
3. Normalize (provisionally) all genes against these genes, assuming they do not change.
4. Determine which genes do not change under this normalization.
5. Make this set the new core. If this core differs from the previous core, go to 3. Else, done.
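The five steps above can be sketched for the simplest case of two arrays, where normalization is a single additive offset on the log scale. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `self_consistent_offset`, the random starting core, and the cutoff `z_cut` used to decide which genes count as "unchanged" are all hypothetical choices.

```python
import numpy as np

def self_consistent_offset(y1, y2, z_cut=2.0, max_iter=100, seed=0):
    """Iteratively estimate a log-scale offset between two arrays.

    y1, y2: log intensities of the same genes on two arrays.
    z_cut:  assumed cutoff (in core standard deviations) for "unchanged".
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(y2, float) - np.asarray(y1, float)
    n = diff.size
    # Step 2: choose the starting core arbitrarily (a random half of the genes).
    core = np.zeros(n, dtype=bool)
    core[rng.choice(n, size=max(2, n // 2), replace=False)] = True
    for _ in range(max_iter):
        # Step 3: normalize provisionally against the current core.
        offset = diff[core].mean()
        resid = diff - offset
        # Step 4: genes with a small normalized difference are judged unchanged.
        scale = resid[core].std(ddof=1)
        new_core = np.abs(resid) <= z_cut * scale
        # Step 5: stop when the core no longer changes (or becomes too small).
        if new_core.sum() < 2 or np.array_equal(new_core, core):
            break
        core = new_core
    return offset, core
```

Because the starting core is arbitrary, it may contain genes that really do change; the iteration is meant to push them out of the core as the offset estimate improves.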
$I = c\,[\mathrm{mRNA}]$
I = spot intensity
[mRNA] = concentration of specific mRNA
c = normalization constant
Error Model
$I = c\,[\mathrm{mRNA}]\,\epsilon$
I = spot intensity
[mRNA] = concentration of specific mRNA
c = normalization constant
$\epsilon$ = lognormal multiplicative error
Error Model
$I_{ijk} = c_{ij}\,[\mathrm{mRNA}]_{ik}\,\epsilon_{ijk}$
I = spot intensity
[mRNA] = concentration of specific mRNA
c = normalization constant
$\epsilon$ = lognormal multiplicative error
index 1, i: treatment group
index 2, j: replicate within treatment
index 3, k: spot (gene)
Error Model
$Y_{ijk} = \xi_k + \delta_{ik} + \alpha_{ij} + \epsilon_{ijk}$
$Y$ = log spot intensity
$\xi$ = mean log concentration of specific mRNA
$\delta$ = treatment effect (conc. specific mRNA)
$\alpha$ = normalization constant
$\epsilon$ = normal additive error
index 1, i: treatment group
index 2, j: replicate within treatment
index 3, k: spot (gene)
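One way to connect the multiplicative and additive forms of the error model is to take logarithms of the intensity equation. The symbols $\xi_k$, $\delta_{ik}$ and $\alpha_{ij}$ are names assumed here for the mean log concentration, the treatment effect and the normalization constant, and $\epsilon^{\mathrm{mult}}$ marks the multiplicative error to keep it apart from its logarithm.

```latex
\begin{align*}
I_{ijk} &= c_{ij}\,[\mathrm{mRNA}]_{ik}\,\epsilon^{\mathrm{mult}}_{ijk} \\
\log I_{ijk} &= \log c_{ij} + \log [\mathrm{mRNA}]_{ik} + \log \epsilon^{\mathrm{mult}}_{ijk} \\
Y_{ijk} &= \alpha_{ij} + \left(\xi_k + \delta_{ik}\right) + \epsilon_{ijk}
\end{align*}
```

Here $Y_{ijk} = \log I_{ijk}$, $\alpha_{ij} = \log c_{ij}$, $\xi_k + \delta_{ik} = \log [\mathrm{mRNA}]_{ik}$ and $\epsilon_{ijk} = \log \epsilon^{\mathrm{mult}}_{ijk}$. Since the multiplicative error is lognormal, its logarithm is normal, which is why the error becomes additive and normal on the log scale.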
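Model: $Y_{ijk} = \xi_k + \delta_{ik} + \alpha_{ij} + \epsilon_{ijk}$
Identifiability constraints: [equations not preserved in the transcript]
Estimate by ordinary least squares:
$x_k = \bar{Y}_{\cdot\cdot k} - \bar{Y}_{\cdot\cdot\cdot}$
$a_{ij} = \bar{Y}_{ij\cdot}$
$d_{ik} = \bar{Y}_{i\cdot k} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot k} + \bar{Y}_{\cdot\cdot\cdot}$
Here $x_k$, $a_{ij}$ and $d_{ik}$ estimate the mean log concentration, the normalization constant and the treatment effect, and a dot in a subscript denotes an average over that index.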
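For a balanced design, least-squares estimates of an additive model like this reduce to differences of averages. The sketch below assumes the log intensities sit in a NumPy array of shape (treatments, replicates, genes), and that the gene effects and treatment effects obey sum-to-zero constraints; both the function name `ols_estimates` and these constraints are assumptions made for illustration, not details confirmed by the slides.

```python
import numpy as np

def ols_estimates(Y):
    """Least-squares estimates for a balanced additive model.

    Y: array of shape (I treatments, J replicates, K genes), log intensities.
    Returns (x, a, d):
      x[k]    gene effect (mean log concentration, centered),
      a[i, j] normalization constant for each array,
      d[i, k] treatment effect for each gene.
    """
    Y = np.asarray(Y, float)
    grand = Y.mean()                      # average over all indices
    gene_mean = Y.mean(axis=(0, 1))       # average over i and j, one per gene
    a = Y.mean(axis=2)                    # average over genes, one per array
    x = gene_mean - grand
    trt_gene = Y.mean(axis=1)             # average over replicates, shape (I, K)
    trt = Y.mean(axis=(1, 2))             # average over j and k, one per treatment
    d = trt_gene - trt[:, None] - gene_mean[None, :] + grand
    return x, a, d
```

On noise-free data built from effects that satisfy the assumed constraints, these averages recover the effects exactly, which makes a convenient sanity check.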
But note: cannot identify between the normalization $\alpha_{ij}$ and the treatment effect $\delta_{ik}$.
Self-consistency:
The weight $w_k(\cdot)$ is small if the $k$th gene is judged to be changed, and close to one if it is judged to be unchanged.
The procedure is iterative.
[Figure: two scatter plots of log intensity, array 2, against log intensity, array 1; both axes run from -2 to 6.]
Failure of Model
Generalized Model
The normalization $\alpha_{ij}(\xi_k)$ and the heteroscedasticity function $\sigma_{ij}(\xi_k)$ are slowly varying functions of the intensity, $\xi_k$.
Estimate by Local Regression
Data
Local Regression
Predict the value at x = 50: weight nearby points, then fit a linear regression.
Predict the whole function similarly.
Compare to the known true function.
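The local regression steps above can be sketched as a weighted straight-line fit around each prediction point. This is a minimal illustration: the tricube weight function, the nearest-neighbor bandwidth `frac`, and the function name `local_linear` are common choices assumed here, and the slides do not say which weights or bandwidth were actually used.

```python
import numpy as np

def local_linear(x, y, x0, frac=0.3):
    """Predict y at x0 by a weighted linear fit to nearby points.

    frac: assumed fraction of the data used as the local neighborhood.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    k = max(3, int(np.ceil(frac * x.size)))
    dist = np.abs(x - x0)
    h = np.sort(dist)[k - 1]              # distance to the k-th nearest point
    if h == 0.0:
        h = 1.0                           # guard against a degenerate bandwidth
    u = dist / h
    # Tricube weights: near one close to x0, zero outside the neighborhood.
    w = np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)
    # Weighted least squares for a line centered at x0; the intercept
    # is then the fitted value at x0.
    X = np.column_stack([np.ones_like(x), x - x0])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]

# Predicting the whole function means repeating the fit at every grid point:
# curve = np.array([local_linear(x, y, x0) for x0 in grid])
```

Comparing such a curve with a known true function, as the slides do, shows how well the chosen bandwidth tracks it.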
Simulation-based Validation
1. Reproduce observed bias.
Simulation-based Validation
2. Reproduce observed heteroscedasticity.
Test based on z statistic:
$z_k = \dfrac{d_{2k} - d_{1k}}{s_k \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
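A small worked example may help make the test concrete. It assumes the statistic is the difference between the two treatment effects for gene k, divided by the gene's standard deviation times sqrt(1/n1 + 1/n2), with n1 and n2 the replicate counts; that reading of the formula, and the helper names below, are assumptions. The two-sided p-value uses the standard normal distribution.

```python
import math

def z_statistic(d1, d2, s, n1, n2):
    """z for one gene: difference in treatment effects over its standard error."""
    return (d2 - d1) / (s * math.sqrt(1.0 / n1 + 1.0 / n2))

def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Example: effects 0.0 and 1.0, standard deviation 1.0, two replicates each.
z = z_statistic(0.0, 1.0, 1.0, 2, 2)
p = two_sided_p(z)
```

A gene is called changed when its p-value falls below the chosen significance level.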
Choice of significance level $\alpha$: expected number of false positives
E(false positives) = N $\alpha$
But the minimum detectable difference increases as $\alpha$ gets smaller.

$\alpha$   E(fp)   min diff   min ratio
0.05       250     0.916      2.5
0.01       50      1.09       3
0.001      5       1.29       3.6
0.0001     0.5     1.61       5
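The arithmetic behind the table can be checked in a few lines. Two things are inferred rather than stated in the slides: the number of spots, N = 5000, follows from 250 expected false positives at a significance level of 0.05, and the "min ratio" column appears to be the exponential of the "min diff" column, which would mean the differences are on the natural-log scale.

```python
import math

N_SPOTS = 5000  # assumption: inferred from 250 false positives at level 0.05

def expected_false_positives(alpha, n_spots=N_SPOTS):
    """Expected number of false positives when every spot is tested at level alpha."""
    return alpha * n_spots

def min_ratio(min_diff):
    """Fold change corresponding to a natural-log difference (assumed scale)."""
    return math.exp(min_diff)

# (significance level, minimum detectable log difference) from the table
rows = [(0.05, 0.916), (0.01, 1.09), (0.001, 1.29), (0.0001, 1.61)]
table = [(a, expected_false_positives(a), d, round(min_ratio(d), 1)) for a, d in rows]
```

The trade-off is visible directly: moving from 0.05 to 0.0001 cuts the expected false positives from 250 to 0.5, but the smallest detectable change grows from about 2.5-fold to about 5-fold.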
Validation of method against simulated data
3. Hypothesis testing: simulated from the stated model
[Figure: panels labeled "Proportion changed spots", "-fold change" and "bias".]
“rate false pos.” = mean observed / expected
Simulated data: mis-specified model — multiplicative + additive noise
Validation of method against simulated data
4. Hypothesis testing: simulated from the "wrong" model, with additive + multiplicative noise.
[Figure: panels labeled "Proportion changed spots", "-fold change" and "bias".]
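To illustrate what "additive + multiplicative noise" could look like, the sketch below draws intensities as a lognormal multiplicative error times the true signal, plus a normal additive term. The function name `simulate_intensities` and the noise levels `sigma_mult` and `sigma_add` are hypothetical; the slides do not give the parameters of the actual simulation.

```python
import numpy as np

def simulate_intensities(true_conc, c=1.0, sigma_mult=0.1, sigma_add=5.0, rng=None):
    """Simulate spot intensities with multiplicative and additive noise.

    true_conc:  true mRNA concentrations (array-like).
    c:          normalization constant.
    sigma_mult: log-scale standard deviation of the multiplicative error (assumed).
    sigma_add:  standard deviation of the additive error (assumed).
    """
    rng = np.random.default_rng() if rng is None else rng
    true_conc = np.asarray(true_conc, float)
    mult = rng.lognormal(mean=0.0, sigma=sigma_mult, size=true_conc.shape)
    add = rng.normal(loc=0.0, scale=sigma_add, size=true_conc.shape)
    return c * true_conc * mult + add

# Usage: compare the log-scale spread of weak and strong spots.
rng = np.random.default_rng(0)
low = simulate_intensities(np.full(5000, 50.0), rng=rng)
high = simulate_intensities(np.full(5000, 5000.0), rng=rng)
```

Under this kind of model the additive term matters most for weak spots, so the spread of log intensity is larger at low intensity than at high intensity, which a purely multiplicative model would not produce.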
Acknowledgments
Lynn Crosby, North Carolina State University
Kevin Morgan, Strategic Toxicological Sciences, GlaxoWellcome
Santa Fe Institute
www.santafe.edu
Postdoctoral fellowships available (apply before the end of the year)
kepler@santafe.edu