transformations… what for? which one?
DESCRIPTION
Transformations… what for? which one?. Wolfgang Huber Div.Molecular Genome Analysis DKFZ Heidelberg. Microarray intensities x 1 ,…,x n. Log-ratio with/without background correction Shrunken log-ratio (BHM) Variance stabilized log-ratio (=generalized log-ratio, “glog”). - PowerPoint PPT PresentationTRANSCRIPT
Transformations…
what for?which one?
Wolfgang HuberDiv.Molecular Genome
AnalysisDKFZ Heidelberg
2 2
2 2
( ,...)log
( ,...)
log
log
i i
j j
i
j
i i i
j j j
x f xx f x
xx
x x c
x x c
Log-ratio with/without background correction
Shrunken log-ratio (BHM)
Variance stabilized log-ratio(=generalized log-ratio, “glog”)
Microarray intensities x1,…,xn
How do you like to think about (interprete) it?
How do you estimate the parameters?
What comes out (the “bottom line”?)
ratios and fold changes
Fold changes are useful to describe continuous changes in expression
10001500
3000
x3
x1.5
A B C
0200
3000
?
?
A B C
But what if the gene is “off” (below detection limit) in one condition?
ratios and fold changes
Many interesting genes will be off in some of the conditions of interest
1.If you want expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance
many values 0 need something more than
(log)ratio
2. If you let expression measure be biased (always>0)
can keep ratios. how do you choose the bias?
ratios and fold changes
Ratios are scale-free:
But there is (at least) one absolute scale in the data:
Can we use this to construct useful functions
( , ) / log( / )i j i j i jf y y y y or y y
( | ( ) 0)i ibg sd Y E Y
( , , ) ?i j bgf y y
How to compare microarray intensities with each other?
How to incorporate measurement uncertainty (“variance”)?
How to simultaneously and consistently deal with calibration (“normalization”)?
In the following:
Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection
probe purity and length distributionspotting efficiency, spot sizecross-/unspecific hybridizationstray signal
Calibration Error model
Systematic o similar effect on many measurementso corrections can be estimated from data
Stochastico too random to be ex-plicitely accounted for o remain as “noise”
iik ika a
ai per-sample offset
ik ~ N(0, bi2s1
2)
“additive noise”
bi per-sample normalization factor
bk sequence-wise probe efficiency
ik ~ N(0,s22)
“multiplicative noise”
exp( )iik k ikb b b
ik ik ik ky a b x
modeling ansatz
measured intensity = offset + gain true abundance
The two-component model
raw scale log scale
“additive” noise
“multiplicative” noise
B. Durbin, D. Rocke, JCB 2001
variance stabilizing transformations
Xu a family of random variables with
EXu=u, VarXu=v(u). Define
var f(Xu ) independent of u
1( )
v( )
x
f x duu
derivation: linear approximation
0 20000 40000 60000
8.0
8.5
9.0
9.5
10
.01
1.0
raw scale
tra
nsf
orm
ed
sca
le
variance stabilizing transformations
f(x)
x
variance stabilizing transformations
1( )
v( )
x
f x duu
1.) constant variance (‘additive’)
2( ) sv u f u
2.) constant CV (‘multiplicative’)
2( ) logv u u f u
4.) additive and multiplicative
2 2 00( ) ( ) arsinh
u uv u u u s f
s
3.) offset2
0 0( ) ( ) log( )v u u u f u u
the “glog” transformation
intensity-200 0 200 400 600 800 1000
- - - f(x) = log(x)
——— hs(x) = asinh(x/s)
2arsinh( ) log 1
arsinh log log2 0limx
x x x
x x
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
parameter estimationparameter estimation
2Yarsinh , (0, )iki
k ki kii
aN c
b
:
o maximum likelihood estimator: straightforward – but sensitive to deviations from normality
o model holds for genes that are unchanged; differentially transcribed genes act as outliers.
o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.
o works as long as <50% of genes are differentially transcribed
ii k i k i ka a L a i p e r - s a m p l e o ff s e t
L i k l o c a l b a c k g r o u n d p r o v i d e d b y i m a g e a n a l y s i s
i k ~ N ( 0 , b i2 s 1
2 )
“ a d d i t i v e n o i s e ”
b i p e r - s a m p l en o r m a l i z a t i o n f a c t o r
b k s e q u e n c e - w i s el a b e l i n g e ffi c i e n c y
i k ~ N ( 0 , s 22 )
“ m u l t i p l i c a t i v e n o i s e ”
e x p ( )ii k k i kb b b
i k i k i k i ky a b x
m e a s u r e d i n t e n s i t y = o ff s e t + g a i n * t r u e a b u n d a n c e
Least trimmed sum of squares regression
Least trimmed sum of squares regression
0 2 4 6 8
02
46
8
x
y 2n/2
( ) ( )i=1
( )i iy f x
minimize
- least sum of squares - least trimmed sum of squares
P. Rousseeuw, 1980s
evaluation: effects of different data transformations
evaluation: effects of different data transformations
diff
ere
nce r
ed
-g
reen
rank(average)
Normality: QQ-plot
evaluation: sensitivity / specificity in detecting differential abundance
evaluation: sensitivity / specificity in detecting differential abundance
o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides à 4000 genes.
o 6 different strategies for normalization and quantification of differential abundance
o Calculate for each gene & each method: t-statistics, permutation-p
o For threshold , compare the number of genes the different methods find, #{pi | pi}
evaluation: comparison of methodsevaluation: comparison of methods
more accurate quantification of differential expression higher sensitivity / specificity
more accurate quantification of differential expression higher sensitivity / specificity
one-sided test for up one-sided test for down
evaluation: a benchmark for Affymetrix genechip expression measures
o Data: Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex backgroundDilution series: from GeneLogic 60 x HGU95Av2,liver & CNS cRNA in different proportions and amounts
o Benchmark: 15 quality measures regarding-reproducibility-sensitivity -specificity Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu
affycomp results (28 Sep 2003) good
bad
ROC curves
Stratification
i
collaboration with R. Irizarry
25
1
log log ( )i ii
Y x w s
wi
position- and sequence-specific effects wi(s):Naef et al., Phys Rev E 68 (2003)
glog versus "sliding z-score"
sliding z-score
Availability
o implementation in Ro open source package
vsn on www.bioconductor.org
o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics
What to do with the gene lists: the functional genomics pipeline @
DKFZ
High-throughput transcripto
me sequencing: clones with unannotated full length
ORFs
functional characterizati
on
Neoplastic diseases
association of mRNA profiles with- genetic aberrations- histopathology- clinical behavior
HT functional assays (S. Wiemann, D. Arlt)
Library of "unknow
n" transcrip
ts expression cloneBrdU
incorporation
GFP-ORF- protein
DAPI: identification CFP: expression
BrdU: proliferation
Image segmentation
and quantification
proliferation
+activator -inhibitor
automated
microscopeRainer Pepperkok, EMBL
DAPI ORF-YFP Anti-BrdU/Cy5
Detection of modulators of cell proliferation
overlay
YFP – Cy5
Dorit Arlt
YFP channel Cy5 channel 72.0 761.0 71.0 684.1 119.7 779.0 87.3 820.2 149.5 645.6 70.2 536.1 84.7 799.5 103.1 912.8 81.0 916.7 2621.8 267.6 74.1 766.2 156.8 866.6 169.0 819.8 105.5 757.7 156.0 367.8 76.5 746.2 135.2 731.2 86.2 567.3 77.7 896.3 92.6 1095.4 104.6 633.3 481.2 567.7 539.0 663.9 95.0 726.2 156.7 842.1
Measurement of fluorescence intensities
68.5,
231.6,
80.9,
-4.8
by automated image analysis
SMPCell
activ
atio
n in
hib
ition
Statistical analysis of cellular assay data
0 50 100 150 200 250
0.0
00
0.0
02
0.0
04
0.0
06
0.0
08
0.0
10
0.0
12
brdU
transfected cells
control cells
1 2 3 4 5 6 7 8 9 10 11 12
A
B
C
D
E
F
G
H
dorit6
-6-4
-20
24
6
detect transfection
effect:
inh (p=10-8)
Plate summary plot
Cellular assays: challenges for statisticians
o Image analysis:
pattern recognition, classification
o Low-level analysis
what are good models for calibration, “normalization”, data transformation?
o High-level analysis
models for the dependence of cellular processes on over-/underexpression of genes
connect results from different assays, microarray data
Summaryo log-ratio log: what about genes that are not expressed in some of
the conditions of interest?
o generalized log-ratio h: a useful extrapolation- interpretability- sensitivity- specificity- computational convenience
o what to do with the gene lists? systematic (high throughput) functional assays
Acknowledgements
DKFZ HeidelbergMolecular Genome
Analysis
Annemarie PoustkaHolger SültmannAndreas BuneßMarkus RuschhauptKatharina FinisJörg SchneiderKlaus Steiner
Stefan WiemannDorit Arlt
MPI Molekulare GenetikAnja von HeydebreckMartin Vingron
Uni HeidelbergGünther Sawitzki
DFCI HarvardRobert Gentleman
UMC LeidenJudith Boer
RZPDAnke Schroth Bernd Korn
EMBLUrban Liebel
...and many more!
Models are never correct, but some are useful
True relationship:
1 2 22
(0, 0.15 )y x x N
Model: linear dependence
Model: quadratic dependence
raw scale log glog
difference
log-ratio
generalized
log-ratio
constant partvariance:
proportional part
variance stabilization
ratio compression
Yue et al.,
(Incyte Genomics
) NAR (2001) 29
e41