transformations… what for? which one?

Transformations…

what for?which one?

Wolfgang HuberDiv.Molecular Genome

AnalysisDKFZ Heidelberg

2 2

2 2

( ,...)log

( ,...)

log

log

i i

j j

i

j

i i i

j j j

x f xx f x

xx

x x c

x x c

Log-ratio with/without background correction

Shrunken log-ratio (BHM)

Variance stabilized log-ratio(=generalized log-ratio, “glog”)

Microarray intensities x1,…,xn

How do you like to think about (interprete) it?

How do you estimate the parameters?

What comes out (the “bottom line”?)

ratios and fold changes

Fold changes are useful to describe continuous changes in expression

10001500

3000

x3

x1.5

A B C

0200

3000

?

?

A B C

But what if the gene is “off” (below detection limit) in one condition?


Many interesting genes will be off in some of the conditions of interest

1.If you want expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance

many values 0 need something more than

(log)ratio

2. If you let expression measure be biased (always>0)

can keep ratios. how do you choose the bias?


Ratios are scale-free:

But there is (at least) one absolute scale in the data:

Can we use this to construct useful functions

( , ) / log( / )i j i j i jf y y y y or y y

( | ( ) 0)i ibg sd Y E Y

( , , ) ?i j bgf y y

How to compare microarray intensities with each other?

How to incorporate measurement uncertainty (“variance”)?

How to simultaneously and consistently deal with calibration (“normalization”)?

In the following:

Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection

probe purity and length distributionspotting efficiency, spot sizecross-/unspecific hybridizationstray signal

Calibration Error model

Systematic o similar effect on many measurementso corrections can be estimated from data

Stochastico too random to be ex-plicitely accounted for o remain as “noise”

iik ika a

ai per-sample offset

ik ~ N(0, bi2s1

2)

“additive noise”

bi per-sample normalization factor

bk sequence-wise probe efficiency

ik ~ N(0,s22)

“multiplicative noise”

exp( )iik k ikb b b

ik ik ik ky a b x

modeling ansatz

measured intensity = offset + gain true abundance

The two-component model

raw scale log scale

“additive” noise

“multiplicative” noise

B. Durbin, D. Rocke, JCB 2001

variance stabilizing transformations

Xu a family of random variables with

EXu=u, VarXu=v(u). Define

var f(Xu ) independent of u

1( )

v( )

x

f x duu

derivation: linear approximation

0 20000 40000 60000

8.0

8.5

9.0

9.5

10

.01

1.0

raw scale

tra

nsf

orm

ed

sca

le


f(x)

x


1( )

v( )

x

f x duu

1.) constant variance (‘additive’)

2( ) sv u f u

2.) constant CV (‘multiplicative’)

2( ) logv u u f u

4.) additive and multiplicative

2 2 00( ) ( ) arsinh

u uv u u u s f

s

3.) offset2

0 0( ) ( ) log( )v u u u f u u

the “glog” transformation

intensity-200 0 200 400 600 800 1000

- - - f(x) = log(x)

——— hs(x) = asinh(x/s)

2arsinh( ) log 1

arsinh log log2 0limx

x x x

x x

P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

parameter estimationparameter estimation

2Yarsinh , (0, )iki

k ki kii

aN c

b

:

o maximum likelihood estimator: straightforward – but sensitive to deviations from normality

o model holds for genes that are unchanged; differentially transcribed genes act as outliers.

o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.

o works as long as <50% of genes are differentially transcribed

ii k i k i ka a L a i p e r - s a m p l e o ff s e t

L i k l o c a l b a c k g r o u n d p r o v i d e d b y i m a g e a n a l y s i s

i k ~ N ( 0 , b i2 s 1

2 )

“ a d d i t i v e n o i s e ”

b i p e r - s a m p l en o r m a l i z a t i o n f a c t o r

b k s e q u e n c e - w i s el a b e l i n g e ffi c i e n c y

i k ~ N ( 0 , s 22 )

“ m u l t i p l i c a t i v e n o i s e ”

e x p ( )ii k k i kb b b

i k i k i k i ky a b x

m e a s u r e d i n t e n s i t y = o ff s e t + g a i n * t r u e a b u n d a n c e

Least trimmed sum of squares regression

Least trimmed sum of squares regression

0 2 4 6 8

02

46

8

x

y 2n/2

( ) ( )i=1

( )i iy f x

minimize

- least sum of squares - least trimmed sum of squares

P. Rousseeuw, 1980s

evaluation: effects of different data transformations

evaluation: effects of different data transformations

diff

ere

nce r

ed

-g

reen

rank(average)

Normality: QQ-plot

evaluation: sensitivity / specificity in detecting differential abundance

evaluation: sensitivity / specificity in detecting differential abundance

o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides à 4000 genes.

o 6 different strategies for normalization and quantification of differential abundance

o Calculate for each gene & each method: t-statistics, permutation-p

o For threshold , compare the number of genes the different methods find, #{pi | pi}

evaluation: comparison of methodsevaluation: comparison of methods

more accurate quantification of differential expression higher sensitivity / specificity

more accurate quantification of differential expression higher sensitivity / specificity

one-sided test for up one-sided test for down

evaluation: a benchmark for Affymetrix genechip expression measures

o Data: Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex backgroundDilution series: from GeneLogic 60 x HGU95Av2,liver & CNS cRNA in different proportions and amounts

o Benchmark: 15 quality measures regarding-reproducibility-sensitivity -specificity Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu

affycomp results (28 Sep 2003) good

bad

ROC curves

Stratification

i

collaboration with R. Irizarry

25

1

log log ( )i ii

Y x w s

wi

position- and sequence-specific effects wi(s):Naef et al., Phys Rev E 68 (2003)

glog versus "sliding z-score"

sliding z-score

Availability

o implementation in Ro open source package

vsn on www.bioconductor.org

o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics

What to do with the gene lists: the functional genomics pipeline @

DKFZ

High-throughput transcripto

me sequencing: clones with unannotated full length

ORFs

functional characterizati

on

Neoplastic diseases

association of mRNA profiles with- genetic aberrations- histopathology- clinical behavior

HT functional assays (S. Wiemann, D. Arlt)

Library of "unknow

n" transcrip

ts expression cloneBrdU

incorporation

GFP-ORF- protein

DAPI: identification CFP: expression

BrdU: proliferation

Image segmentation

and quantification

proliferation

+activator -inhibitor

automated

microscopeRainer Pepperkok, EMBL

DAPI ORF-YFP Anti-BrdU/Cy5

Detection of modulators of cell proliferation

overlay

YFP – Cy5

Dorit Arlt

YFP channel Cy5 channel 72.0 761.0 71.0 684.1 119.7 779.0 87.3 820.2 149.5 645.6 70.2 536.1 84.7 799.5 103.1 912.8 81.0 916.7 2621.8 267.6 74.1 766.2 156.8 866.6 169.0 819.8 105.5 757.7 156.0 367.8 76.5 746.2 135.2 731.2 86.2 567.3 77.7 896.3 92.6 1095.4 104.6 633.3 481.2 567.7 539.0 663.9 95.0 726.2 156.7 842.1

Measurement of fluorescence intensities

68.5,

231.6,

80.9,

-4.8

by automated image analysis

SMPCell

http://lui.embl-heidelberg.de/hcs-preview/dorit6_1/08-pdYFPhamy2_1p4_1E2Y1MW%20W08P03SL00Q5F0.jpg



http://lui.embl-heidelberg.de/hcs-preview/dorit6_1/RGB/08-pdYFPhamy2_1p4_1E2Y1MW%20W08P03SL00Q5F0-rgb.jpg

activ

atio

n in

hib

ition

Statistical analysis of cellular assay data

0 50 100 150 200 250

0.0

00

0.0

02

0.0

04

0.0

06

0.0

08

0.0

10

0.0

12

brdU

transfected cells

control cells

1 2 3 4 5 6 7 8 9 10 11 12

A

B

C

D

E

F

G

H

dorit6

-6-4

-20

24

6

detect transfection

effect:

inh (p=10-8)

Plate summary plot

Cellular assays: challenges for statisticians

o Image analysis:

pattern recognition, classification

o Low-level analysis

what are good models for calibration, “normalization”, data transformation?

o High-level analysis

models for the dependence of cellular processes on over-/underexpression of genes

connect results from different assays, microarray data

Summaryo log-ratio log: what about genes that are not expressed in some of

the conditions of interest?

o generalized log-ratio h: a useful extrapolation- interpretability- sensitivity- specificity- computational convenience

o what to do with the gene lists? systematic (high throughput) functional assays

Acknowledgements

DKFZ HeidelbergMolecular Genome

Analysis

Annemarie PoustkaHolger SültmannAndreas BuneßMarkus RuschhauptKatharina FinisJörg SchneiderKlaus Steiner

Stefan WiemannDorit Arlt

MPI Molekulare GenetikAnja von HeydebreckMartin Vingron

Uni HeidelbergGünther Sawitzki

DFCI HarvardRobert Gentleman

UMC LeidenJudith Boer

RZPDAnke Schroth Bernd Korn

EMBLUrban Liebel

...and many more!

Models are never correct, but some are useful

True relationship:

1 2 22

(0, 0.15 )y x x N

Model: linear dependence

Model: quadratic dependence

raw scale log glog

difference

log-ratio

generalized

log-ratio

constant partvariance:

proportional part

variance stabilization

ratio compression

Yue et al.,

(Incyte Genomics

) NAR (2001) 29

e41

transformations… what for? which one?

Documents

expression ratios

number of genes

differential abundanceo

generalized logratio

transcribed genes act

trimmed sum of squaresp

transformationsfxx variance

different strategies