
1

Statistical methods for interpreting microarray data

Terry Speed
Department of Statistics, UC Berkeley
Walter & Eliza Hall Institute of Medical Research

Workshop on Molecular and Statistical Genomic Epidemiology
Paris, May 9-11, 2005

2

My plan today: two illustrations (of Statistical methods for…)

Low level analysis: calling genotypes from Affymetrix SNP chip data.

Similar projects are underway for analysing chip data for DNA copy number determination, DNA resequencing, whole genome tiling arrays for global expression and ChIP-chip studies, and whole genome exon arrays for exon and gene-level expression. Also QA/QC.

Higher level analysis: one experiment to identify genes involved in the host response to Leishmania major, not atypical of the special experiments we do.

3

No time to mention today

Middle level analysis, many examples, e.g. the summarization and ranking of genes using microarray time course data.

4

The Affymetrix SNP Chip

Array: 1.28 cm × 1.28 cm, > 100,000 features / array

Feature: 8 µm × 8 µm, > 1 million identical 25 bp probes / feature

5

Affymetrix SNP chip terminology

Genomic DNA (N marks a SNP):       TAGCCATCGGTANGTACTCAATGAT

Perfect Match probe for Allele A:  ATCGGTAGCCATTCATGAGTTACTA

Perfect Match probe for Allele B:  ATCGGTAGCCATCCATGAGTTACTA

6

Affymetrix SNP probe tiling strategy, 1

[Figure: central probe quartet. For a SNP (A / G) at position 0, four probes are centred on the SNP: a PM and an MM probe for allele A, and a PM and an MM probe for allele B.]

7

Affymetrix SNP probe tiling strategy, 2

[Figure: +4 offset probe quartet. For the same SNP (A / G), the four probes are shifted so that the SNP sits at position +4: a PM and an MM probe for allele A, and a PM and an MM probe for allele B.]

8

Affymetrix SNP probe tiling strategy, 3

[Figure: probe layout for one strand. Seven quartets, numbered 7 to 1, each holding a PM A, MM A, PM B and MM B probe: the central quartet with offset quartets on either side.]

Repeated on the opposite strand: 56 probes in all. More recently, 40: just 4 offset quartets instead of 6.
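The probe counts quoted above follow from simple arithmetic. The sketch below is only an illustration of that count, assuming one central quartet plus the stated number of offset quartets, four probes per quartet, and both strands tiled; the function name is hypothetical.

```python
PROBES_PER_QUARTET = 4  # PM A, MM A, PM B, MM B
STRANDS = 2             # each quartet is repeated on the opposite strand


def probes_per_snp(offset_quartets: int) -> int:
    """Count probes for one SNP: central quartet plus offset quartets, on both strands."""
    quartets = 1 + offset_quartets
    return quartets * PROBES_PER_QUARTET * STRANDS


print(probes_per_snp(6))  # 56
print(probes_per_snp(4))  # 40
```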

9

Affymetrix SNP identification

Fake (idealized) image for 3 samples on one SNP: AA, AB, BB

The current vendor-supplied genotype-calling algorithm, DM, seeks the best-fitting pattern of the above kind, including no call (NC). It is a mix of normal likelihood-based model selection and a Wilcoxon test. There is no training, and it is single chip.

10

DM (no NCs) vs HapMap

DM \ HapMap         AA         AB         BB        NC
AA             339,502      1,249          9     1,420
AB                 457    355,168        544     1,745
BB                  25      1,132    327,415     1,452

11,446 SNPs, 90 samples
99.67% concordance (both called)
3,416 discordant calls
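The concordance figure can be re-derived from the table. The sketch below assumes that concordance means the diagonal total divided by the total over the AA/AB/BB cells (the NC column excluded, i.e. "both called"); the variable and function names are illustrative only. The result does not depend on which method indexes the rows.

```python
import numpy as np

# 3 x 3 block of the DM vs HapMap table (NC column left out), order AA, AB, BB.
dm_vs_hapmap = np.array([
    [339_502,   1_249,       9],
    [    457, 355_168,     544],
    [     25,   1_132, 327_415],
])


def concordance(table):
    """Return (fraction of agreeing calls, number of discordant calls)."""
    table = np.asarray(table)
    both_called = int(table.sum())
    agree = int(np.trace(table))
    return agree / both_called, both_called - agree


rate, discordant = concordance(dm_vs_hapmap)
print(f"{rate:.2%} concordance, {discordant} discordant calls")
# 99.67% concordance, 3416 discordant calls
```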

11

Why attempt an improvement over DM?

• Perhaps the error rate is too high?

• There is reason to believe it can be improved by
a) using the training/test set paradigm;
b) carrying out multi-chip analyses, which identify and exploit probe behaviour; and
c) exploiting the massive parallelism across SNPs.

• The 100K SNPs were selected from a much larger screening set using DM. For the 500K and >1M SNP chips, a higher yield is desirable, and perhaps a better genotype-calling algorithm could achieve this.

12

Robust Linear Model with the Mahalanobis distance classifier

• RLMM, pronounced “REALM”
• Based on an RMA-like model
– Uses PM only
– Linear additive multi-chip model on log scale
– A- and B-probe and chip effects
– Robustly estimated parameters
• Classification using Mahalanobis’ distance

13

RLMM: single SNP, multi-chip model

For SNP $n$ we fit the following models for the A- and B-probes to the quantile-normalized PM values $y^{(n)}_{i,A,j}$ and $y^{(n)}_{i,B,j}$:

$$\log_2\big(y^{(n)}_{i,A,j}\big) = \theta^{(n)}_{A,i} + \beta^{(n)}_{A,j} + \varepsilon_{ij},$$

$$\log_2\big(y^{(n)}_{i,B,j}\big) = \theta^{(n)}_{B,i} + \beta^{(n)}_{B,j} + \varepsilon_{ij},$$

where $\theta^{(n)}_{A,i}$ and $\theta^{(n)}_{B,i}$ are the A- and B-effects for sample $i$, and $\beta^{(n)}_{A,j}$ and $\beta^{(n)}_{B,j}$ are the relative probe affinities, subject to $\sum_j \beta^{(n)}_{A,j} = \sum_j \beta^{(n)}_{B,j} = 0$.

As errors are likely to be contaminated due to outlier probes, we use a robust linear model to estimate the parameters.
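To make the additive model concrete, here is a minimal sketch that fits sample effects and probe affinities for one allele of one SNP. It uses median polish, the robust fit familiar from RMA, as a stand-in; the slides do not say which robust estimator RLMM uses, so the estimator, the function name and the toy data are all assumptions.

```python
import numpy as np


def fit_additive_median_polish(log_pm, n_iter=10):
    """Fit log_pm[i, j] ~ theta[i] + beta[j] by median polish.

    log_pm: samples x probes array of log2 PM intensities for one allele.
    Returns (theta, beta) with beta summing to zero, as in the model above.
    """
    resid = np.array(log_pm, dtype=float)
    theta = np.zeros(resid.shape[0])
    beta = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)   # sweep out sample effects
        theta += row_med
        resid -= row_med[:, None]
        col_med = np.median(resid, axis=0)   # sweep out probe effects
        beta += col_med
        resid -= col_med[None, :]
    shift = beta.mean()                      # impose the sum-to-zero constraint
    return theta + shift, beta - shift


# Toy data: an exactly additive table with one grossly outlying probe value.
true_theta = np.array([10.0, 11.0, 12.0, 9.0, 10.5])
true_beta = np.array([-0.5, 0.0, 0.5, 0.2, -0.2])
table = true_theta[:, None] + true_beta[None, :]
table[0, 2] += 5.0                           # outlier probe on sample 0
theta_hat, beta_hat = fit_additive_median_polish(table)
print(theta_hat, beta_hat)
```

In this toy case the single outlier ends up in the residuals rather than in the estimates, which is the reason for preferring a robust fit to least squares.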

14

RLMM: outline of the algorithm

1. Quantile normalize PM intensities across chips.
2. For SNP $n$, obtain estimates of $(\theta^{(n)}_{A,i}, \theta^{(n)}_{B,i})$ for each sample $i$ in the training set using the previous model.
3. Estimate the mean vectors $(\mu^{(n)}_{AA}, \mu^{(n)}_{AB}, \mu^{(n)}_{BB})$ and covariance matrices $(\Sigma^{(n)}_{AA}, \Sigma^{(n)}_{AB}, \Sigma^{(n)}_{BB})$ of the 2-dimensional vectors $(\theta^{(n)}_{A,i}, \theta^{(n)}_{B,i})$ using samples from the AA, AB and BB groups in the training set.
4. Obtain estimates $(\theta^{(n)}_{A,i}, \theta^{(n)}_{B,i})$ for each sample $i$ in the test set.
5. Classify each sample in the test set to the genotype group closest to it in Mahalanobis’ distance.
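Step 1 of the outline can be sketched as follows. This is the textbook form of quantile normalization (every chip is given the same empirical distribution, namely the mean of the sorted columns); whether RLMM handles ties or missing values differently is not stated in the slides, and the function name is illustrative.

```python
import numpy as np


def quantile_normalize(x):
    """Quantile normalize a probes x chips matrix.

    Each column keeps its own ordering of probes but takes its values from a
    common reference distribution (the row means of the column-sorted matrix).
    Ties within a column are broken arbitrarily in this sketch.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each entry within its column
    reference = np.sort(x, axis=0).mean(axis=1)  # common target distribution
    return reference[ranks]


pm = np.array([[5.0, 4.0, 3.0],
               [2.0, 1.0, 4.5],
               [3.0, 4.2, 6.0],
               [4.0, 2.0, 8.0]])
print(quantile_normalize(pm))
```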

15

Mahalanobis’ distance

Introduced by P.C. Mahalanobis in 1936.

A Euclidean-type metric which takes into account the variances and covariances, here $\Sigma_g$, between the components $\theta_A$ and $\theta_B$ of $\theta = (\theta_A, \theta_B)$:

$$D_g^2(\theta) = (\theta - \mu_g)' \, \Sigma_g^{-1} \, (\theta - \mu_g),$$

where $D_g^2(\theta)$ is the generalized squared distance of the $\theta$ vector from the mean $\mu_g$ of genotype group $g$ = AA, AB or BB. We choose the $g$ with smallest $D_g^2(\theta)$.

Note: we are not using ^’s to designate estimates, trusting to context.
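The classification rule above can be written in a few lines. The sketch below estimates a mean and covariance per genotype group from labelled training points and assigns a new point to the group with the smallest squared Mahalanobis distance. The function names and the toy $(\theta_A, \theta_B)$ values are made up for illustration and are not taken from the slides.

```python
import numpy as np


def fit_groups(theta, labels):
    """Estimate (mean, covariance) of the 2-d theta vectors for each genotype."""
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    groups = {}
    for g in np.unique(labels):
        pts = theta[labels == g]
        groups[str(g)] = (pts.mean(axis=0), np.cov(pts, rowvar=False))
    return groups


def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)' sigma^{-1} (x - mu)."""
    d = np.asarray(x, dtype=float) - mu
    return float(d @ np.linalg.solve(sigma, d))


def classify(x, groups):
    """Return the genotype whose group is closest to x."""
    return min(groups, key=lambda g: mahalanobis_sq(x, *groups[g]))


# Toy training set: (theta_A, theta_B) pairs with known genotypes.
train = [(11.0, 8.0), (11.5, 8.2), (10.8, 8.4), (11.2, 7.9),
         (10.0, 10.0), (10.3, 9.8), (9.8, 10.3), (10.1, 10.2),
         (8.0, 11.0), (8.3, 11.4), (7.9, 10.8), (8.2, 11.1)]
labels = ["AA"] * 4 + ["AB"] * 4 + ["BB"] * 4
groups = fit_groups(train, labels)
print(classify((11.1, 8.1), groups), classify((8.1, 11.0), groups))
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse covariance matrix explicitly; each group needs enough non-collinear training points for its covariance to be invertible.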

16

From raw intensities to θ values: AA

SNP 5 data from 13 AA samples (horizontally)

[Figure: intensity image with probe rows PMA+, PMA-, PMB+, PMB-. Annotations mark a relatively low (high) intensity probe and a relatively dim chip.]

17

From raw intensities to θ values: AB

SNP 5 data from 39 AB samples

[Figure: intensity image with probe rows PMA+, PMA-, PMB+, PMB-.]

18

From raw intensities to θ values: BB

SNP 5 data from 75 BB samples

[Figure: intensity image with probe rows PMA+, PMA-, PMB+, PMB-.]

19

SNP 5: θ- and residual plots

[Figure: θ plot with the AA, AB and BB clusters, and the corresponding residual plot.]

Every sample has its (θA, θB) pair: plot them! Do likewise for the residuals in the fitted model.

Residuals are useful for QC; here skewed because of + strand failure.

Similar plots are used by AB, Chemicon and Illumina.

New sample points are assigned to the “closest” genotype.

20

SNP 200655: θ- and residual plots

A more satisfactory SNP’s plots.

21

A-1706313 (DM NoCalls=10%) A-1659973 (Nocalls=23%)

A-1726964 (Nocalls=19%) A-1657538 (DM NoCalls = 6%)

Here are four SNPs with some harder calls: the genotype groups are closer together and internally more straggly.

The DM default makes NCs on these.

22

Empirical Bayes Multi-SNP model

Averaging of genotype centers $(\mu^{(n)}_{AA}, \mu^{(n)}_{AB}, \mu^{(n)}_{BB})$ and covariance matrices $(\Sigma^{(n)}_{AA}, \Sigma^{(n)}_{AB}, \Sigma^{(n)}_{BB})$ across SNPs $n$ leads to

• empirically estimated conjugate Gaussian prior,
• giving prior estimates of genotype means and covariance matrices for all SNPs,
• which when combined with the data for a particular SNP, gives
• better estimates of genotype group means and covariance matrices, and hence better genotypic assignments for that SNP.

Main benefit: better genotype prediction when there are few or no training samples with a given genotype.
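The shrinkage idea can be illustrated for the group centers alone. The sketch below uses the standard conjugate-normal posterior mean, a weighted average of a prior center (here, the average center across SNPs) and the sample mean for the SNP at hand. The prior weight `kappa0` and the function names are hypothetical; the slides do not give the exact prior or how covariance matrices are combined, so this is not the authors' formula.

```python
import numpy as np


def prior_center(centers_across_snps):
    """Prior mean for one genotype group: average of its centers over SNPs."""
    return np.mean(np.asarray(centers_across_snps, dtype=float), axis=0)


def shrink_center(samples, mu0, kappa0=5.0):
    """Posterior mean of a group center under a Gaussian prior centred at mu0.

    samples: training (theta_A, theta_B) pairs for this genotype at this SNP.
    kappa0:  hypothetical prior weight, in units of pseudo-observations.
    With no training samples the prior center is returned unchanged.
    """
    mu0 = np.asarray(mu0, dtype=float)
    samples = np.asarray(samples, dtype=float).reshape(-1, 2)
    n = len(samples)
    if n == 0:
        return mu0
    return (kappa0 * mu0 + n * samples.mean(axis=0)) / (kappa0 + n)


mu0 = prior_center([(11.0, 8.0), (11.4, 8.2), (10.6, 7.8)])
print(shrink_center([], mu0))                       # no AA training samples
print(shrink_center([(12.0, 9.0)], mu0, kappa0=1))  # one AA training sample
```

The first call shows the main benefit named above: a genotype with no training samples at a SNP still receives a usable center.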

23

RLMM (no NCs) vs HapMap

RLMM \ HapMap       AA         AB         BB        NC
AA             339,756        476         12     1,440
AB                 196    356,575        184     1,699
BB                  32        498    327,772     1,478

11,446 SNPs, 90 samples, LOOCV
99.86% concordance (both called)
1,398 discordant calls

24

Availability

A version of RLMM will go into the open source R-based Bioconductor package before the end of this summer.

25

Leishmaniasis

26

[Figure panels: BALB/c, C57BL/6]

27

L. major response loci in mice

• lmr1 Chromosome 17
– MHC region
– BALB/c susceptible

• lmr2 Chromosome 9
– BALB/c susceptible

• lmr3 X Chromosome
– C57BL/6 susceptible in the presence of BALB/c homozygosity at lmr1

28

lmr1, lmr2, and lmr3 affect the course of disease

[Figure: average lesion score (0 to 5) against week post infection (2 to 12) for B/c.lmr3, BALB/c, B/c.lmr1 and B/c.lmr2. * p < 0.05]

29

Compound congenics

C.lmr1/2
• BALB/c background
• lmr1 and lmr2 from C57BL/6
• Predict: more resistant than BALB/c

B6.lmr1/2
• C57BL/6 background
• lmr1 and lmr2 from BALB/c
• Predict: more susceptible than C57BL/6

30

Course of infection in strains congenic for lmr loci

[Figure: average lesion score (0 to 5) against weeks post infection (2 to 12) for B/c.lmr3, BALB/c, B/c.lmr1, B/c.lmr2, B/c.lmr1/2, B6.lmr1/2, B6.lmr1, B6.lmr2 and C57BL/6.]

31

Summary of challenge infections

• All three loci confirmed to play a role in response to L. major infection

• Having all three resistance alleles (C.lmr1/2/3) or all three susceptibility alleles (B6.lmr1/2/3) does NOT recapitulate the parental phenotype in every mouse

• There are possibly other genes involved

32

Infected macrophages

C57BL/6 B6.lmr1/2

33

Design of microarray experiment

[Figure: design diagram with eight boxes.]

C57BL/6 uninfected
B6.lmr1/2 uninfected
C57BL/6 infected
B6.lmr1/2 infected
BALB/c uninfected
B/c.lmr1/2 uninfected
BALB/c infected
B/c.lmr1/2 infected

Boxes indicate bone marrow derived macrophage samples arrayed on Affymetrix chips; red arrows indicate comparisons of interest.

34

Uninfected B6.lmr1/2 vs uninfected C57BL/6

• 83 genes t* > 5
– Antigen presentation
– Receptors
– Cell surface
– Chemokines
– Inflammatory response
– Cytoskeleton
– Extracellular matrix
• 9 genes in C57BL/6
– Cell cycle
– Mitochondrial
– Signal transduction
– Transcription factors

*Analysis carried out with RMA and limma, t here denoting moderated Student t-statistic; qq-plots also used.
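For readers unfamiliar with the moderated t-statistic, the sketch below shows its general form for a two-group comparison: the gene's own variance is shrunk towards a prior variance before the usual t ratio is formed. In limma the prior variance and prior degrees of freedom are estimated from all genes; here they are simply passed in as assumed inputs (`s0_2`, `d0`), and the function name is illustrative.

```python
import math


def moderated_t(mean_diff, s2, df, n1, n2, s0_2, d0):
    """Two-group moderated t-statistic.

    mean_diff: difference in mean log expression between the groups.
    s2, df:    the gene's residual variance and its degrees of freedom.
    s0_2, d0:  prior variance and prior degrees of freedom (assumed given).
    """
    s2_post = (d0 * s0_2 + df * s2) / (d0 + df)  # shrunken variance
    return mean_diff / math.sqrt(s2_post * (1 / n1 + 1 / n2))


# A gene with an unusually small sample variance: the ordinary t (d0 = 0)
# is large, while the moderated t is pulled back by the prior variance.
ordinary = moderated_t(1.0, 0.01, df=4, n1=3, n2=3, s0_2=0.25, d0=0)
moderated = moderated_t(1.0, 0.01, df=4, n1=3, n2=3, s0_2=0.25, d0=4)
print(ordinary, moderated)
```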

35

Genes differently differentially expressed*

• Over 20 genes common to both arms of the experiment
• Some immunological genes and others
• Genes involved in tissue remodelling, wound repair and extracellular matrix deposition
– Metalloproteinases
– Cytokines involved in extracellular matrix deposition
– Collagens

Hypothesis: wound repair is important

*Again analysis done in limma, this time a 2×2 factorial analysis.
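One reading of "differently differentially expressed" in a 2×2 factorial design (strain × infection) is a nonzero interaction: the infection response differs between the congenic and parental strains. The sketch below fits that interaction for a single gene by ordinary least squares. It only illustrates the contrast; it is not the limma analysis itself, and the function names and toy values are hypothetical.

```python
import numpy as np


def design_2x2(congenic, infected):
    """Design matrix: intercept, strain, infection, strain x infection."""
    congenic = np.asarray(congenic, dtype=float)
    infected = np.asarray(infected, dtype=float)
    return np.column_stack([np.ones_like(congenic), congenic, infected,
                            congenic * infected])


def interaction_effect(expr, congenic, infected):
    """Least-squares estimate of the strain x infection interaction."""
    X = design_2x2(congenic, infected)
    coef, *_ = np.linalg.lstsq(X, np.asarray(expr, dtype=float), rcond=None)
    return coef[3]


# Toy gene, two arrays per cell: infection raises expression by 1 in the
# parental strain and by 3 in the congenic strain.
congenic = [0, 0, 0, 0, 1, 1, 1, 1]
infected = [0, 0, 1, 1, 0, 0, 1, 1]
expr = [5.0, 5.0, 6.0, 6.0, 5.0, 5.0, 8.0, 8.0]
print(interaction_effect(expr, congenic, infected))
```

The interaction coefficient equals (infected minus uninfected in the congenic) minus (infected minus uninfected in the parental), which is the quantity a 2×2 factorial analysis tests.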

36

Is a lesion a wound which fails to heal?

[Figure: rate of wound healing. Lesion size (mm, 0 to 2.5) against days (0 to 11) for BALB/c, C.lmr1/2, C57BL/6 and B6.lmr1/2.]

38

Collagen bundles in congenics

[Figure: panels for C57BL/6, B6.Clmr1/2, C.B6lmr1/2 and BALB/c; rows show uninfected punch biopsies and L. major infected tissue.]

39

Conclusions

• lmr1, lmr2 and lmr3 affect progression of disease
• Expression of Th1/Th2 cytokines is not mediated by lmr1, lmr2, or lmr3 loci at any time during infection (not shown)
• Early difference in cytokine response not seen (not shown)
• Microarray analysis of macrophages has identified genes involved in wound healing as being important.
• Wound healing experiments show that collagen deposition is indeed different between congenics and parentals.

40

Acknowledgements

Nusrat Rabbee, UCB

Simon Cawley, Affymetrix

Simon Foote
Emanuela Handman
Colleen Elso
Lynden Roberts
Anuratha Sakthiandeswaren
Joan Curtis
Denise Bullen
Beena Kumar
Lynn Buckingham
Fleur Rodda
Claire, Kerry and Melissa (Kew)
Tracey Baldwin

Funding: HHMI, NIH, NHMRC, Gene CRC, NSF

Gordon Smyth
Russell Thompson
Ken Simpson

All WEHI

41


43

Comparison table: RLMM vs DM

n = 11,446 SNPs
99.7% concordance
Total discordant calls: 2,866

RLMM \ DM        AA         AB         BB
AA          341,211        945         24
AB              445    356,899        592
BB               28        832    329,164
