Post on 21-Dec-2015

Page 1: Data analytical issues with high-density oligonucleotide arrays A model for gene expression analysis and data quality assessment

Data analytical issues with high-density oligonucleotide arrays

A model for gene expression analysis and data quality assessment

Outline

• Description of high-density oligonucleotide expression array data

• Derivation of a model for gene expression estimation

• Application of the model for data quality assessment

Gene Expression Analysis

• Central Dogma:

DNA -> mRNA -> Protein

• By comparing the abundance of mRNA in different cells, we can infer which genes are associated with a cell condition.

• Oligonucleotide arrays enable quantitative, highly parallel measurements of gene expression.

Probe Selection

Probes are 25-mers selected from the target sequence.

Each of the 5-20K target fragments is interrogated by a probe set of 11-20 probes.

Data preparation

• RNA samples are prepared, labeled, and hybridized with arrays.

• Arrays are scanned and the resulting image analyzed to produce an intensity value for each probe cell indicating how much hybridization occurred.

• The goal is to combine the probe intensities for a given gene into an index of expression – an indicator of mRNA abundance.

Oligonucleotide Arrays

[Figure: GeneChip Probe Array, Hybridized Probe Cell, and Image of Hybridized Probe Array. Labels: 1.28 cm; 18 µm; 10^6-10^7 copies of a specific oligonucleotide probe per feature; >450,000 different probes; oligonucleotide probe; single stranded, labeled RNA target.]

Compliments of D. Gerhold

Outline

• Description of high-density oligonucleotide expression array data

• Derivation of a model for gene expression analysis

• Application of the model for data quality assessment

Probe intensity vs. concentration, example 1

The probe intensity model

On a probe set by probe set basis, the log of the probe intensities, $Y_{jk}$ say, is modelled as the sum of a probe effect $\delta_j$ and a chip effect $\alpha_k$, plus an error term $\epsilon_{jk}$:

$Y_{jk} = \delta_j + \alpha_k + \epsilon_{jk}$

To make this model identifiable, we constrain the sum of the probe effects to be zero. The $\delta_j$'s can be interpreted as relative non-specific binding effects for the probes.

The parameters $\alpha_k$ provide an index of expression for each chip.
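As a minimal sketch of the additive model, assuming a complete probes-by-chips matrix of log intensities and an ordinary least-squares fit (not the robust fit discussed later), the sum-to-zero constraint gives closed-form estimates: chip effects are column means and probe effects are row means minus the grand mean. The function name `fit_additive` and the toy numbers are hypothetical.

```python
import numpy as np

def fit_additive(log_intensity):
    """Least-squares fit of Y[j, k] = probe[j] + chip[k] + error,
    with the probe effects constrained to sum to zero."""
    y = np.asarray(log_intensity, dtype=float)
    chip = y.mean(axis=0)               # chip effects: column means
    probe = y.mean(axis=1) - y.mean()   # probe effects: row means minus grand mean
    resid = y - probe[:, None] - chip[None, :]
    return probe, chip, resid

# Hypothetical noise-free probe set: 4 probes (rows) on 3 chips (columns).
true_probe = np.array([-1.0, 0.5, 0.0, 0.5])   # sums to zero
true_chip = np.array([8.0, 9.0, 10.0])
y = true_probe[:, None] + true_chip[None, :]

probe, chip, resid = fit_additive(y)
```

With noise-free input the fit recovers the probe and chip effects exactly; with real data the residuals carry the information used later for quality assessment.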

Example - detecting differential expression

Fit the model to 24 chips that share a common source of RNA, plus 12 RNAs spiked in at pM concentrations that differ 2-fold between the two groups of 12 chips.

Spike-in concentrations (pM):

Probe set   Group A   Group B
1           0.25      0.5
2           0.5       1
3           1         2
4           2         4
5           4         8
6           8         16
7           16        32
8           32        64
9           128       256
A           256       512
B           512       1024
C           512       1024
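The spike-in design can be checked with a few lines of arithmetic. The sketch below takes the concentrations listed above and computes, for each probe set, the log2 ratio between groups and the average log2 concentration, which are the M and A coordinates commonly used in MVA-style plots. The variable names are hypothetical, and these are nominal concentrations, not fitted expression indices.

```python
import math

# Nominal spike-in concentrations in pM: probe set -> (Group A, Group B).
spike_pm = {
    "1": (0.25, 0.5), "2": (0.5, 1), "3": (1, 2), "4": (2, 4),
    "5": (4, 8), "6": (8, 16), "7": (16, 32), "8": (32, 64),
    "9": (128, 256), "A": (256, 512), "B": (512, 1024), "C": (512, 1024),
}

# M: log2 fold change between groups; A: average log2 concentration.
m_values = {name: math.log2(b) - math.log2(a) for name, (a, b) in spike_pm.items()}
a_values = {name: (math.log2(a) + math.log2(b)) / 2 for name, (a, b) in spike_pm.items()}
```

Every spiked probe set has a nominal log2 ratio of 1, so a well-behaved expression index should show the 12 spike-ins near M = 1 and the remaining probe sets near M = 0.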

MVA A vs B

Expression index vs. concentration

Robust procedures

Robust procedures perform well under a range of possible models and greatly facilitate the detection of anomalous data points.

Why robust?

• Image artifacts

• Bad probes

• Bad chips

• Quality assessment

Robust fit example A

Robust fit example B

Residuals from fit

Outline

• Description of high-density oligonucleotide expression array data

• Derivation of a model for gene expression estimation

• Application of the model to data quality assessment

Chip manufacturer QA protocols

• Starting RNA QA – look at gel patterns and RNA quantification.

• Post-hybridization QA – image examination, chip intensity parameters, expression values for control genes of various sorts, housekeeping genes, percent present calls.

Goal: measuring expression data quality

Manufacturer QA guidelines emphasize maintenance of data comparability across chips in an analysis set.

We seek assessments that measure data quality as it pertains to expression values.

In particular, we would like to provide quantitative measures that can help in making decisions – Accept, Reject or Adjust.

Model components – role in QA

• Probe effects
- can only be compared across fitting sets.

• Chip effects (expression indices)
- can examine distribution of relative expressions across arrays.

• Residuals – more than 200K per chip.
- view as chip image, summarize spatial patterns.
- summarize in batches by chip.
- combine to estimate SE of expression indices; these are pooled and summarized by chip.

Robust fit by IRLS for each probe set

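Starting with a robust fit, at each iteration compute from the residuals $r_{jk}$:

$S = \mathrm{mad}(r_{jk})$ – robust estimate of the scale $\sigma$

$u_{jk} = r_{jk}/S$ – standardized residuals

$w_{jk} = \psi(|u_{jk}|)$ – weights to reduce the effect of deviant points on the next fit

The SE of the final expression index $a_k$ is given by

$\mathrm{SE}(a_k) = S / \sqrt{\sum_j w_{jk}}$

Unscaled $\mathrm{SE}(a_k) = 1 / \sqrt{\sum_j w_{jk}}$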
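The iteration can be sketched for a single probe set as follows. Several details are assumptions rather than anything stated on the slide: the starting fit uses column and row medians, the MAD is scaled by 1.4826 for consistency with a normal standard deviation, the weights are Huber weights with tuning constant 1.345, and each weighted fit is obtained by alternating weighted means. All function names are hypothetical.

```python
import numpy as np

def huber_weight(u, c=1.345):
    """Huber weights (assumed choice): 1 for |u| <= c, c/|u| beyond."""
    return c / np.maximum(np.abs(u), c)

def weighted_fit(y, w, n_sweeps=50):
    """Weighted fit of y[j, k] = probe[j] + chip[k] by alternating weighted
    means, re-centering so the probe effects sum to zero."""
    probe = np.zeros(y.shape[0])
    for _ in range(n_sweeps):
        chip = (w * (y - probe[:, None])).sum(axis=0) / w.sum(axis=0)
        probe = (w * (y - chip[None, :])).sum(axis=1) / w.sum(axis=1)
        shift = probe.mean()
        probe -= shift
        chip += shift
    return probe, chip

def irls_fit(y, n_iter=20):
    """Iteratively reweighted fit; returns chip effects, SE, unscaled SE, weights."""
    y = np.asarray(y, dtype=float)
    chip = np.median(y, axis=0)                    # crude robust starting fit
    probe = np.median(y - chip[None, :], axis=1)
    w = np.ones_like(y)
    s = 0.0
    for _ in range(n_iter):
        r = y - probe[:, None] - chip[None, :]             # residuals r_jk
        s = 1.4826 * np.median(np.abs(r - np.median(r)))   # S = mad(r_jk)
        if s == 0.0:
            break
        w = huber_weight(r / s)                            # w_jk from u_jk = r_jk / S
        probe, chip = weighted_fit(y, w)
    unscaled_se = 1.0 / np.sqrt(w.sum(axis=0))     # 1 / sqrt(sum_j w_jk)
    return chip, s * unscaled_se, unscaled_se, w

# Hypothetical probe set: 11 probes on 4 chips, with one corrupted cell.
rng = np.random.default_rng(0)
true_probe = np.linspace(-1.0, 1.0, 11)
true_chip = np.array([7.0, 8.0, 9.0, 10.0])
y = true_probe[:, None] + true_chip[None, :] + rng.normal(0.0, 0.05, (11, 4))
y[2, 1] += 3.0                                     # simulated artifact

chip, se, unscaled_se, w = irls_fit(y)
```

In this toy example the corrupted cell ends up with the smallest weight, so it barely moves the chip effect for its array. The weights and residuals returned here are the quantities that the later slides display as chip images.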

$\psi$ function

Images of weights

• For 24 chips from Affymetrix, look at patterns of weights on chip real estate.
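One way to build such an image, sketched under the assumption that each probe's column and row position on the chip is known, is to scatter the per-probe weights into a two-dimensional array. The function name, argument names, and the tiny 3 x 4 layout are hypothetical; the same helper would work for the sign of the residuals by passing `np.sign(residuals)` instead of weights.

```python
import numpy as np

def values_to_chip_image(x, y, values, n_rows, n_cols):
    """Place per-probe values at their (row y, column x) chip positions.
    Cells with no probe stay NaN."""
    image = np.full((n_rows, n_cols), np.nan)
    image[np.asarray(y), np.asarray(x)] = values
    return image

# Hypothetical layout: five probes on a 3 x 4 grid.
x = [0, 1, 2, 3, 2]
y = [0, 0, 1, 2, 2]
weights = [1.0, 1.0, 0.3, 0.9, 0.1]

image = values_to_chip_image(x, y, weights, n_rows=3, n_cols=4)
```

The resulting array can be handed to any image display routine, for example `matplotlib.pyplot.imshow`, so that clusters of low weights show up as spatial artifacts.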

Images of weights

Images of sign of residuals

• For 24 chips from Affymetrix, look at patterns of sign of residuals on chip real estate.

Images of sign of residuals

Residual summaries

MVA plot of expression indices

Future developments

Develop quality assessment measures for routine use in a high-throughput environment.

Assess relationships among various QA measures.

Develop diagnostics to assign causes to departures from quality standards.

Other applications -

Identify non-performing or cross-hybridizing probes, qualify probe sets.

References

1. New Statistical Algorithms for Monitoring Gene Expression on GeneChip® Probe Arrays, Affymetrix technical report.

2. Array Design for the GeneChip® Human Genome U133 Set, Affymetrix technical note.

3. Discussion on Background, Ben Bolstad.

4. Bolstad, B.M. et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185-193.

5. Irizarry, R. et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, Vol. 31, No. 4, e15.

6. Irizarry, R. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, in press.

7. http://array.mc.vanderbilt.edu/Pages/VMSR_Info/Sample_submission.htm

Background correction

Background correction – to correct for differential background due to experimental processing effects and to put the estimated differential expression on a proper scale.

Normalization – to correct for systematic differences in the distribution of probe intensities across chips.
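The slide does not say which normalization method is used. As an illustration only, the sketch below implements quantile normalization, one common choice for this kind of data, which forces every chip to share the same intensity distribution. The function name is hypothetical and ties are not treated specially.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (chips) of a probes-by-chips matrix.
    Each value is replaced by the mean, across chips, of the values that
    share its within-chip rank."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # within-column ranks
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)    # reference distribution
    return mean_quantiles[ranks]

# Hypothetical intensities: 3 probes on 2 chips.
raw = np.array([[2.0, 4.0],
                [1.0, 8.0],
                [3.0, 6.0]])
normalized = quantile_normalize(raw)
```

After normalization both columns contain the same set of values (2.5, 4.0, 5.5), arranged according to each chip's original ordering of probes.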