Analysis of Affymetrix and Illumina Array Data

SPH 247 Statistical Analysis of Laboratory Data

May 7, 2010

Basic Design of Expression Arrays

• For each gene that is a target for the array, we have a known DNA sequence.
• mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick to it.
• The DNA is labeled with a dye that fluoresces, generating a signal that is monotonic in the amount of the target in the sample.

Probe Sequence

[Figure: a DNA sequence with exon and intron regions labeled]

• cDNA arrays use variable-length probes derived from expressed sequence tags (ESTs)
  – Spotted and almost always used with two-color methods
  – Can be used in species with an unsequenced genome
• Long oligo arrays use 60-70mers
  – Agilent two-color arrays
  – Illumina Bead Arrays
  – Usually use computationally derived probes but can use probes from sequenced ESTs

• Affymetrix GeneChips use multiple 25-mers
  – For each gene, one or more sets of 8-20 distinct probes
  – May overlap
  – May cover more than one exon
• Affymetrix chips also use mismatch (MM) probes, which have the same sequence as the perfect match (PM) probes except for the middle base, which is changed to inhibit binding.
• The MM probes are supposed to act as a control, but they often bind to another mRNA species instead, so many analysts do not use them.

Illumina Bead Arrays

• Beads are coated with many copies of a 50-mer gene-specific probe and a 29-mer address sequence.
• Each probe is carried on multiple beads; the number is random, but around 20.
• Each chip of the Ref-8 contains 8 arrays with ~25,000 targets, plus controls.
• Each chip of the WG-6 contains 6 arrays with ~50,000 targets, plus controls.

Probe Design

• A good probe sequence should match the chosen gene, or an exon from that gene, and should not match any other gene in the genome.
• Melting temperature depends on the GC content and should be similar for all probes on an array, since the hybridization must be conducted at a single temperature.

• The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.
• This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.
• Thus only comparisons of intensity within the same probe across arrays make sense.

Affymetrix GeneChips

• For each probe set, there are 8-20 perfect match (PM) probes, which may or may not overlap and which target the same gene.
• There are also mismatch (MM) probes, which are supposed to serve as a control but do so rather badly.
• Most of us ignore the MM probes.

Expression Indices

• A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).
• A large number of methods have been suggested.
• Generally, the worst ones by a long way are those from Affymetrix; here "worse" means less able to detect real differences.
• Summarizing Illumina beads is simpler, but there are still issues.

Usable Methods

• Li and Wong's dChip and the follow-on work are demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA.
• The RMA method of Irizarry et al. is available in Bioconductor.
• The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor/CRAN as part of the LMGene R package.

Bioconductor Documentation

> library(affy)
Loading required package: Biobase
Loading required package: tools

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

Loading required package: affyio
Loading required package: preprocessCore

Bioconductor Documentation

> openVignette()
Please select a vignette:

 1: affy - 1. Primer
 2: affy - 2. Built-in Processing Methods
 3: affy - 3. Custom Processing Methods
 4: affy - 4. Import Methods
 5: affy - 5. Automatic downloading of CDF packages
 6: Biobase - An introduction to Biobase and ExpressionSets
 7: Biobase - Bioconductor Overview
 8: Biobase - esApply Introduction
 9: Biobase - Notes for eSet developers
10: Biobase - Notes for writing introductory 'how to' documents
11: Biobase - quick views of eSet instances

Selection:

Reading Affy Data into R

• The CEL files contain the data from an array.
• We will look at data from an older type of array, the U95A, which contains 12,625 probe sets and 409,600 probes.
• The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard types.

Example Data Set

• Data from Robert Rice's lab on twelve keratinocyte cell lines, at six different stages.
• Affymetrix HG U95A GeneChips.
• For each "gene", we will run a one-way ANOVA with two observations per cell.
• For this illustration, we will use RMA.

Files for the Analysis

• The .CDF file has the U95A chip definition (which probe is where on the chip). It is built into the affy package.
• The .CEL files contain the raw data after pixel-level analysis, one number for each spot. The files are called LN0A.CEL, LN0B.CEL, …, LN5B.CEL and are on the web site.
• 409,600 probe values in 12,625 probe sets.

The ReadAffy function

• The ReadAffy() function reads all of the CEL files in the current working directory into an object of class AffyBatch, which, like ExpressionSet, is built on the Biobase eSet class.
• ReadAffy(widget=TRUE) does so in a GUI that allows entry of other characteristics of the dataset.
• You can also specify filenames, phenotype or experimental data, and MIAME information.

> rrdata <- ReadAffy()

> class(rrdata)
[1] "AffyBatch"
attr(,"package")
[1] "affy"

> dim(exprs(rrdata))
[1] 409600     12

> colnames(exprs(rrdata))
 [1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
 [7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"

> length(probeNames(rrdata))
[1] 201800
> length(unique(probeNames(rrdata)))
[1] 12625
> length((featureNames(rrdata)))
[1] 12625
> featureNames(rrdata)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
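
The counts in this session fit together with a little arithmetic. The sketch below uses only the three numbers printed above; it assumes (this is not stated in the slides) that probeNames() returns one name per PM probe and that each PM probe has exactly one MM partner. The variable names are made up for the illustration.

```r
# Counts copied from the console session above
n_cells      <- 409600   # rows of exprs(rrdata): every cell on the array
n_pm_probes  <- 201800   # length(probeNames(rrdata)), assumed to count PM probes
n_probe_sets <- 12625    # length(featureNames(rrdata))

cells_per_side <- sqrt(n_cells)              # 640, i.e. a 640 x 640 grid of cells
probes_per_set <- n_pm_probes / n_probe_sets # roughly 16 probes per probe set
# Cells left over if every counted probe has one MM partner (assumption)
other_cells    <- n_cells - 2 * n_pm_probes

cells_per_side
round(probes_per_set, 2)
other_cells
```

On these assumptions the average of about 16 probes per probe set sits inside the 8-20 range quoted earlier, and the remaining cells are not part of any PM/MM pair.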

The ExpressionSet class

• An object of class ExpressionSet has several slots, the most important of which is an assayData object containing one or more matrices.
• The best way to extract parts of this is to use the appropriate methods.
  – exprs() extracts an expression matrix
  – featureNames() extracts the names of the probe sets

Expression Indices

• The 409,600 rows of the expression matrix in the AffyBatch object each correspond to a probe (25-mer).
• Ordinarily, to use these data we need to combine the probe-level data for each probe set into a single expression number.
• Conceptually, this has several steps.

Steps in Expression Index Construction

• Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.
• We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.

• Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.
• Common transformations are the logarithm and the generalized logarithm.
• Normalization is the process of adjusting for systematic differences from one array to another.
• Normalization may be done before or after transformation, and before or after probe set summarization.
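
The two transformations named above can be compared on a few values. This is a minimal sketch in base R, assuming the common form of the generalized logarithm, glog(x) = log(x + sqrt(x^2 + lambda)); the function name glog, the test intensities, and the value of lambda are all hypothetical choices for the illustration, and in practice lambda would be estimated from the data.

```r
# Generalized logarithm: close to log(2 * x) for large x,
# but still finite at zero and for small negative values
glog <- function(x, lambda) log(x + sqrt(x^2 + lambda))

x      <- c(0, 10, 100, 1000, 10000)  # made-up background-corrected intensities
lambda <- 100^2                       # hypothetical tuning constant

log_x  <- log(x)          # ordinary log: -Inf at zero
glog_x <- glog(x, lambda) # generalized log: finite everywhere here

# Side-by-side comparison; log(2 * x) is what glog approaches at the high end
data.frame(x = x, log = log_x, glog = glog_x, log_2x = log(2 * x))
```

At the high end the two scales differ only by the constant log(2), so comparisons between bright spots are unaffected, while values near zero no longer blow up.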

• One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes.
• There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (200 PM values and 200 MM values in total) into 10 expression index numbers.

Probe intensities for LASP1 in a radiation dose-response experiment

Probe                 0      1     10    100    Mean
200618_at1          360    216    158    198   233.0
200618_at2          313    402    106    103   231.0
200618_at3          130    182     79     91   120.5
200618_at4          351    370    195    136   263.0
200618_at5          164    130     98    107   124.8
200618_at6          223    219    164    196   200.5
200618_at7          437    529    195    158   329.8
200618_at8          509    554    274    128   366.3
200618_at9          522    720    285    198   431.3
200618_at10         668    715    247    260   472.5
200618_at11         306    286    144    159   223.8

Expression Index  362.1  393.0  176.8  157.6
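
The Mean column and the Expression Index row of this table are plain row and column averages, which can be checked in base R. The sketch below retypes the intensities from the table; the object names are made up for the illustration.

```r
# LASP1 probe intensities from the table: 11 probes (rows) by 4 doses (columns)
lasp1 <- matrix(
  c(360, 216, 158, 198,
    313, 402, 106, 103,
    130, 182,  79,  91,
    351, 370, 195, 136,
    164, 130,  98, 107,
    223, 219, 164, 196,
    437, 529, 195, 158,
    509, 554, 274, 128,
    522, 720, 285, 198,
    668, 715, 247, 260,
    306, 286, 144, 159),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("200618_at", 1:11), c("0", "1", "10", "100"))
)

probe_means      <- rowMeans(lasp1)  # the "Mean" column: one value per probe
expression_index <- colMeans(lasp1)  # the "Expression Index" row: one value per dose

round(expression_index, 1)  # 362.1 393.0 176.8 157.6
range(probe_means)          # 120.5 472.5
```

The probe means range from 120.5 to 472.5, a wider spread than the change in the expression index across doses. This is the earlier point about probe affinity in numbers: intensities are only comparable within a probe across arrays.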

Log probe intensities for LASP1 in a radiation dose-response experiment

Probe                0     1    10   100   Mean
200618_at1        2.56  2.33  2.20  2.30   2.35
200618_at2        2.50  2.60  2.03  2.01   2.28
200618_at3        2.11  2.26  1.90  1.96   2.06
200618_at4        2.55  2.57  2.29  2.13   2.38
200618_at5        2.21  2.11  1.99  2.03   2.09
200618_at6        2.35  2.34  2.21  2.29   2.30
200618_at7        2.64  2.72  2.29  2.20   2.46
200618_at8        2.71  2.74  2.44  2.11   2.50
200618_at9        2.72  2.86  2.45  2.30   2.58
200618_at10       2.82  2.85  2.39  2.41   2.62
200618_at11       2.49  2.46  2.16  2.20   2.33

Expression Index  2.51  2.53  2.21  2.18

The RMA Method

• Background correction that does not make 0 signal correspond to 0 amount
• Quantile normalization
• Log2 transform
• Median polish summary of PM probes
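
Two of the steps listed above, quantile normalization and the median polish summary, can be sketched in base R. This is an illustration of the ideas only, not the code used by rma(): background correction is left out, the helper quantile_normalize() and the toy matrix are made up for the example, ties are broken by order rather than averaged, and the median polish is run on the LASP1 probe intensities from the earlier table rather than on whole arrays.

```r
# --- Quantile normalization on a toy matrix (rows = probes, columns = arrays) ---
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  target <- rowMeans(apply(x, 2, sort))               # mean of the sorted columns
  out    <- apply(ranks, 2, function(r) target[r])    # map ranks to target values
  dimnames(out) <- dimnames(x)
  out
}

toy <- cbind(a = c(50, 20, 30, 40),
             b = c(45, 10, 42, 25),
             c = c(30, 40, 60, 80))   # hypothetical intensities
toy_qn <- quantile_normalize(toy)     # every column now has the same set of values

# --- Log2 transform and median polish on the LASP1 probes ---
lasp1 <- matrix(
  c(360, 216, 158, 198,  313, 402, 106, 103,  130, 182,  79,  91,
    351, 370, 195, 136,  164, 130,  98, 107,  223, 219, 164, 196,
    437, 529, 195, 158,  509, 554, 274, 128,  522, 720, 285, 198,
    668, 715, 247, 260,  306, 286, 144, 159),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("200618_at", 1:11), c("0", "1", "10", "100"))
)
log_pm <- log2(lasp1)

# Fit log intensity = overall + probe effect + array effect + residual
fit <- medpolish(log_pm, trace.iter = FALSE)

# One summary value per array (here, per dose)
expression <- fit$overall + fit$col
expression
```

Because the probe effects are fitted separately, a probe that is bright on every array does not inflate the summary, and medians make the fit resistant to a single outlying probe.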

> eset <- rma(rrdata)
trying URL 'http://bioconductor.org/packages/2.1/…
Content type 'application/zip' length 1352776 bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb

package 'hgu95av2cdf' successfully unpacked and MD5 sums checked

The downloaded packages are in
        C:\Documents and Settings\dmrocke\Local Settings…
updating HTML package descriptions
Background correcting
Normalizing
Calculating Expression

> class(eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
> dim(exprs(eset))
[1] 12625    12
> featureNames(eset)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"

> exprs(eset)[1:5,]
          LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL
100_g_at  9.195937 9.388350 9.443115 9.012228 9.311773 9.386037 9.386089
1000_at   8.229724 7.790238 7.733320 7.864438 7.620704 7.930373 7.502759
1001_at   5.066185 5.057729 4.940588 4.839563 4.808808 5.195664 4.952883
1002_f_at 5.409422 5.472210 5.419907 5.343012 5.266068 5.442173 5.190440
1003_s_at 7.262739 7.323087 7.355976 7.221642 7.023408 7.165052 7.011527
          LN3B.CEL LN4A.CEL LN4B.CEL LN5A.CEL LN5B.CEL
100_g_at  9.394606 9.602404 9.711533 9.826789 9.645565
1000_at   7.463158 7.644588 7.497006 7.618449 7.710110
1001_at   4.871329 4.875907 4.853802 4.752610 4.834317
1002_f_at 5.200380 5.436028 5.310046 5.300938 5.427841
1003_s_at 7.185894 7.235551 7.292139 7.218818 7.253799
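
The one-way ANOVA planned for each probe set can be sketched with the first row (100_g_at) of the output above. This is a minimal illustration in base R, assuming that the digit in each file name (LN0 to LN5) is the stage and that A and B are the two replicates at that stage; the variable names are made up for the example.

```r
# RMA expression values for probe set 100_g_at, copied from the output above,
# in column order LN0A, LN0B, LN1A, LN1B, ..., LN5A, LN5B
y <- c(9.195937, 9.388350, 9.443115, 9.012228, 9.311773, 9.386037,
       9.386089, 9.394606, 9.602404, 9.711533, 9.826789, 9.645565)

# Six stages with two arrays each (assumed from the file names)
stage <- factor(rep(0:5, each = 2))

# One-way ANOVA: does mean expression differ between stages?
fit <- aov(y ~ stage)
anova_table <- summary(fit)[[1]]
anova_table
```

With six stages and two arrays per stage there are 5 degrees of freedom for stage and only 6 for error, so each single-gene test has limited power. Repeating the same fit for every row of the expression matrix gives one p-value per probe set.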

> summary(exprs(eset))
    LN0A.CEL         LN0B.CEL         LN1A.CEL         LN1B.CEL
 Min.   : 2.713   Min.   : 2.585   Min.   : 2.611   Min.   : 2.636
 1st Qu.: 4.478   1st Qu.: 4.449   1st Qu.: 4.458   1st Qu.: 4.477
 Median : 6.080   Median : 6.072   Median : 6.070   Median : 6.078
 Mean   : 6.120   Mean   : 6.124   Mean   : 6.120   Mean   : 6.128
 3rd Qu.: 7.443   3rd Qu.: 7.473   3rd Qu.: 7.467   3rd Qu.: 7.467
 Max.   :12.042   Max.   :12.146   Max.   :12.122   Max.   :11.889
    LN2A.CEL         LN2B.CEL         LN3A.CEL         LN3B.CEL
 Min.   : 2.598   Min.   : 2.717   Min.   : 2.633   Min.   : 2.622
 1st Qu.: 4.444   1st Qu.: 4.469   1st Qu.: 4.425   1st Qu.: 4.428
 Median : 6.008   Median : 6.058   Median : 6.017   Median : 6.028
 Mean   : 6.109   Mean   : 6.125   Mean   : 6.116   Mean   : 6.117
 3rd Qu.: 7.426   3rd Qu.: 7.422   3rd Qu.: 7.444   3rd Qu.: 7.459
 Max.   :13.135   Max.   :13.110   Max.   :13.106   Max.   :13.138
    LN4A.CEL         LN4B.CEL         LN5A.CEL         LN5B.CEL
 Min.   : 2.742   Min.   : 2.634   Min.   : 2.615   Min.   : 2.590
 1st Qu.: 4.468   1st Qu.: 4.433   1st Qu.: 4.448   1st Qu.: 4.487
 Median : 6.074   Median : 6.050   Median : 6.053   Median : 6.068
 Mean   : 6.122   Mean   : 6.120   Mean   : 6.121   Mean   : 6.123
 3rd Qu.: 7.460   3rd Qu.: 7.478   3rd Qu.: 7.477   3rd Qu.: 7.457
 Max.   :12.033   Max.   :12.162   Max.   :11.925   Max.   :11.952

Probe Sets not Genes

• It is hard to avoid referring to a probe set as measuring a "gene", but doing so can be deceptive.
• The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism.
• Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied.

Exercise

• Download the ten arrays from the web site.
• Load the arrays into R using ReadAffy and construct the RMA expression indices.