BeadArray expression analysis using Bioconductor

Mark Dunning, Wei Shi, Andy Lynch, Mike Smith and Matt Ritchie

November 5, 2019

Introduction to this vignette

This vignette describes how to process Illumina BeadArray gene expression data in its various formats: raw files produced by the scanning software, summarized BeadStudio/GenomeStudio output, and data deposited in a public microarray database. For each file format we introduce a series of ‘use cases’ in the order in which they might be encountered by an analyst, and demonstrate how to perform specific tasks on the example datasets using R code from Bioconductor or base packages. The R code is explained immediately afterwards and, where appropriate, we give an interpretation of the resultant output (indicated by the 2� symbol) and pointers (Z) to how the R code might be adapted to other datasets or use cases. Where attention must be paid to avoid alarming results, we give warnings. It is assumed that the reader is familiar with basic R data structures and operations such as plotting and reading text files; explanations of these functions are therefore not given.

The version of R used to compile this vignette is 3.6.1. The Bioconductor packages required can be installed by typing the following commands at the R command prompt:

> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install(c("beadarray", "limma", "GEOquery",
+     "illuminaHumanv1.db", "illuminaHumanv2.db", "illuminaHumanv3.db",
+     "BeadArrayUseCases"))

There are also a few packages that are used to illustrate particular use cases, but are not crucial to the vignette. They can be installed in advance, or as required.

> BiocManager::install(c("GOstats", "GenomicRanges", "Biostrings"))

Introduction to the BeadArray technology

Illumina Inc. (San Diego, CA) are manufacturers of the BeadArray microarray technology, which can be used in genomic studies to profile transcript expression or methylation status, or to genotype SNPs. The BeadArray platform makes use of 50mer oligonucleotide probes attached to beads that are randomly assembled and replicated many times on each array. Different probes are designed to target specific positions (transcripts/SNPs) in the genome. A series of decoding hybridizations performed in-house by Illumina determines the identity of each bead on each array.

Figure 1: Schematic of a WG-6 BeadChip and BeadArray section (panels: WG-6 Expression BeadChip; view of a BeadArray showing randomly assembled, replicate beads). This type of BeadChip is made up of 12 sections (also referred to as ‘strips’) which are paired to allow 6 samples to be processed in parallel per chip. Sections are densely packed with beads which are randomly assembled on the array surface. Another feature of the technology is the high degree of replication of beads assigned a particular probe sequence (from a possible set of ∼48,000 unique sequences).

A BeadChip comprises a series of rectangular BeadArray sections on a slide, with each section (sometimes referred to as a strip) containing many thousands of different probes (see Figure 1). The naming of a chip is decided by the number of unique hybridizations possible and the version of the probe annotation. For example, there are six pairs of sections on each Human WG-6 version 2 BeadChip, and 12 sections, one per sample, on a Human HT-12 version 3 BeadChip. The high degree of replication makes robust measurements for each probe possible and provides the opportunity to detect and correct for spatial artefacts that occur on the array surface. Illumina’s gene expression arrays also contain many control probes, and one particular class of these, known as ‘negative’ controls, can be used in the analysis to improve inference.

Experimental data

We make use of three datasets in this vignette (summarized in Table 1) to illustrate how data in different formats that have undergone varying degrees of preprocessing can be imported and analyzed using Bioconductor software.

The first example consists of bead-level data from a series of Human HT-12 version 3 BeadChips hybridized at the Cancer Research UK, Cambridge Research Institute Genomics Core facility. ‘Bead-level’ refers to the availability of intensity and location information for each bead on each BeadArray in an experiment. In this dataset, BeadArrays were hybridized with either Universal Human Reference RNA (UHRR, Stratagene) or Brain Reference RNA (Ambion) as used in the MAQC project [1]. Bead-level data for all 12 arrays are included in the file beadlevelbabfiles.zip which is available as part of the BeadArrayUseCases package. These data are in the compressed .bab format [2], which can be analysed using the beadarray package. For one array (4613710052_B) we also provide the uncompressed data including the raw TIFF image. This allows us to demonstrate the processing options available when pixel-level information is at hand.

The second dataset is the AsuragenMAQC_BeadStudioOutput.zip file from Illumina’s website: http://www.switchtoi.com/datasets.ilmn

This experiment uses a Human WG-6 version 2 BeadChip, and consists of 3 replicates of each of the UHRR and Brain Reference samples. These data are available at the summary level as generated by Illumina’s BeadStudio software. The Bioconductor packages lumi, limma and beadarray can all handle data in this format. For this dataset, we focus on tools available in the limma package.

The final dataset, which is also at the summary level, is publicly available from the GEO database (accession GSE5350) and was deposited by the MAQC project [1]. This experiment included pure UHRR and Brain Reference RNA samples as well as mixtures of these two samples (the mixture proportions were 75:25 and 25:75) hybridized to Human WG-6 version 1 BeadChips. We retrieve this experiment from the GEO database using the GEOquery package.

The goal of each analysis is to find differentially expressed probes between the two distinct RNA samples included in each experiment (UHRR and Brain Reference). For the differential expression analysis, we make use of the linear modelling approach available in the limma package.

Table 1: Summary of the datasets analyzed in the vignette.

                                                   Number of arrays    Number of array sections
Dataset  Type of data          Array generation    UHRR    Brain       UHRR    Brain
1        raw/bead-level        HT-12 version 3     6       6           6       6
2        summary (BeadStudio)  WG-6 version 2∗     3       3           6       6
3        summary (GEO)         WG-6 version 1      15      15          30      30

∗ Note that these data cannot be analyzed at the section level (raw or bead-level data would be required for this).

1 Analysis of bead-level and raw data using beadarray

Raw and bead-level data types

Reading raw or bead-level data into R using the beadarray package requires several files produced by Illumina’s scanning software. We briefly describe these files below.

• text files (required - unless bab files present) - a text file (.txt or .csv) for each section which stores the position, identity and intensity of each bead. These files are usually named chipID_section.txt for arrays from BeadChips (e.g. 4613710017_B.txt) and are required because of the random arrangement of probes on the array surface that is unique for each BeadArray.

• locs files (optional) - 1 (single channel) or 2 (two-color) for each section on a BeadChip. These are usually named using the convention chipID_section_channel.locs. The locs file stores the locations of all beads on the array, including all those that could not be decoded (beads present on the array, but not in the text files). The locs files are particularly useful for investigating spatial phenomena on the arrays.

• bab files (optional) - one for each section of a BeadChip. These files contain all of the information from the text and locs files, stored in a substantially more efficient manner [2].

• tiff files (optional) - 1 (single channel) or 2 (two-color) for each section on a BeadChip. These are usually named using the convention chipID_section_channel.tif. For example, 4613710017_B_Grn.tif is the Cy3 (green) image for the sample in position B on BeadChip 4613710017. The Cy5 (red) files end in _Red.tif. We refer to these as the raw data, as access to these images allows the user to carry out their own image processing, and provides access to the background intensities (the values stored in the text files have already been background corrected).

• sdf file (optional) - Illumina’s sample descriptor file (one per BeadChip, e.g. 4613710017.sdf). This file is used to determine the physical properties of a section and which sections to combine for each sample.

• targets file (recommended) - contains sample information for each array. See the file targetsHT12.txt for an example.

• metrics file (optional) - one for each BeadChip (usually named Metrics.txt) which contains summary information about intensity, the amount of saturation, focus and registration on the image(s) from each section.
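The naming conventions listed above can be used to recover the chip, section and channel from a file name. The following is a minimal base R sketch, assuming file names that follow the chipID_section or chipID_section_channel pattern exactly; the vector `files` and the data frame `fileInfo` are hypothetical names introduced for illustration, and no beadarray function is involved.

```r
# Hypothetical file names following the conventions described above.
files <- c("4613710017_B.txt", "4613710017_B_Grn.tif",
           "4613710017_B_Red.tif", "4613710017_B_Grn.locs")

# Drop the extension, then split the remainder on underscores.
stems <- sub("\\.[^.]+$", "", files)
parts <- strsplit(stems, "_", fixed = TRUE)

fileInfo <- data.frame(
  file    = files,
  chipID  = vapply(parts, function(p) p[1], character(1)),
  section = vapply(parts, function(p) p[2], character(1)),
  # Text files carry no channel suffix, so the channel is NA for them.
  channel = vapply(parts,
                   function(p) if (length(p) >= 3) p[3] else NA_character_,
                   character(1)),
  stringsAsFactors = FALSE
)
print(fileInfo)
```

A table of this kind may help when checking that every section has the expected set of files before attempting to read a directory.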

To obtain the tiff and text files from Illumina’s BeadScan software version 3.1 you will need to modify the settings.xml file used by the software. For further details see the Scanning Settings section of http://www.compbio.group.cam.ac.uk/Resources/illumina/. For iScan, the steps are similar, although there is a separate settings file for each platform (expression, Infinium genotyping, etc).

Quality assessment using scanner metrics

A first view of array quality can be obtained from the metrics calculated by the scanner. These include the 95th (P95) and 5th (P05) percentiles of all pixel intensities on the image. A signal-to-noise ratio (SNR) can be calculated as the ratio of these two quantities (P95/P05). These metrics can be viewed in real time as the arrays themselves are being scanned. By tracking these metrics over time, one can potentially halt problematic experiments before they even reach the analysis stage.

Use Case: Plot the P95:P05 signal-to-noise ratio for the HT-12 arrays in this experiment and assess whether any samples appear to be outliers that may need to be removed or down-weighted in further analyses.

> ht12metrics <- read.table(system.file("extdata/Chips/Metrics.txt",
+     package = "BeadArrayUseCases"), sep = "\t", header = TRUE,
+     as.is = TRUE)
> ht12snr <- ht12metrics$P95Grn/ht12metrics$P05Grn
> labs <- paste(ht12metrics[, 2], ht12metrics[, 3], sep = "_")
> par(mai = c(1.5, 0.8, 0.3, 0.1))
> plot(1:12, ht12snr, pch = 19, ylab = "P95 / P05", xlab = "",
+     main = "Signal-to-noise ratio for HT12 data", axes = FALSE,
+     frame.plot = TRUE)
> axis(2)
> axis(1, 1:12, labs, las = 2)

The above code uses standard R functions to obtain the P95 and P05 values from the metrics file stored in the package. The system.file function is a base function that will locate the directory where the BeadArrayUseCases package is installed. The plotting commands are just a suggestion of how the data could be presented and could be adapted to individual circumstances.

2� The SNR for these arrays seems to be over 15 in most cases, although there is one exception that is below 2. Illumina recommend that this ratio be above 10 and in our experience a value below 2 would be grounds for discarding a sample from further analysis. The P95 value for this sample is typical of what we would expect from a blank array with no RNA hybridized!

Z Scanner metrics information is equally useful for sample quality assessment of summarized data.

Z The P95 and P05 values will fluctuate over time and are dependent upon the scanner setup. Including SNR values for arrays other than those currently being analysed will give a better indication of whether any outlier arrays exist.
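The thresholds mentioned above (a recommended ratio above 10, and a value below 2 as grounds for discarding a sample) can be applied programmatically. The following is a minimal sketch on an invented metrics table: the P95Grn and P05Grn column names match those used above, while the Section column, the numeric values and the status labels are assumptions made purely for illustration.

```r
# Invented scanner metrics for four sections (values are illustrative only).
metrics <- data.frame(
  Section = c("A", "B", "C", "D"),
  P95Grn  = c(1200, 950, 80, 1100),
  P05Grn  = c(60, 58, 55, 62)
)

# Signal-to-noise ratio as defined in the text: P95 / P05.
metrics$SNR <- metrics$P95Grn / metrics$P05Grn

# Label each section using the two thresholds quoted above.
metrics$status <- ifelse(metrics$SNR < 2, "discard",
                         ifelse(metrics$SNR < 10, "check", "ok"))
print(metrics)
```

Because the P95 and P05 values depend on the scanner setup, such fixed cut-offs are best treated as a first screen rather than a final decision.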

Data import and storage

The next step in our analysis is to read the data into R using the readIllumina function. The bead-level data you will need for this example are available in the file beadlevelbabfiles.zip (133 MB) which is located in the extdata/BeadLevelBabFiles directory of the BeadArrayUseCases package. These files were produced using the BeadDataPackR package in Bioconductor which provides a more compact representation of bead-level data.

Use Case: Read the sample information and bead-level data (stored in compressed bab format) into R.

> library(beadarray)
> chipPath <- system.file("extdata/Chips", package = "BeadArrayUseCases")
> list.files(chipPath)
> sampleSheetFile <- paste(chipPath, "/sampleSheet.csv",
+     sep = "")
> readLines(sampleSheetFile)
> data <- readIllumina(dir = chipPath, sampleSheet = sampleSheetFile,
+     useImages = FALSE, illuminaAnnotation = "Humanv3")

Usually the directory will be in a known location, but for convenience we use the system.file function to locate this directory and the sample sheet. The section names to be read are then constructed using the contents of the sample sheet.

The final line executes the reading of the data. By default readIllumina will look in the current working directory and try to find all BeadArray-associated files. The function will read any text (or .bab) and TIFF files (if instructed to do so) and save the names of any locs and sdf files for future reference.

Here we have used the dir and sampleSheet arguments to specify which directory to look in and, through the sample sheet, the names of the sections that should be imported. By setting useImages to FALSE, we use the intensity values stored in the text files, rather than recomputing them from the images. The argument illuminaAnnotation is a character string to identify the organism and annotation revision number of the chip being analysed, but not the number of samples on the chip. Hence the value Humanv3 can be used for both Human WG-6 and Human HT-12 v3 data. Setting this value correctly will allow beadarray to identify what bead types are to be used for control purposes and convert the numeric ArrayAddressIDs to a more commonly-used format. If you are unsure of the correct annotation to use, this argument can be left blank at this stage.

Z If the data to be imported are not in .bab format, specifying the same arguments to readIllumina will search for files of type .txt instead.

Use of the sectionNames argument is recommended when the directory containing bead-level data contains files other than those produced by the Illumina scanner. Extraneous files in such directories may confuse the readIllumina function.

Storing and reading bead-level data requires a considerable amount of disk space and RAM. For this example, around 1.1 GB of RAM is required to read in and store the data. One could process bead-level data in smaller batches if memory is limited and then combine at the summary level.

Selecting the annotation for a dataset

If you are unsure of the correct annotation to use and thus left the illuminaAnnotation argument to readIllumina as the default, suggestAnnotation can be employed to identify the platform, based on the ArrayAddressIDs that are present in the data. The setAnnotation function can then be used to assign this annotation to the dataset.

Use Case: Verify the version number of the dataset that has been read in and set the annotation of the bead-level data object accordingly.

> suggestAnnotation(data, verbose = TRUE)
> annotation(data) <- "Humanv3"

2� You should see that the percentage of overlapping IDs is greatest for the Humanv3 platform.

Z The result of suggestAnnotation only gives guidance about which annotation to use, and the results may be unpredictable on custom arrays, or arrays that are not listed in the suggestAnnotation output.

The beadLevelData class

Once imported, bead-level data are stored in an object of class beadLevelData. This class can handle raw data from both single-channel and two-color BeadArray platforms.

> slotNames(data)

The command above gives us an overview of the structure of the beadLevelData class, which is composed of several slots. The experimentData, sectionData and beadData slots can be viewed as a hierarchy, with each containing data at a different level. Each can be accessed using the @ operator.

The experimentData slot holds information that is consistent across the entire dataset. Quantities with one value per array-section are stored in the sectionData slot. For instance, any metrics information read by readIllumina, along with section names and the total number of beads, will be stored there. This is also a convenient place to store any QC information derived during the preprocessing of the data. The data extracted from the individual text files are stored in the beadData slot.

Accessing data in a beadLevelData object

Use Case: Output the data stored in the sectionData slot, and determine the section names and number of bead intensities available from each section. Access the intensities, x-coordinates and probe IDs for the first 5 beads on the first array section.

1 > data@sectionData
2 > sectionNames(data)
3 > numBeads(data)
4 > head(data[[1]])
5 > getBeadData(data, array = 1, what = "Grn")[1:5]
6 > getBeadData(data, array = 1, what = "GrnX")[1:5]
7 > getBeadData(data, array = 1, what = "ProbeID")[1:5]

Using the @ operator to access the data in particular slots is not particularly convenient or intuitive. The functions sectionNames, numBeads and getBeadData provide more convenient interfaces to the beadLevelData object to retrieve specific information.

The first line above uses the @ operator to access all the data in the sectionData slot, which can be quite large and unwieldy. The commands below it (lines 2-3) are accessor functions for retrieving a specific subset of data from the same slot.

Line 4 shows that if a beadLevelData object is accessed in the same fashion as a list, a data frame containing the bead-level data for the specified array is returned. To access a specific entry in this data frame, we can use a further subset, or the data can be accessed using the getBeadData function. In addition to the beadLevelData object, you need to specify the section (array=...) of interest and the column heading you want.

Extracting transformed data

In this example, the data stored in the beadLevelData object by readIllumina are extracted directly from the Illumina text files. The values in the Grn vector are intensity values inferred from a known location in the scanned image and there are a number of steps involved before these can be translated into quantities that relate to the expression levels. The scanner generally produces values in the range 0 to 2^16 − 1, although the image manipulation and background subtraction steps can lead to values outside this range. This is not a convenient scale for visualization and analysis and it is common to convert intensities onto the approximate range 0 to 16 using a log2 transformation (possibly after an additional step to adjust non-positive intensities).
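As a small illustration of this conversion, the sketch below applies a log2 transformation to a handful of invented intensities and shows two possible ways of handling non-positive values. Both adjustments are assumptions chosen for illustration, not a statement of what any particular package does.

```r
# Invented intensities spanning the scanner range, including values that
# background subtraction has pushed to zero or below.
intensities <- c(-12, 0, 1, 150, 4096, 2^16 - 1)

# Option 1: treat non-positive intensities as missing.
logNA <- rep(NA_real_, length(intensities))
positive <- intensities > 0
logNA[positive] <- log2(intensities[positive])

# Option 2: floor intensities at 1 so that they map to 0 on the log2 scale.
logFloor <- log2(pmax(intensities, 1))

print(cbind(intensities, logNA, logFloor))
```

The first option discards information about how many beads were affected unless the missing values are counted; the second keeps every bead but compresses all non-positive values onto a single point.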

Although this is simple to do in isolation using R’s built-in functions, it becomes more complicated within a function, leading to a large number of arguments being required in order to specify whether the function should process the green or red channel, use foreground or background intensities, convert to the log2 scale etc. Even in this situation the user is restricted to the options that are provided by the arguments.

A more flexible way to obtain transformed per-bead data from a beadLevelData object is to define a transformation function that takes as arguments the beadLevelData object and an array index. The function then manipulates the data in the desired manner and returns a vector the same length as the number of beads on the array. Many of the plotting and quality assessment functions within beadarray take such a function as one of their arguments. By using such a system, beadarray provides a great deal of flexibility over exactly how the data are analysed.

Use Case: Extract the green intensities on the log2 scale for the first 10 beads from the first array section.

1 > log2(data[[1]][1:10, "Grn"])
2 > log2(getBeadData(data, array = 1, what = "Grn")[1:10])
3 > logGreenChannelTransform(data, array = 1)[1:10]

The above example shows three different ways of obtaining the log green channel intensity data. Lines 1 and 2 use R’s log2 function on data extracted using the methods we’ve already seen. The third entry uses one of beadarray’s built-in transformation functions. To view an example of how a transformation function is defined you can enter the name of one of beadarray’s pre-defined functions without any parentheses or arguments.

> logGreenChannelTransform
function (BLData, array)
{
    x = getBeadData(BLData, array = array, what = "Grn")
    return(log2.na(x))
}
<bytecode: 0x560013db4808>
<environment: namespace:beadarray>

Z In addition to the logGreenChannelTransform function shown above, beadarray provides predefined functions for extracting the green intensities on the unlogged scale (greenChannelTransform), analogous functions for two-channel data (logRedChannelTransform, redChannelTransform), and functions for computing the log-ratio between channels (logRatioTransform).
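A user-defined transformation function can follow the same pattern as logGreenChannelTransform. The sketch below is one possible example, assuming only the (BLData, array) signature and the getBeadData call shown in the printed definition above; the name floorLogGreenTransform and the choice to floor intensities at 1 are hypothetical and are not part of beadarray.

```r
# Hypothetical user-defined transformation function. It mirrors the
# (BLData, array) signature of logGreenChannelTransform shown above.
floorLogGreenTransform <- function(BLData, array) {
  # Extract the green channel intensities for the requested section.
  x <- beadarray::getBeadData(BLData, array = array, what = "Grn")
  # Floor at 1 so that non-positive intensities map to 0 instead of NaN.
  log2(pmax(x, 1))
}

# With a beadLevelData object called 'data' it could then be supplied
# wherever a transformation function is accepted, for example:
# boxplot(data, transFun = floorLogGreenTransform)
```

Defining the function does not require any data to be loaded; beadarray is only needed once the function is called on a beadLevelData object.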

Analysis of raw data when the images are available

The bead-level expression intensity values that Illumina’s software provides (i.e. those stored in the .txt or .bab files) are the result of a certain amount of preprocessing and so are not strictly the raw data. In most situations, these values are sufficient for our use, but we may on occasion wish to begin from the image file, either to reassure ourselves that there are no concerns or to address a problem that has become manifest.

It is important then that we understand what preprocessing steps Illumina apply by default. These are well-documented elsewhere, but to summarize: a local background value is calculated by taking the mean of the five lowest intensity pixels from a square around the bead. A filter is then applied to the image (to concentrate the intensities in the centre of beads) and foreground values are calculated as a weighted sum of the intensities in a 4 × 4 square around the bead centre. It is worth noting that the filter applied to the image, a sharpening filter, contrasts the value of a pixel with the pixels surrounding it, and as such could itself be viewed as a background correction step. The final intensity for a bead is then calculated as the foreground value minus the local background value.

By starting from the image we can adjust any of these steps (e.g. for the background we can adjust the size of the area around the bead or the function applied to it, for the foreground we can change the filtering step or adjust the pixel weighting scheme, or finally we can use a more sophisticated measure of intensity than ‘foreground minus local background’ to avoid negative values). beadarray includes three functions that closely mirror the processing performed by Illumina: illuminaBackground, illuminaSharpen and illuminaForeground.

We will illustrate a change to the Illumina process that we recommend if beginning from the image.

Use Case: Identify abnormally low intensity pixels and then plot a section of the image that illustrates the benefit of adjusting the standard analysis.

> TIFF <- readTIFF(system.file("extdata/FullData/4613710052_B_Grn.tif",
+     package = "BeadArrayUseCases"))
> cbind(col(TIFF)[which(TIFF == 0)], row(TIFF)[which(TIFF == 0)])
> xcoords <- getBeadData(data, array = 2, what = "GrnX")
> ycoords <- getBeadData(data, array = 2, what = "GrnY")
> par(mfrow = c(1, 3))
> offset <- 1
> plotTIFF(TIFF + offset, c(1517, 1527), c(5507, 5517),
+     values = TRUE, textCol = "yellow", main = expression(log[2](intensity + 1)))
> points(xcoords[503155], ycoords[503155], pch = 16, col = "red")
> plotTIFF(TIFF + offset, c(1202, 1212), c(13576, 13586),
+     values = TRUE, textCol = "yellow", main = expression(log[2](intensity + 1)))
> plotTIFF(TIFF + offset, c(1613, 1623), c(9219, 9229),
+     values = TRUE, textCol = "yellow", main = expression(log[2](intensity + 1)))


2� There are three pixels of value zero in the image that must be an imaging artefact rather than a true measure of intensity for that location (note that we add an offset of 1 to avoid taking the log of zero). These pixels will bias the estimate of background for all nearby beads and so we will try a more robust estimate of the background. Rather than using the mean of the five lowest pixel values, we will use the median (equivalently, the third lowest pixel value). This is done using the medianBackground function.
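The effect of a single zero-valued pixel on the two background estimates can be seen in a toy calculation. The sketch below uses an invented 11 × 11 window of pixel values; the window size, the pixel values and the names window, bgMean and bgMedian are all assumptions for illustration, and the code is not the implementation used by beadarray.

```r
# Toy window of pixel intensities around a bead (values are invented).
window <- matrix(700, nrow = 11, ncol = 11)
window[2, 3]   <- 0     # artefactual zero-valued pixel
window[5, 9]   <- 690
window[8, 2]   <- 692
window[10, 6]  <- 694
window[1, 11]  <- 696

# The five lowest pixel values in the window.
lowestFive <- sort(as.vector(window))[1:5]

# Mean of the five lowest pixels, as in the default scheme described above.
bgMean <- mean(lowestFive)

# Median of the five lowest pixels, i.e. the third lowest value.
bgMedian <- median(lowestFive)

print(c(mean = bgMean, median = bgMedian))
```

In this toy window the zero pixel pulls the mean-based estimate well below the typical background level, whereas the median-based estimate stays close to the remaining low pixel values.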

Use Case: Calculate a robust measure of background for this array and store it in a new slot "GrnRB".

> Brob <- medianBackground(TIFF, cbind(xcoords, ycoords))
> data <- insertBeadData(data, array = 2, what = "GrnRB",
+     Brob)

The medianBackground function takes two arguments, the first of which is the image itself and the second is a two-column matrix specifying the bead-centre coordinates. We then use insertBeadData to add the new values to the existing beadLevelData object.

Because the presence of extremely low intensity beads is a known issue, the authors have provided the medianBackground function within beadarray to perform an alternative local background correction. However, the same methodology can be used to perform the image processing in any way the user sees fit.

Use Case: Calculate foreground values in the normal way, and subtract the median background values to get locally background corrected intensities.

> TIFF2 <- illuminaSharpen(TIFF)
> IllF <- illuminaForeground(TIFF2, cbind(xcoords, ycoords))
> data <- insertBeadData(data, array = 2, what = "GrnF",
+     IllF)
> data <- backgroundCorrectSingleSection(data, array = 2,
+     fg = "GrnF", bg = "GrnRB", newName = "GrnR")

We could have chosen a more sophisticated background correction method (approximately a hundred beads end up with a negative intensity using a crude subtraction) but this suffices to illustrate the point. The majority of the beads have an intensity that barely changes, and the changes in those that do can be attributed to the effect of these problematic pixels.

Use Case: Compare the Illumina intensity with the robust intensity we have just calculated, plot the locations of beads whose expressions change substantially, and overlay the locations of the implausibly low-intensity pixels in red.

> oldG <- getBeadData(data, array = 2, "Grn")
> newG <- getBeadData(data, array = 2, "GrnR")
> summary(oldG - newG)
> par(mfrow = c(1, 2))
> plot(xcoords[(abs(oldG - newG) > 50)], ycoords[(abs(oldG -
+     newG) > 50)], pch = 16, xlab = "X", ylab = "Y", main = "entire array")
> points(col(TIFF)[TIFF < 400], row(TIFF)[TIFF < 400],
+     col = "red", pch = 16)
> plot(xcoords[(abs(oldG - newG) > 50)], ycoords[(abs(oldG -
+     newG) > 50)], pch = 16, xlim = c(1145, 1180), ylim = c(15500,
+     15580), xlab = "X", ylab = "Y", main = "zoomed in")
> points(col(TIFF)[TIFF < 400], row(TIFF)[TIFF < 400],
+     col = "red", pch = 16)

Of course, in practice we would probably save the new intensities in the Grn column of a section in the beadData slot so as not to use more memory than necessary, nor to cause confusion later on.

The image file may also be of value for the purposes of quality control and assessment. In particular, by plotting the recorded co-ordinates of the bead centres over the TIFF image, we can visually check that the image has been correctly registered. There are, however, other approaches to checking this, as will be seen in the next section.

Quality assessment for raw and bead-level data

Boxplots of intensity values

Boxplots are routinely used to assess the dynamic range of each sample and look for unusual signal distributions.

Use Case: Create boxplots of the green channel intensities for all arrays.

> boxplot(data, transFun = logGreenChannelTransform, col = "green",
+     ylab = expression(log[2](intensity)), las = 2, outline = FALSE,
+     main = "HT-12 MAQC data")

The boxplot function is a standard function in R that we have extended to work for the beadLevelData class. Consequently, the standard boxplot arguments, such as those changing the title of the plot, the scale and the axis labels, can be used, some of which are shown in the final five arguments above. The help page for the par function provides information on these and other arguments that can be supplied to boxplot. The only beadarray-specific argument is the second, transFun, which takes a transformation function of the format shown previously. In this case we have selected to use the log2 of the green channel, which is also the default.

2� Most arrays have a similar distribution of intensities with a median value of around 5.7. Array 4616443081_B has a lower median and IQR than other arrays and 4616443081_H has a higher median and IQR. Further quality assessment will focus on these arrays and whether they should be excluded from the analysis.

Visualizing intensities across an array surface

The combination of both an intensity and a location for each bead on the array allows us to visualize how the intensities change across the array surface. This kind of visualization is not possible when using the summarized output, as the summary values are averaged over spatial positions. The imageplot function can be used to create false color representations of the array surface.

Use Case: Produce a graphic with imageplots of all arrays in the dataset, with each image on the same scale.

> par(mfrow = c(6, 2))
> par(mar = c(1, 1, 1, 1))
> for (i in 1:12) {
+     imageplot(data, array = i, high = "darkgreen", low = "lightgreen",
+         zlim = c(4, 10), main = sectionNames(data)[i])
+ }

The imageplot function has a number of arguments; the first two shown above are the object containing the data to be visualized and the index of the desired array. The high and low arguments specify the colors representing the extreme values. The function automatically interpolates the colors for values in between. The use here of zlim ensures that the color range is the same between arrays. Any value that falls outside this range will be shown in the same color as these limits. This is beneficial for making comparisons between arrays, and can prevent minor variations, on otherwise perfectly acceptable arrays, from being exaggerated. By contrast, we need to be cautious that the use of zlim may be detrimental if the same limits are not appropriate for each array, which could lead to some spatial artefacts being overlooked. This is especially the case for our example, since these arrays have not been normalized.
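The effect of fixed limits on values outside the range can be mimicked in base R. The sketch below clamps a few invented log2 intensities to the limits used above; it illustrates the behaviour described in the text and is not taken from the imageplot source code.

```r
# Invented log2 intensities, some of which fall outside the plotting limits.
logIntensity <- c(2.5, 4, 6.3, 9.8, 10, 13.1)
zlim <- c(4, 10)

# Values below the lower limit or above the upper limit are drawn in the
# same colour as the limit itself, which is equivalent to clamping them.
clamped <- pmin(pmax(logIntensity, zlim[1]), zlim[2])
print(clamped)
```

Comparing the proportion of clamped values between arrays is one simple way to judge whether a shared zlim is hiding structure on a particular array.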

2� White patches on the imageplot are due to beads that could not be (or were not) decoded for the end-user by Illumina. As each array is randomly constructed, decoding takes place at Illumina before the BeadChips are supplied, in order to identify the sequence attached to each bead. Beads that could not be decoded are not present in the bead-level text files (nor do they contribute to Illumina’s summary data). Hence their intensities are not available for display on the imageplot. You should notice obvious spatial artefacts on arrays 8 and 12 (4616443081_H and 4616494005_A).

Plotting the location of outliers

Recall that the BeadArray technology includes many replicates (typically ∼20 of each probe type in each sample on an HT-12 array). BeadStudio/GenomeStudio removes outliers greater than 3 median absolute deviations (MADs) from the median prior to calculating summary values.

Use Case: Plot the location of outliers on the arrays with the most obvious spatial artefacts.

> par(mfrow = c(2, 1))
> for (i in c(8, 12)) {
+     outlierplot(data, array = i, main = paste(sectionNames(data)[i],
+         "outliers"))
+ }

By default, the outlierplot function uses Illumina’s ‘three MADs from the median’ rule, but applied to log-transformed data. It then plots the identified outliers by their location on the surface of the array section. Of course, the user can specify alternative rules and transformations and the function is additionally able to accept arguments to plot such as main.
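The ‘three MADs from the median’ rule can be sketched in base R for the replicate beads of a single probe. The intensities below are invented, and the use of R’s mad function with its default scaling constant (1.4826) is an assumption; whether Illumina’s software or outlierplot scale the MAD in exactly this way is not stated here.

```r
# Invented log2 intensities for the replicate beads of one probe.
logInt <- c(7.9, 8.0, 8.1, 8.0, 7.8, 8.2, 8.1, 7.9, 8.0, 11.5, 8.1, 4.2)

# Centre and spread; mad() applies R's default scaling constant, which is
# an assumption rather than a documented property of Illumina's rule.
centre <- median(logInt)
spread <- mad(logInt)

# Flag beads lying more than three MADs from the median.
isOutlier <- abs(logInt - centre) > 3 * spread
print(which(isOutlier))

# Summary value for the probe after removing the flagged beads.
print(mean(logInt[!isOutlier]))
```

Here the two extreme beads are flagged while the remaining replicates, which differ from the median by at most 0.2, are retained.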

2� In this example, the regions of the arrays that appear to be spatial artefacts are also flagged as outliers, which would be removed before creating summarized values for each probe as reported in BeadStudio/GenomeStudio. Blue points represent outliers which are below average, while pink points are outliers above the average (these colors can be set by the user; refer to the outlierplot help page for details). The dotted red lines running vertically indicate the segment boundaries, with each Humanv3 section made up of 9 segments that are physically separated on the section surface. The locations of these segments are taken from an sdf file, or can be specified manually if the sdf file is not available.


Excluding beads affected by spatial artefacts

It should be apparent that some arrays in this dataset have significant artefacts and, although it appears that most beads in these regions are classed as outliers, it would be desirable to mask all beads from these areas from further analysis. Our preferred method for doing this is to use BASH [3], which takes local spatial information into account when determining outliers, and uses replicates within an array to calculate residuals.

BASH performs three types of artefact detection in the style of the Affymetrix-oriented Harshlight [4] package: Compact analysis identifies large clusters of outliers, where each outlying bead must be an immediate neighbour of another outlier; Diffuse analysis finds regions that contain more outliers than would be anticipated by chance; and Extended analysis looks for chip-wide variation, such as a consistent gradient effect.

The output of BASH is a list containing a variety of data, including a list of weights indicating the beads that BASH has identified as problematic. These weights may be saved to the beadLevelData object for future reference by using the setWeights function. The locations of the masked beads can be visualized using the showArrayMask function.

Use Case: Run BASH on the two arrays identified previously to have the most serious spatial artefacts, mask the affected beads and visualize the regions that have been excluded. How many beads does BASH mask on the two arrays?

1 > BASHoutput <- BASH(data, array = c(8, 12))
2 > data <- setWeights(data, wts = BASHoutput$wts, array = c(8,
3 +     12))
4 > head(data[[8]])
5 > par(mfrow = c(1, 2))
6 > for (i in c(8, 12)) {
7 +     showArrayMask(data, array = i, override = TRUE)
8 + }
9 > table(getBeadData(data, what = "wts", array = 8))
10 > table(getBeadData(data, what = "wts", array = 12))
11 > BASHoutput$QC

Line 1 calls the BASH function, with the array argument specifying that it should be run on arrays 8 and 12. If this argument is not specified, the default behaviour is to process every array. As mentioned above, the output from BASH is a list with several entries. The $wts entry is a further list, where each entry is a vector of weights (one value per bead), with 0 indicating that a bead should be masked.

We then use the setWeights function to assign these weights to our beadLevelData object. Line 4 shows how the setWeights function has added an additional column to the specified array in the beadLevelData object.

Next we call showArrayMask, which creates a plot similar to outlierplot mentioned earlier. In addition to displaying the beads classed as outliers, showArrayMask shows in red the beads that are currently masked from further analysis. By default the function refuses to create the plot if over 200,000 beads have been masked, as this can cause considerable slowdown on older computers, so the override argument has been used in this example to force the plot creation. Finally, we can use the getBeadData function to see how many beads were assigned a weight of 0 (i.e. completely masked) on the arrays. The numbers can also be retrieved from the QC information that BASH returns.
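As a rough illustration of the counting step, the sketch below tabulates zero weights in a list shaped like the $wts entry described above. The weights are simulated and the names wts, n_masked and frac_masked are ours; no BASH output is reproduced here.

```r
# Toy weights list mimicking the shape described for BASHoutput$wts:
# one numeric vector per array, with 0 marking a masked bead.
# The values are simulated, not real BASH output.
set.seed(1)
wts <- list(
  array8  = sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.2, 0.8)),
  array12 = sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.05, 0.95))
)

# Number and fraction of masked beads per array.
n_masked    <- vapply(wts, function(w) sum(w == 0), numeric(1))
frac_masked <- n_masked / lengths(wts)
data.frame(n_masked, frac_masked)
```

Reporting the masked fraction alongside the count makes arrays with different numbers of decoded beads easier to compare.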

2� You should see that setWeights has added an extra wts column to the beadLevelData object. The positions of the masked beads are indicated in red in the result of showArrayMask and should agree well with the outlier locations; however, BASH should also have identified beads that straightforward outlier removal may have missed.

Running BASH for this example uses around 2.2 GB of RAM and takes 10 minutes per sample on our computer system. If you are running short on time, you may wish to skip this exercise.

Z The different components of BASH to find compact or diffuse defects plus the extended score analysis can be run separately; see the BASHCompact, BASHDiffuse and BASHExtended functions.

Z Try running the BASHExtended function for some of the other arrays in this dataset, e.g.

BASHExtended(data, array=1)

You should see extended scores of around 0.1.

Removing intensity gradients

The Extended score returned by BASH in the previous use case gives an indication of the level of variability across the entire surface of the chip. If this value is large it may indicate a significant gradient effect in the intensities. The HULK function can be used to smooth out any gradients that are present. HULK uses information about neighbouring beads, but rather than mask them out as in BASH, it adjusts the log-intensities by the weighted average of residual values within a local neighbourhood.

Use Case: Run HULK on the first array, and replace the original intensities with the gradient adjusted values.

> HULKoutput <- HULK(data, array = 1, transFun = logGreenChannelTransform)
> data <- insertBeadData(data, array = 1, data = HULKoutput,
+     what = "GrnHulk")

Similar to BASH, the primary argument to HULK is a beadLevelData object. It also takes a transformation function, allowing the intensity adjustment to be performed on any data stored within the object.
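To give a feel for the kind of adjustment described above, here is a toy base R sketch on simulated beads. It is not HULK's implementation: the neighbourhood radius, the inverse-distance weights and all variable names are assumptions chosen purely for illustration.

```r
# Simulate beads scattered over a unit square with a left-to-right gradient.
set.seed(42)
n     <- 400
x     <- runif(n)
y     <- runif(n)
probe <- sample(1:20, n, replace = TRUE)
log_int <- 8 + probe / 10 + 1.5 * x + rnorm(n, sd = 0.1)

# Residual of each bead from the mean of its probe type.
resid <- log_int - ave(log_int, probe)

# For each bead, take a weighted average of the residuals of the other beads
# within `radius` (inverse-distance weights are an assumption).
radius <- 0.15
d <- as.matrix(dist(cbind(x, y)))
w <- ifelse(d > 0 & d <= radius, 1 / (1 + d), 0)
local_resid <- as.vector(w %*% resid) / pmax(rowSums(w), .Machine$double.eps)

# Subtract the local residual trend from the log-intensities.
adjusted <- log_int - local_resid

# Correlation of the residuals with x before and after adjustment.
grad_before <- cor(resid, x)
grad_after  <- cor(adjusted - ave(adjusted, probe), x)
c(before = grad_before, after = grad_after)
```

In this simulation the association between residuals and position is weaker after adjustment, which is the behaviour one would hope for from a gradient-removal step.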


Typically, we would run BASH followed by HULK on a dataset. However, the order in which one does BASH and HULK could be a topic for research. In cases of severe gradients across the array, you might get a lot of beads masked at one edge of the array. However, if HULK were run first these beads might be saved.

Checking image registration

Tools such as BASH and HULK are of no use if the process of generating the image and finding beads as the array is scanned (known as ‘registration’) fails. We have previously encountered arrays where the position of the beads within the image was not found correctly [5], resulting in the identity of all beads being scrambled. The function checkRegistration can be used to identify chips where such mis-registration may have taken place.

This functionality is only available in beadarray version 2.3.0 or newer.

> registrationScores <- checkRegistration(data, array = c(1,
+     7))
> boxplot(registrationScores, plotP95 = TRUE)

The checkRegistration function takes a beadLevelData object as its primary argument, along with the indices of the array sections that should be checked. The output from checkRegistration is an object of class beadRegistrationData. We have extended the standard boxplot function to accept this class, providing an easy way to visualise the registration scores.

2� The registration scores are generated by looking at the difference between the within-bead-type variance for the given bead IDs and for a randomised set of IDs. If the registration has been successful you expect this value to be greater than zero. Any section where the median registration score is close to zero is of concern. There are two reasons why a median value of zero may be seen: either the array was misregistered or there are no beads present in the image. The plotP95 argument shown above allows these cases to be distinguished.
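The idea behind the score can be sketched in base R on simulated data. The helper names within_var and reg_score, the sign convention (randomised minus given) and the simulated intensities are all assumptions for illustration; checkRegistration itself may compute its scores differently.

```r
set.seed(7)
n_types <- 50
ids     <- rep(seq_len(n_types), each = 20)   # 20 beads per bead type
log_int <- rnorm(n_types, mean = 8, sd = 1.5)[ids] +
  rnorm(length(ids), sd = 0.2)

# Mean within-bead-type variance for a given assignment of IDs to beads.
within_var <- function(values, ids) mean(tapply(values, ids, var))

# Score: within-type variance under shuffled IDs minus that under the given IDs.
reg_score <- function(values, ids) {
  within_var(values, sample(ids)) - within_var(values, ids)
}

score_ok        <- reg_score(log_int, ids)          # correctly registered
score_scrambled <- reg_score(log_int, sample(ids))  # identities scrambled
c(ok = score_ok, scrambled = score_scrambled)
```

With correct IDs, replicate beads agree closely, so shuffling inflates the within-type variance and the score is well above zero; when the IDs are already scrambled, shuffling changes little and the score sits near zero.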

Plotting control probes

Illumina have designed a number of control probes for each BeadArray platform. Two particular controls on expression arrays are housekeeping and biotin controls, which are expected to be highly expressed in any sample. With the poscontPlot function, we can plot the intensities of any ArrayAddressIDs that are annotated as belonging to the Housekeeping or Biotin control groups.

Use Case: Generate plots of the housekeeping and biotin controls for the 6th, 7th, 8th and 12th arrays from this dataset.

> par(mfrow = c(2, 2))
> for (i in c(6, 7, 8, 12)) {
+     poscontPlot(data, array = i, main = paste(sectionNames(data)[i],
+         "Positive Controls"), ylim = c(4, 15))
+ }

[Figure: positive control plots for sections 4616443115_B, 4616443081_B, 4616443081_H and 4616494005_A; each panel shows log2 intensity for the housekeeping and biotin control beads.]

Provided the annotation of the beadLevelData object has been correctly set, the poscontPlot function should be able to identify the relevant probes and intensities. This code generates positive control plots for four arrays in the dataset: one good quality array and three that we have noted problems with.

2� For the good quality array, the control probes are highly expressed as expected. For the array with poor scanner metrics, all the housekeeping beads are lowly expressed but the Biotin controls are highly expressed, indicating successful staining, but unsuccessful hybridization. For the arrays with severe artefacts, 4616443081_H has a large spread of values for the control probes, indicating that much of the array is affected by the artefact. On the other hand, 4616494005_A shows high expression of the control probes, giving hope that this array can be used in further analysis.


Producing control tables

With knowledge of which ArrayAddressIDs match different control types, we can easily provide summaries of these control types on each array. In quickSummary the mean and standard deviation of all control types is calculated for a specified array, using intensities of all beads that correspond to the different control types. Note that these summaries may not correspond to similar quantities reported in Illumina’s BeadStudio/GenomeStudio software, as the summaries are produced after removing outliers. The makeQCTable function extends this functionality to produce a table of summaries for all sections in the beadLevelData object. These data can be stored in the sectionData slot for future reference. It is also informative to compare the expression level of various control types to the background level of the array. This is done by the controlProbeDetection function, which returns the percentage of each control type that is significantly expressed above background level. For positive controls we would prefer this to be near 100% on a good quality array.

Use Case: Summarize the control intensities for the first array, then tabulate the mean and standard deviation of all control probes on every array.

1 > quickSummary(data, array = 1)
2 > qcReport <- makeQCTable(data)
3 > head(qcReport)[, 1:5]

The above code generates a quality control summary for a single array (Line 1), then for all arrays in the beadLevelData object using the makeQCTable function.

2� You should notice that the housekeeping controls are lower for array 7, as we have noted in previous quality assessments.

Saving control tables to a bead-level object

The insertSectionData function allows the user to modify the sectionData slot of a beadLevelData object. We can use this functionality to store any quality control (QC) values that we have computed.

Use Case: Store the computed QC values into the bead-level data object.

> data <- insertSectionData(data, what = "BeadLevelQC",
+     data = qcReport)
> names(data@sectionData)

The insertSectionData function requires a data frame with the same number of rows as the number of sections in the beadLevelData object. The what parameter is used to assign a name to the data in the sectionData slot.

Z You could also save the QC table to disk using the write.csv function (or similar).
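A minimal sketch of saving such a table with write.csv is shown here. The qc_table data frame is a made-up stand-in for qcReport (its column names and values are invented), and the file is written to a temporary directory purely for the purposes of the example.

```r
# Stand-in for the qcReport table: one row per array section.
# Column names and values are invented for illustration.
qc_table <- data.frame(
  Mean.housekeeping = c(14.1, 13.9, 9.2),
  Sd.housekeeping   = c(0.4, 0.5, 1.8),
  row.names = c("section_1", "section_2", "section_3")
)

out_file <- file.path(tempdir(), "qc_table.csv")
write.csv(qc_table, file = out_file)

# Read it back, restoring the section names from the first column.
qc_check <- read.csv(out_file, row.names = 1)
all.equal(qc_table, qc_check)
```

Keeping the section names as row names means the table can be matched back to the arrays after it is reloaded.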

Defining additional control probes

beadarray allows flexibility in the way that control reports are generated. For instance, users are able to define their own control reporters. We have observed that certain probes located on the Y chromosome are beneficial in discriminating the gender of samples in a population. Below we provide an example of how these probes can be incorporated into a QC report. This utilizes the fact that internally, beadarray uses a very simple matrix to assign ArrayAddressIDs to control types.

1 > cprof <- makeControlProfile("Humanv3")
2 > sexprof <- data.frame(ArrayAddress = c("5270068", "1400139",
3 +     "6860102"), Tag = rep("Gender", 3))
4 > cprof <- rbind(cprof, sexprof)
5 > makeQCTable(data, controlProfile = cprof)

Line 1 obtains the control information currently in use for the Humanv3 platform. We then create a new data frame with three rows, each representing a probe from chromosome Y, and row-bind this to the Humanv3 control object. The makeQCTable function is able to accept the new object as an argument, rather than using the default Humanv3 profile.

Generating QC reports

The generation of quality assessment plots for all sections in the beadLevelData object is provided by the expressionQCPipeline function. Results are generated in a directory of the user’s choosing. This report may be generated at any point of the analysis. If the overWrite parameter is set to FALSE, then any existing plots in the directory will not be re-generated. Furthermore, QC tables that have already been stored in the beadLevelData object can be used.

Use Case: Generate a QC report for all arrays.

> expressionQCPipeline(data)

The above code runs the QC pipeline with the default options. However, many aspects of the expressionQCPipeline function are configurable, such as the transformation function applied to the data, or the identities of the controls.

The expressionQCPipeline function will attempt to create graphics and HTML files in the specified directory, so it is important that this directory is writable by the user.

Summarizing the data

The summarization procedure takes the beadLevelData object, where each probe is replicated a varying number of times on each array, and produces a summarized object which is more amenable for making comparisons between arrays. For each array section represented in the beadLevelData object, all observations are extracted, transformed, and then grouped together according to their ArrayAddressID. Outliers are removed and the mean and standard deviation of the remaining beads are calculated.
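The steps just described (extract, transform, group by ArrayAddressID, remove outliers, then summarize) can be mimicked on a toy data frame in base R. Everything here is hypothetical: the bead intensities are simulated, the ArrayAddressIDs are arbitrary, and summarise_probe is our own helper rather than the summarize function from beadarray.

```r
set.seed(3)
# Simulated bead-level data for three bead types on one array section.
bead_data <- data.frame(
  ArrayAddressID = rep(c(10008, 10010, 10014), times = c(18, 22, 20)),
  Grn            = c(rlnorm(18, log(200), 0.1),
                     rlnorm(22, log(1500), 0.1),
                     rlnorm(20, log(90), 0.1))
)
bead_data$Grn[1] <- 60000    # one aberrant bead

# Transform to log2, drop beads more than 3 MADs from the median
# (default mad() scaling, an assumption), then summarize what remains.
summarise_probe <- function(intensity) {
  log_int <- log2(intensity)
  keep <- abs(log_int - median(log_int)) <= 3 * mad(log_int)
  c(mean = mean(log_int[keep]), sd = sd(log_int[keep]), nObs = sum(keep))
}

probe_summary <- t(sapply(split(bead_data$Grn, bead_data$ArrayAddressID),
                          summarise_probe))
probe_summary
```

Each row of probe_summary corresponds to one bead type, with the number of beads retained after outlier removal recorded alongside the mean and standard deviation.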

There are many possible choices for the extraction, transformation and choice of summary statistics, and beadarray allows users to experiment with different options via the definition of an illuminaChannel class. For expression data, the green intensities will be the quantities to be summarised. However, the illuminaChannel class is designed to cope with two-channel data where the green or red (or some combination of the two) may be required with minimal changes to the code. The summarize function is used to produce summary-level data and has the default setting of performing a log2 transformation.

Use Case: Generate summary-level data for this dataset on the log2 and unlogged scale.

1 > datasumm <- summarize(BLData = data)
2 > grnchannel.unlogged <- new("illuminaChannel", transFun = greenChannelTransform,
3 +     outlierFun = illuminaOutlierMethod, exprFun = function(x) mean(x,
4 +         na.rm = TRUE), varFun = function(x) sd(x, na.rm = TRUE),
5 +     channelName = "G")
6 > datasumm.unlogged <- summarize(BLData = data, useSampleFac = FALSE,
7 +     channelList = list(grnchannel.unlogged))

Line 1 uses the default settings of summarize to produce summary-level data. Line 2 shows the full gory details of how to modify how the bead-level data are summarized by the creation of an illuminaChannel object. We use a transformation function that just returns the Grn intensities, Illumina’s default outlier function and modified mean and standard deviation functions that remove NA values. This new channel definition is then passed to summarize. In this call we also explicitly set the useSampleFac argument to FALSE. The useSampleFac parameter should be used in cases where multiple sections for the same biological sample are to be combined, which is not applicable in this case.

2� summarize produces verbose output which firstly gives details on how many sections are to be combined (none in this case) and how many ArrayAddressIDs are to be summarized. Notice that we summarize all arrays in the beadLevelData object, even though we may not use the poor quality arrays in the analysis. This is just for convenience, as each array is summarized independently (so there is no way for the data from a poor quality array to contaminate the data from other arrays) and it is simpler to remove outlier arrays from the summarized objects.

Z For WG-6 arrays, which have two sections per biological sample, BeadStudio combines the two sections together prior to calculating means and standard deviations. It is possible to mimic this behaviour by setting the useSampleFac = TRUE argument in summarize. This either uses information from the sdf file (if present) or the value of the sampleFac argument. However, we recommend summarizing each section separately.

Z If we wanted to have the un-logged and log2 intensities in the same summary object, we could have supplied both channels in a list as an argument to summarize.

datasumm.all <- summarize(data, list(grnchannel, grnchannel.unlogged),

useSampleFac=FALSE)

Z During the summarization process, the numeric ArrayAddressIDs used in the beadLevelData object are converted to the more commonly-used Illumina IDs. However, control probes retain their original ArrayAddressIDs, and any IDs that cannot be converted (e.g. internal controls used by Illumina for which no annotation exists) are removed unless removeUnMappedProbes = FALSE is specified.


The ExpressionSetIllumina class

Summarized data are stored in an object of type ExpressionSetIllumina, which is an extension of the ExpressionSet class developed by the Bioconductor team as a container for data from high-throughput assays.

Objects of this type use a series of slots to store the data. For consistency with the definition of other ExpressionSet objects, we refer to the expression values as the exprs matrix (this stores the probe-specific average intensities), which can be accessed using exprs and subset in the usual manner. The se.exprs matrix, which stores the probe-specific variability, can be accessed using se.exprs, and phenotypic data for the experiment can be accessed using pData. To accommodate the unique features of Illumina data we have added an nObservations slot, which gives the number of beads that we used to create the summary values for each probe on each array after outlier removal. The detection score, or detection p-value, is a standard measure for Illumina expression experiments, and can be viewed as an empirical estimate of the p-value for the null hypothesis that a particular probe is not expressed (for a more complete definition, refer to section 2). These can be calculated for summarized data using the function calculateDetection, provided that the identity of the negative controls on the array is known.

Use Case: Produce boxplots of the summarized data and calculate detection scores.

1 > dim(datasumm)
2 > exprs(datasumm)[1:10, 1:2]
3 > se.exprs(datasumm)[1:10, 1:2]
4 > par(mai = c(1.5, 1, 0.2, 0.1), mfrow = c(1, 2))
5 > boxplot(exprs(datasumm), ylab = expression(log[2](intensity)),
6 +     las = 2, outline = FALSE)
7 > boxplot(nObservations(datasumm), ylab = "number of beads",
8 +     las = 2, outline = FALSE)
9 > det <- calculateDetection(datasumm)
10 > head(det)
11 > Detection(datasumm) <- det

The dim function has been extended to report the key dimensions of the data, namely the number of probes and samples. The expression matrix and associated probe-specific variability are returned by lines 2 and 3 (however, see the note below about the se.exprs function). The calculateDetection function uses the annotation information stored in the ExpressionSetIllumina object to identify the negative controls and do the detection calculation, giving a detection value for each probe on each array.

2� The dimensions should be reported as 49,576 Features (probes) and 12 samples and 1 channel. The boxplot of the expression matrix should agree with your observations from the bead-level data. The boxplot showing the distribution of bead numbers used in the calculation of the summary values for each array can reveal arrays with significant spatial artefacts (arrays 8 and 12 in this case).

Pay attention to the scale on the y-axis. The median level of expression should be somewhere around 5 to 6 with the lowest values around 4. If the median level is around 2 to 3 you may have logged the data twice.


The se.exprs function has been named to be consistent with existing Bioconductor functions. However, it may not always return standard errors as its name suggests. The data returned will depend on the definition of the illuminaChannel class. In the example above, the line

> se.exprs(datasumm)[1:10, 1:2]

returns standard deviations, which must be divided by the square root of the nObservations values to report standard errors.
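The conversion is a single element-wise division. The sketch below uses two small invented matrices, sd_mat and n_mat, as stand-ins for the values returned by se.exprs(datasumm) and nObservations(datasumm); with the real object the same arithmetic would apply to those two matrices.

```r
# Toy matrices standing in for se.exprs(datasumm) and nObservations(datasumm);
# rows are probes, columns are arrays. Values are invented.
sd_mat <- matrix(c(0.20, 0.35, 0.18, 0.40), nrow = 2,
                 dimnames = list(c("probeA", "probeB"), c("array1", "array2")))
n_mat  <- matrix(c(16, 25, 9, 4), nrow = 2, dimnames = dimnames(sd_mat))

# Standard error of the mean = standard deviation / sqrt(number of beads).
se_mat <- sd_mat / sqrt(n_mat)
se_mat
```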

Z calculateDetection assumes that the information about the negative controls is found in a particular part of the ExpressionSetIllumina object, and takes the form of a vector of characters indicating whether each probe in the data is a control or not. This vector can be supplied as the status argument along with an identifier for the negative controls (negativeLabel).
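One simple way to picture an empirical detection p-value is as the fraction of negative controls that are at least as bright as the probe in question. The sketch below follows that reading on simulated values; it is an assumption about the general idea, not a description of how calculateDetection is implemented, and the names neg_controls, probes and detection_p are ours.

```r
set.seed(11)
# Simulated log2 intensities for the negative controls on one array.
neg_controls <- rnorm(750, mean = 6.5, sd = 0.15)
# Three hypothetical probes of interest on the same array.
probes <- c(background_like = 6.5, moderately_high = 7.0, clearly_expressed = 10)

# Fraction of negative controls at least as bright as each probe.
detection_p <- vapply(probes, function(v) mean(neg_controls >= v), numeric(1))
detection_p
```

A probe sitting in the middle of the negative control distribution gets a value near 0.5, while a probe far above every negative control gets a value of 0.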

Concluding remarks

So far we have looked at the kinds of low-level analysis that are possible when you have access to the raw and bead-level data. Once quality assessment is complete and the probe intensities have been summarized, one can continue down the usual analysis path of normalizing between samples, and assessing differential expression using limma or other Bioconductor tools. We will leave this task for now and revisit it later.


2 Analysis of summary data from BeadStudio/GenomeStudio using limma

BeadStudio/GenomeStudio is Illumina’s proprietary software for analyzing data output by the scanning system (BeadScan/iScan). It contains different modules for analyzing data from different platforms. For further information on the software and how to export summarized data, refer to the user’s manual. In this section we consider how to read in and analyze output from the gene expression module of BeadStudio/GenomeStudio.

The example dataset used in this section consists of an experiment with one Human WG-6 version 2 BeadChip. These arrays were hybridized with the control RNA samples used in the MAQC project (3 replicates of UHRR and 3 replicates of Brain Reference RNA). The non-normalized data for regular and control probes was output by BeadStudio/GenomeStudio. The example BeadStudio output used in this section is available in the file AsuragenMAQC_BeadStudioOutput.zip, which can be downloaded from http://www.switchtoi.com/datasets.ilmn. You will need to download and unzip the contents of this file to the current R working directory. Inside this zip file you will find several files including summarized, non-normalized data and a file containing control information. We give a more detailed description of each of the particular files we will make use of below.

• Sample probe profile (AsuragenMAQC-probe-raw.txt) (required) - text file which contains the non-normalized summary values as output by BeadStudio. Inside the file is a data matrix with some 48,000 rows. In newer versions of the software, these data are preceded by several lines of header information. Each row is a different probe in the experiment and the columns give different measurements for the gene. For each array, we record the summarized expression level (AVG_Signal), standard error of the bead replicates (BEAD_STDERR), number of beads (Avg_NBEADS) and a detection p-value (Detection Pval) which estimates the probability of a gene being detected above the background level. When exporting this file from BeadStudio, the user is able to choose which columns to export.

• Control probe profile (AsuragenMAQC-controls.txt) (recommended) - text file which contains the summarized data for each of the controls on each array, which may be useful for diagnostic and calibration purposes. Refer to the Illumina documentation for information on what each control measures.

• Targets file (optional) - text file created by the user specifying which sample is hybridized to each array. No such file is provided for this dataset; however, we can extract sample annotation information from the column headings in the sample probe profile.

Files with normalized intensities (those with avg in the name), as well as files with one intensity value per gene (files with gene in the name) instead of separate intensities for different probes targeting the same transcript, are also available in this download. We recommend users work with the non-normalized probe-specific data in their analysis where possible. Illumina’s background correction step, which subtracts the intensities of the negative control probes from the intensities of the regular probes, should also be avoided.

Use Case: Read the example files using the read.ilmn function. Work out how many different classes of probes are present on these Human arrays.

> library(limma)
> maqc <- read.ilmn(files = "AsuragenMAQC-probe-raw.txt",
+     ctrlfiles = "AsuragenMAQC-controls.txt", probeid = "ProbeID",
+     annotation = "TargetID", other.columns = c("Detection Pval",
+         "Avg_NBEADS"))
> dim(maqc)
> maqc$targets
> maqc$E[1:5, ]
> table(maqc$genes$Status)

The files argument is the only compulsory argument in read.ilmn. The directories where the probe profile and control files are located can be specified using the path and ctrlpath arguments respectively.

2� The intensities are stored in an object of class EListRaw defined in the limma package. The dimensions of this object should be 50,127 rows and 6 columns; in other words, there are 50,127 probes and 6 arrays. Each row in the object has an associated status which determines if the intensities in the row are for a control probe, or a regular probe that we might want to use in an analysis. The maqc$targets data frame indicates the biological samples hybridized to each array.

Preprocessing, quality assessment and filtering

The controls available on the Illumina platform can be used in the analysis to improve inference. As we have already seen in section 1, positive controls can be used to identify suspect arrays. Negative control probes, which measure background signal on each array, can be used to assess the proportion of expressed probes that are present in a given sample [6]. The propexpr function estimates the proportion of expressed probes by comparing the empirical intensity distribution of the negative control probes with that of the regular probes. A mixture model is fitted to the data from each array to infer the intensity distribution of expressed probes and estimate the expressed proportion.

Use Case: Estimate the proportion of probes which are expressed above the level of the negative controls on the MAQC samples using the propexpr function. Do you notice a difference between the expressed proportions in the UHRR and Brain Reference RNA samples?

> proportion <- propexpr(maqc)
> proportion
> t.test(proportion[1:3], proportion[4:6])

Z Systematic differences exist between different BeadChip versions, so these proportions should only be compared within a given platform type [6]. This estimator has a variety of applications. It can be used to distinguish heterogeneous or mixed cell samples from pure samples and to provide a measure of transcriptome size.

2� The UHRR and Brain Reference samples used in this experiment have a similar proportion of expressed probes.

Background correction and normalization

A second use for the negative controls is in the background correction and normalization of the data [7]. The normal-exponential convolution model has proven useful in background correction of both Affymetrix [8] and two-color data [9]. Having a well-behaved set of negative controls simplifies the parameter estimation process for background parameters in this model. Applying this approach to Illumina gene expression data has been shown to offer improved results in terms of bias-variance trade-off and reduced false positives [7]. The neqc function [7] in limma fits such a convolution model to the intensities from each sample, before quantile normalizing and log2 transforming the data to standardize the signal between samples.

Use Case: Apply the neqc function to calibrate the background level, normalize and transform the intensities from each sample. Make boxplots of the regular probes and negative control probes before normalization and the regular probes after normalization and assess whether the neqc procedure improves the consistency between different samples.

> maqc.norm <- neqc(maqc)
> dim(maqc.norm)
> par(mfrow = c(3, 1))
> boxplot(log2(maqc$E[maqc$genes$Status == "regular", ]),
+     range = 0, las = 2, xlab = "", ylab = expression(log[2](intensity)),
+     main = "Regular probes")
> boxplot(log2(maqc$E[maqc$genes$Status == "NEGATIVE", ]),
+     range = 0, las = 2, xlab = "", ylab = expression(log[2](intensity)),
+     main = "Negative control probes")
> boxplot(maqc.norm$E, range = 0, ylab = expression(log[2](intensity)),
+     las = 2, xlab = "", main = "Regular probes, NEQC normalized")

Interpretation: The neqc preprocessed intensities are stored in an EList object in which the control probes have been removed, leaving us with 48,701 regular probes. In this dataset, the intensities are fairly consistent to begin with, so calibration, normalization and transformation with neqc does not dramatically change the intensities on any array.

Pointer: Data exported from BeadStudio/GenomeStudio may already be normalized. However, we recommend analyzing the non-normalized intensities where possible, since these can be normalized in R.

Pointer: Recall that there are 6 samples per WG-6 BeadChip. Boxplots allow within-BeadChip trends, such as intensity gradients from top to bottom of the chip, to be assessed. Differences between BeadChips hybridized at different times may also be expected.

Pointer: The neqc function is also able to accept a matrix as an argument, rather than the limma EListRaw class. Therefore users who might have read the data using lumi or processed bead-level data with beadarray (as per section 1) will still be able to use this processing method. However, the status, negctrl and regular arguments will need to be set appropriately. The beadarray package includes neqc as an option in its normaliseIllumina function.

When applying quantile normalization, it is assumed that the distribution of signal should be the same on each array. This assumption may be unreasonable in some experiments, and should be carefully checked with diagnostic plots.
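To make the assumption behind quantile normalization concrete, the following base R sketch forces every array in a small made-up matrix onto a common reference distribution. It is an illustration only and not the code path used by neqc; the function name quantile_normalize and all values are hypothetical.

```r
# Toy intensity matrix: 6 probes (rows) x 3 arrays (columns); values are made up.
x <- matrix(c(5, 2, 3, 4, 9, 7,
              4, 1, 4, 2, 8, 6,
              3, 4, 6, 8, 12, 10),
            nrow = 6,
            dimnames = list(paste0("probe", 1:6), paste0("array", 1:3)))

# Hypothetical helper, written for this illustration only.
quantile_normalize <- function(m) {
  # Rank of each value within its own array (ties broken by order of appearance).
  ranks <- apply(m, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across arrays.
  reference <- rowMeans(apply(m, 2, sort))
  # Replace every value by the reference value with the same rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(m)
  normalized
}

x.qn <- quantile_normalize(x)
# After normalization every array holds exactly the same set of values.
apply(x.qn, 2, sort)
```

Because all arrays end up with identical distributions, genuine global differences between samples would be removed as well, which is why the diagnostic plots matter.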

Dealing with batch effects

Multidimensional scaling (MDS) assesses sample similarity based on pair-wise distances between samples. This dimension reduction technique uses the top 500 most variable genes between each pair of samples to calculate a matrix of Euclidean distances which are used to generate a 2 dimensional plot. Ideally, samples should separate based on biological variables (RNA source, sex, treatment, etc.), but often technical effects (such as samples processed together on the same BeadChip) may dominate. Principal component analysis (PCA) is another dimension reduction technique frequently applied to microarray data.

Use Case: Generate a multidimensional scaling (MDS) plot of the data using the plotMDS function. Assess whether the samples cluster together by RNA source.

> plotMDS(maqc.norm$E)

Interpretation: In this experiment, the first dimension separates the UHRR samples from the Brain Reference samples. The second dimension separates replicate Brain Reference samples, indicating that these are less consistent than the UHRR samples. The scale for dimension 2 is much reduced compared to dimension 1, indicating that the underlying biological differences between the two RNA sources explain most of the between-sample variation.

Pointer: The plotMDS function accepts a matrix as an argument, so it could be used on the expression matrix of an ExpressionSetIllumina object, extracted using the exprs function.
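The idea behind an MDS plot can be sketched with base R alone. The example below is a simplified stand-in for plotMDS, not its implementation: it uses simulated data, selects the 500 probes with the largest overall variance (rather than choosing genes separately for each pair of samples), and applies classical MDS via cmdscale.

```r
set.seed(1)
# Simulated log2 expression: 1000 probes x 6 samples (3 per group); not real data.
expr <- matrix(rnorm(1000 * 6, mean = 8, sd = 0.3), nrow = 1000,
               dimnames = list(NULL, c(paste0("UHRR", 1:3), paste0("Brain", 1:3))))
# Make the first 100 probes differ strongly between the two groups.
expr[1:100, 4:6] <- expr[1:100, 4:6] + 3

# Keep the 500 probes with the largest variance across all samples.
top <- order(apply(expr, 1, var), decreasing = TRUE)[1:500]
# Euclidean distances between samples, then classical MDS in 2 dimensions.
d <- dist(t(expr[top, ]))
mds <- cmdscale(d, k = 2)
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = colnames(expr))
```

In this simulation the group difference is by far the largest source of variation, so the two groups fall on opposite sides of the first dimension.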

Filtering based on probe annotation

Filtering non-responding probes from further analysis can improve the power to detect differential expression. One way of achieving this is to remove probes whose probe sequence has undesirable properties. Annotation quality information is available from the platform specific annotation packages, which are discussed in more detail later.

The illuminaHumanv2.db annotation package provides access to the reannotation information provided by Barbosa-Morais et al. [10]. In this paper, a scoring system was defined to quantify the reliability of each probe based on its 50 base sequence. These mappings are based on the probe sequence and not the RefSeq ID, as for the standard annotation packages, and can give extra criteria for interpreting the results. For instance, probes with multiple genomic matches, or matches to non-transcribed genomic locations, are likely to be unreliable. This information can be used as a basis for filtering promiscuous or uninformative probes from further analysis, as shown in the use case below.


Four basic annotation quality categories (‘Perfect’, ‘Good’, ‘Bad’ and ‘No match’) are defined and have been shown to correlate with expression level and measures of differential expression. We recommend removing probes assigned a ‘Bad’ or ‘No match’ quality score after normalization. This approach is similar to the common practice of removing lowly-expressed probes, but with the additional benefit of discarding probes with a high expression level caused by non-specific hybridization.

The illuminaHumanv2.db package is an example of a Bioconductor annotation package built using infrastructure within the AnnotationDbi package. More detailed descriptions of how to access data within annotation packages (and how it is stored) are given with the AnnotationDbi package. Essentially, each annotation package comprises a database of mappings between a defined set of microarray identifiers and genomic properties of interest. However, the user of such packages does not need to know the details of the database scheme, as convenient wrapper functions are provided.

Use Case: Retrieve quality information from the Human v2 annotation package and verify that probes annotated as ‘Bad’ or ‘No match’ generally have lower signal. Exclude such probes from further analysis.

> library(illuminaHumanv2.db)
> illuminaHumanv2()
> ids <- as.character(rownames(maqc.norm))
> ids2 <- unlist(mget(ids, revmap(illuminaHumanv2ARRAYADDRESS),
+     ifnotfound = NA))
> qual <- unlist(mget(ids2, illuminaHumanv2PROBEQUALITY,
+     ifnotfound = NA))
> table(qual)
> AveSignal <- rowMeans(maqc.norm$E)
> boxplot(AveSignal ~ qual)
> rem <- qual == "No match" | qual == "Bad"
> maqc.norm.filt <- maqc.norm[!rem, ]
> dim(maqc.norm)
> dim(maqc.norm.filt)

All mappings available in the illuminaHumanv2.db package can be listed by the illuminaHumanv2 function. The default keys for each mapping are Illumina IDs that have the prefix ILMN. As the row names in the maqc.norm object are numeric ArrayAddressIDs, we first have to convert them. This is achieved by use of the revmap function, which reverses the direction of the standard mapping from Illumina ID to ArrayAddressID. The mget function can then be used to query the probe quality for our new IDs and return the result as a list, which we then convert to a vector. rowMeans is a base function to calculate the mean for each row in a matrix. We then make a boxplot of the average signal using probe quality as a factor.
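The reverse lookup can be pictured with an ordinary named vector. The sketch below is a toy analogue in base R, not the AnnotationDbi machinery; the identifiers and addresses are invented for illustration.

```r
# Hypothetical forward mapping: Illumina ID -> ArrayAddressID (values invented).
illumina_to_address <- c(ILMN_0000001 = "10008", ILMN_0000002 = "10010",
                         ILMN_0000003 = "10014")
# Reversing swaps names and values, which is the idea behind revmap().
address_to_illumina <- setNames(names(illumina_to_address), illumina_to_address)

# Look up row names; unknown addresses come back as NA, like ifnotfound = NA.
ids <- c("10010", "99999", "10008")
ids2 <- unname(address_to_illumina[ids])
ids2
```

Keys that are absent from the mapping yield NA rather than an error, so it is worth checking for NA values before using the converted IDs in further queries.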

Interpretation: The output of illuminaHumanv2() lists all mappings available in the package and these are divided into two sections. Firstly there are mappings that use the RefSeq ID assigned to each probe to map to genomic properties, and secondly there are mappings based on the probe sequence itself using the procedure described by Barbosa-Morais et al. [10].

You should see that the expression level of the probes annotated as ‘No match’ or ‘Bad’ is generally lower than in the other categories. After applying this filtering step, we are left with 31,304 probes (out of a possible 48,701 regular probes) for further analysis.

Pointer: The packages used to annotate Illumina BeadArrays have a very simple naming convention, which is illumina followed by an organism name, followed by the annotation version number. Hence, the above code could be executed for a Humanv3 array by simply replacing v2 with v3.

library(illuminaHumanv3.db)

illuminaHumanv3()

###...

ids2 <- unlist(mget(ids, revmap(illuminaHumanv3ARRAYADDRESS), ifnotfound=NA))

##etc....

You may notice a few outliers in the ‘Bad’ category that have consistently high expression. Some strategies for probe filtering would retain these probes in the analysis, so it is worth considering whether they are of value to an analysis.

Use Case: Investigate any IDs that have high expression despite being classed as ‘Bad’.

> queryIDs <- names(which(qual == "Bad" & AveSignal > 12))
> unlist(mget(queryIDs, illuminaHumanv2REPEATMASK))
> unlist(mget(queryIDs, illuminaHumanv2SECONDMATCHES))
> mget("ILMN_1692145", illuminaHumanv2PROBESEQUENCE)

Interpretation: You should see that probes annotated as ‘Bad’ hybridize to multiple places in the genome and often contain repetitive sequences, making their inclusion in the analysis questionable. The probe sequences themselves can be retrieved and subjected to a manual BLAT search (e.g. using the UCSC genome browser). The above example (ILMN_1692145) will return many matches.

Other data visualisation

Although we will concentrate on a differential expression analysis, there are many other common analysis tasks that could be performed once a normalized expression matrix is available.

Use Case: Pick a set of highly-variable probes and cluster the samples.

> IQR <- apply(maqc.norm.filt$E, 1, IQR, na.rm = TRUE)
> topVar <- order(IQR, decreasing = TRUE)[1:500]
> d <- dist(t(maqc.norm.filt$E[topVar, ]))
> plot(hclust(d))

We calculate the interquartile range (IQR) of each probe across all samples and then order to find the 500 (an arbitrary number) most variable. These probes are then used to cluster the data in a standard way.
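The same steps can be tried on simulated data without any Bioconductor objects. The sketch below assumes two groups of three samples and an invented set of group-specific probes; the variable name probeIQR is chosen here so that the result does not share a name with the IQR function.

```r
set.seed(2)
# Simulated log2 expression, 2000 probes x 6 samples; the values are not real data.
expr <- matrix(rnorm(2000 * 6, mean = 8, sd = 0.2), nrow = 2000,
               dimnames = list(paste0("p", 1:2000),
                               c(paste0("UHRR", 1:3), paste0("Brain", 1:3))))
expr[1:200, 4:6] <- expr[1:200, 4:6] + 2   # group-specific probes

# Per-probe IQR across samples, stored under a name that does not mask stats::IQR.
probeIQR <- apply(expr, 1, IQR, na.rm = TRUE)
topVar <- order(probeIQR, decreasing = TRUE)[1:500]

# Hierarchical clustering of the samples on the most variable probes.
hc <- hclust(dist(t(expr[topVar, ])))
groups <- cutree(hc, k = 2)   # cut the dendrogram into two clusters
groups
```

Here cutree assigns the three simulated UHRR samples to one cluster and the three Brain samples to the other, mirroring what the dendrogram shows visually.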

Use Case: Make a heatmap to show the differences between the groups.


> heatmap(maqc.norm.filt$E[topVar, ])

Not all datasets will show such large differences between biological groups!

Differential expression analysis

The differential expression methods available in the limma package can be used to identify differentially expressed genes. The functions lmFit, contrasts.fit and eBayes can be applied to the normalized and filtered data.

Use Case: Fit a linear model to summarize the values from replicate arrays and compare UHRR with Brain Reference by setting up a contrast between these samples. Assess array quality using empirical array weights and incorporate these in the final linear model. Is there strong evidence of differential expression between these samples?

> rna <- factor(rep(c("UHRR", "Brain"), each = 3))
> design <- model.matrix(~0 + rna)
> colnames(design) <- levels(rna)
> aw <- arrayWeights(maqc.norm.filt, design)
> aw
> fit <- lmFit(maqc.norm.filt, design, weights = aw)
> contrasts <- makeContrasts(UHRR - Brain, levels = design)
> contr.fit <- eBayes(contrasts.fit(fit, contrasts))
> topTable(contr.fit, coef = 1)
> par(mfrow = c(1, 2))
> volcanoplot(contr.fit, main = "UHRR - Brain")
> smoothScatter(contr.fit$Amean, contr.fit$coef, xlab = "average intensity",
+     ylab = "log-ratio")
> abline(h = 0, col = 2, lty = 2)

The code above shows how to set up a design matrix for this experiment to combine the data from the UHRR and Brain Reference replicates to give one value per condition. Empirical array quality weights [11] can be used to measure the relative reliability of each array. A variance is estimated for each array by the arrayWeights function, which measures how well the expression values from each array follow the linear model. These variances are converted to relative weights which can then be used in the linear model to down-weight observations from less reliable arrays, which improves power to detect differential expression.

We then define a contrast comparing UHRR to Brain Reference and calculate moderated t-statistics with empirical Bayes shrinkage of the sample variances.

For more information about the lmFit, contrasts.fit and eBayes functions, refer to the limma documentation.
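To see what the design matrix and the contrast mean for a single probe, the following base R sketch fits the same group-means model by ordinary least squares. It uses invented log2 values and leaves out the array weights and the empirical Bayes moderation that limma adds.

```r
# One probe measured on 6 arrays; the log2 values are invented for illustration.
y <- c(10.1, 9.9, 10.0, 7.2, 6.8, 7.0)
rna <- factor(rep(c("UHRR", "Brain"), each = 3))

# Group-means parameterization: one column (coefficient) per RNA source.
design <- model.matrix(~0 + rna)
colnames(design) <- levels(rna)

# Least-squares fit; each coefficient is the mean of its group.
coefs <- lm.fit(design, y)$coefficients
# The contrast UHRR - Brain is a weighted sum of the coefficients.
contrast <- c(UHRR = 1, Brain = -1)
logFC <- sum(contrast * coefs[names(contrast)])
logFC
```

With this parameterization the contrast is simply the difference between the two group means, which is the log-fold-change reported by topTable for the real data.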

Interpretation: The array weights are lowest for the first and second Brain Reference samples, which means the observations from these samples will be down-weighted slightly in the linear model analysis. You will recall that the Brain Reference samples were less consistent than the UHRR samples in the MDS plot (the second dimension separated out different replicate Brain Reference samples). The UHRR and Brain Reference RNA samples are very different and we find many differentially expressed genes between these two conditions in our analysis.


Pointer: The lmFit function is able to accept a matrix as well as a limma object. Hence, users with summarised bead-level data (as created in section 1) can also use this function after extracting the expression matrix using the exprs function. The code would look something like:

fit <- lmFit(exprs(datasumm), design, weights=aw)

Annotation

Annotating the results of a differential expression analysis

The topTable function displays the results of the empirical Bayes analysis alongside the annotation assigned by Illumina to each probe in the linear model fit. Often this will not provide sufficient information to infer biological meaning from the results. Within Bioconductor, annotation packages are available for each of the major Illumina expression array platforms that map the probe sequences designed by Illumina to functional information useful for downstream analysis. As before, the illuminaHumanv2.db package can be used for the arrays in this example dataset.

Use Case: Use the appropriate annotation package to annotate our differential expression analysis. Include the genome position, RefSeq ID, Entrez Gene ID, Gene symbol and Gene name information. Write the results from this analysis out to file.


> library(illuminaHumanv2.db)
> illuminaHumanv2()
> ids <- as.character(contr.fit$genes$ProbeID)
> ids2 <- unlist(mget(ids, revmap(illuminaHumanv2ARRAYADDRESS),
+     ifnotfound = NA))
> chr <- mget(ids2, illuminaHumanv2CHR, ifnotfound = NA)
> cytoband <- mget(ids2, illuminaHumanv2MAP, ifnotfound = NA)
> refseq <- mget(ids2, illuminaHumanv2REFSEQ, ifnotfound = NA)
> entrezid <- mget(ids2, illuminaHumanv2ENTREZID, ifnotfound = NA)
> symbol <- mget(ids2, illuminaHumanv2SYMBOL, ifnotfound = NA)
> genename <- mget(ids2, illuminaHumanv2GENENAME, ifnotfound = NA)
> anno <- data.frame(Ill_ID = ids2, Chr = as.character(chr),
+     Cytoband = as.character(cytoband), RefSeq = as.character(refseq),
+     EntrezID = as.numeric(entrezid), Symbol = as.character(symbol),
+     Name = as.character(genename))
> contr.fit$genes <- anno
> topTable(contr.fit)
> write.fit(contr.fit, file = "maqcresultsv2.txt")

Other useful applications of annotation packages

We now give a few brief example use cases that we have encountered in our own analyses.

Use Case: Retrieve the Illumina Humanv2 IDs that are part of the cell cycle according to GO (GO:0007049) or KEGG (04110).


> cellCycleProbesGO <- mget("GO:0007049", illuminaHumanv2GO2PROBE)
> cellCycleProbesKEGG <- mget("04110", illuminaHumanv2PATH2PROBE)

Use Case: Retrieve the Illumina Humanv2 IDs representing the ERBB2 oncogene. Which probe seems to be most appropriate for analysis?

> queryIDs <- mget("ERBB2", revmap(illuminaHumanv2SYMBOL))
> mget("ERBB2", revmap(illuminaHumanv2SYMBOL))
> mget(unlist(queryIDs), illuminaHumanv2PROBEQUALITY)

The illuminaHumanv2SYMBOL mapping is defined to map Illumina IDs to gene symbols, so if we want to go the opposite way, we have to use the revmap function.

Interpretation: There are three probes on the illuminaHumanv2 platform for ERBB2. All three are classed as Perfect, although only one is located at the 3’ end of the gene.

Use Case: Calculate the GC content of all Humanv2 probes and plot a histogram.

We will illustrate this use case using the Biostrings package. If you do not have this package you can install it in the usual way.


> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("Biostrings")

> require("Biostrings")
> probeseqs <- unlist(as.list(illuminaHumanv2PROBESEQUENCE))
> # Start from NA so that probes without a sequence do not enter the histogram as 0
> GC <- rep(NA_real_, length(probeseqs))
> ss <- BStringSet(probeseqs[which(!is.na(probeseqs))])
> GC[which(!is.na(probeseqs))] <- letterFrequency(ss, letters = "GC")
> hist(GC/50, main = "GC proportion")

This code requires the Bioconductor Biostrings package, which implements efficient string operations. The as.list(illuminaHumanv2PROBESEQUENCE) command is simply a shortcut to return the probe sequence for all mapped keys. We then convert the sequences into a Biostrings class, on which we can use the letterFrequency function to give the desired result.
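For a handful of sequences the same quantity can be computed in base R, which may help to check what letterFrequency is counting. The sequences below are short and invented (real probes are 50 bases long), and this is only a sketch, not a replacement for Biostrings on large sets.

```r
# Three made-up probe sequences plus a missing value.
probeseqs <- c(p1 = "GATTACAGGC", p2 = "CCCCGGGGAT", p3 = NA, p4 = "ATATATATAT")

# Count G and C by deleting every other character and measuring what is left.
gc_count <- nchar(gsub("[^GC]", "", probeseqs))
gc_prop <- gc_count / nchar(probeseqs)
gc_prop[is.na(probeseqs)] <- NA   # keep missing sequences as NA
gc_prop
```

Dividing by the actual sequence length, rather than a fixed 50, keeps the proportion correct if sequences of different lengths are mixed.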

Use Case: Convert the Humanv2 probe locations into a GRanges object.

We will illustrate this use case using the GenomicRanges package. If you do not have this package you can install it in the usual way.


> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("GenomicRanges")

> require("GenomicRanges")
> allLocs <- unlist(as.list(illuminaHumanv2GENOMICLOCATION))
> chrs <- unlist(lapply(allLocs, function(x) strsplit(as.character(x),
+     ":")[[1]][1]))
> spos <- as.numeric(unlist(lapply(allLocs, function(x) strsplit(as.character(x),
+     ":")[[1]][2])))
> epos <- as.numeric(unlist(lapply(allLocs, function(x) strsplit(as.character(x),
+     ":")[[1]][3])))
> strand <- substr(unlist(lapply(allLocs, function(x) strsplit(as.character(x),
+     ":")[[1]][4])), 1, 1)
> validPos <- !is.na(spos)
> Humanv2RD <- GRanges(seqnames = chrs[validPos], ranges = IRanges(start = spos[validPos],
+     end = epos[validPos]), names = names(allLocs)[validPos],
+     strand = strand[validPos])

Again we retrieve all genomic locations using the unlist(as.list(...)) trick. The location of each probe is encoded as a string with the chromosome, start, end and strand separated by a : character. We can use the base strsplit function to decompose this string within an lapply to process all locations at once. The GRanges object can be created now, although we add a check to make sure that no NA values are passed.
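The string handling can be tried in isolation on a few invented location strings. This base R sketch splits each string once and collects the pieces in a data frame; the identifiers and coordinates are hypothetical, and the chromosome:start:end:strand layout is assumed from the description above.

```r
# Hypothetical location strings in the chromosome:start:end:strand layout.
allLocs <- c(ILMN_A = "chr8:29000000:29000049:+",
             ILMN_B = "chr1:1500:1549:-",
             ILMN_C = NA)

# Split every string once, then bind the pieces into a 4-column matrix.
parts <- strsplit(as.character(allLocs), ":", fixed = TRUE)
parts <- t(vapply(parts, function(p) p[1:4], character(4)))

locs <- data.frame(id = names(allLocs), chr = parts[, 1],
                   start = as.numeric(parts[, 2]), end = as.numeric(parts[, 3]),
                   strand = parts[, 4], stringsAsFactors = FALSE)
locs <- locs[!is.na(locs$start), ]   # drop probes without a mapped location
locs
```

Splitting once per probe avoids repeating strsplit for every field, which can matter when tens of thousands of probes are processed.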

Use Case: Find all Humanv2 probes located on chromosome 8 between bases 28,800,001 and 36,500,000. What gene symbols do they target?

> query <- IRanges(start = 28800001, end = 36500000)
> olaps <- findOverlaps(Humanv2RD, GRanges(ranges = query,
+     seqnames = "chr8"))
> matchingProbes <- as.matrix(olaps)[, 1]
> Humanv2RD[matchingProbes, ]
> Humanv2RD$names[matchingProbes]
> unlist(mget(Humanv2RD$names[matchingProbes], illuminaHumanv2SYMBOL))

This example code uses functions from the IRanges package, so users wanting a deeper understanding of the commands should consult the documentation for that package, and the appropriate help files. Essentially, we define a query region using the desired chromosome, start and end values. The findOverlaps function will then find regions on chromosome 8 of the Humanv2RD object that overlap with the query. The indices of the matching probes are given in the first column of the matrix returned by as.matrix(olaps), which can be used to subset the Humanv2RD object.
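The overlap test itself is a simple comparison of interval end points. The following base R sketch applies it to a small invented table of probe locations; it illustrates the logic only and does not reproduce how findOverlaps is implemented.

```r
# Hypothetical probe locations (values invented for illustration).
probes <- data.frame(id = c("ILMN_A", "ILMN_B", "ILMN_C", "ILMN_D"),
                     chr = c("chr8", "chr8", "chr1", "chr8"),
                     start = c(28799990, 30000000, 30000000, 36500001),
                     end = c(28800039, 30000049, 30000049, 36500050),
                     stringsAsFactors = FALSE)
q_chr <- "chr8"; q_start <- 28800001; q_end <- 36500000

# Two closed intervals overlap when each starts before the other one ends.
hits <- probes$chr == q_chr & probes$start <= q_end & probes$end >= q_start
probes$id[hits]
```

Note that a probe which only partly extends into the region, such as the first one here, still counts as overlapping.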

What next?

The results of a differential expression analysis are often not the end-point of an analysis, and there is an increasing desire to relate findings to biological function. Although this is beyond the scope of this vignette, we will give a brief example of how a Gene Ontology analysis could be performed within Bioconductor. There is a wide range of online tools that can perform similar analyses, of which DAVID and Genetrail seem to be the most popular.

Use Case: Using GOstats, find over-represented GO terms amongst the probes that show evidence for differential expression.

We will illustrate this use case using the GOstats package. If you do not have this package you can install it in the usual way.


> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("GOstats")

> require("GOstats")
> universeIds <- anno$EntrezID
> dTests <- decideTests(contr.fit)
> selectedEntrezIds <- anno$EntrezID[dTests == 1]
> params <- new("GOHyperGParams", geneIds = selectedEntrezIds,
+     universeGeneIds = universeIds, annotation = "illuminaHumanv2",
+     ontology = "BP", pvalueCutoff = 0.05, conditional = FALSE,
+     testDirection = "over")
> hgOver <- hyperGTest(params)
> summary(hgOver)[1:10, ]

The decideTests function is useful for selecting probes that show evidence for differential expression after correcting for multiple testing. We also have to define a universe of all possible Entrez IDs in the dataset, and Entrez IDs that are significant. The GOstats package will then map the Entrez IDs to GO terms (for both universe and significant probes) and use a hypergeometric test to see if any GO terms appear more often in the significant list than we would expect by chance. See the vignette of the GOstats package for more details.
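The hypergeometric test for a single GO term can be written out with the base phyper function. The counts below are invented for illustration, and the sketch covers only the basic over-representation calculation for one term, not the mapping of genes to terms that GOstats performs.

```r
# Invented counts for one GO term.
universe_size <- 10000   # genes in the universe
term_size <- 200         # universe genes annotated to the term
selected_size <- 500     # genes called differentially expressed
overlap <- 25            # selected genes annotated to the term

# P(X >= overlap) when drawing selected_size genes without replacement.
p_over <- phyper(overlap - 1, term_size, universe_size - term_size,
                 selected_size, lower.tail = FALSE)
# Number of annotated genes expected in the selection by chance.
expected <- selected_size * term_size / universe_size
c(expected = expected, p.value = p_over)
```

The subtraction of 1 from the overlap matters: with lower.tail = FALSE, phyper returns the probability of strictly more than its first argument, so overlap - 1 gives the probability of observing at least the overlap.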


3 Analysis of public data using GEOquery

In this section we show how to retrieve an Illumina MAQC dataset (Human WG-6 version 1) from the Gene Expression Omnibus database (GEO) using the GEOquery package [12]. The GEO database is a public repository supporting MIAME-compliant data submissions of microarray and sequence-based experiments.

Use Case: Read in the MAQC submitted data [1] from the Illumina platform using the getGEO function. Extract information about the site each sample was prepared at, and the RNA sample hybridized. Make a boxplot of the intensities, color-coded by site to look for systematic differences between labs.

> library(GEOquery)
> library(limma)
> library(illuminaHumanv1.db)
> gse <- getGEO(GEO = "GSE5350")[["GSE5350-GPL2507_series_matrix.txt.gz"]]
> dim(gse)
> exprs(gse)[1:5, 1:2]
> samples <- as.character(pData(gse)[, "title"])
> sites <- as.numeric(substr(samples, 10, 10))
> shortlabels <- substr(samples, 12, 13)
> rnasource <- pData(gse)[, "source_name_ch1"]
> levels(rnasource) <- c("UHRR", "Brain", "UHRR75", "UHRR25")
> boxplot(log2(exprs(gse)), col = sites + 1, names = shortlabels,
+     las = 2, cex.names = 0.5, ylab = expression(log[2](intensity)),
+     outline = FALSE, ylim = c(3, 10), main = "Before batch correction")

Pointer: Data deposited in the GEO database may be either raw or normalized. This preprocessing information is annotated in the database entry, and is accessible by typing pData(gse)$data_processing in this example. Boxplots of the data can also be used to see whether normalization has taken place.

Interpretation: This dataset consists of 59 arrays, each of which contains 47,293 probes. Samples were processed and hybridized in 3 different labs (see the sites vector), and subject to cubic spline normalization, which appears to have been applied separately to the data from each site.

Pointer: Microarray data deposited in ArrayExpress, the other major public database of high-throughput genomics experiments, can be imported into R using the ArrayExpress package [13].

If the above code does not work due to a network error, the same dataset, albeit with fewer replicates, may be obtained from an existing Bioconductor package named MAQCsubsetILM.


> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("MAQCsubsetILM")
> library(MAQCsubsetILM)
> data(refA)
> data(refB)
> data(refC)
> data(refD)
> gse <- combine(refA, refB, refC, refD)
> sites <- pData(gse)[, 2]
> shortlabels <- substr(sampleNames(gse), 7, 8)
> rnasource <- pData(gse)[, 3]
> levels(rnasource) <- c("UHRR", "Brain", "UHRR75", "UHRR25")
> boxplot(log2(exprs(gse)), col = sites + 1, names = shortlabels,
+     las = 2, cex.names = 0.5, ylab = expression(log[2](intensity)),
+     outline = FALSE, ylim = c(3, 10), main = "Before batch correction")

Use Case: Remove any differences between labs using the removeBatchEffect function and make a boxplot and MDS plot of the corrected data to assess the effectiveness of this step. Next filter out probes with poor annotation and perform a differential expression analysis between the UHRR and Brain Reference samples. Annotate the results with the same information obtained in section 2 (genome position, RefSeq ID, Entrez Gene ID, Gene symbol and Gene name) with an appropriate annotation package. Write the results out to file.

> gse.batchcorrect <- removeBatchEffect(log2(exprs(gse)),
+     batch = sites)
> par(mfrow = c(1, 2), oma = c(1, 0.5, 0.2, 0.1))
> boxplot(gse.batchcorrect, col = sites + 1, names = shortlabels,
+     las = 2, cex.names = 0.5, ylab = expression(log[2](intensity)),
+     outline = FALSE, ylim = c(3, 10), main = "After batch correction")
> plotMDS(gse.batchcorrect, labels = shortlabels, col = sites +
+     1, main = "MDS plot")
> ids3 <- featureNames(gse)
> qual2 <- unlist(mget(ids3, illuminaHumanv1PROBEQUALITY,
+     ifnotfound = NA))
> table(qual2)
> rem2 <- qual2 == "No match" | qual2 == "Bad" | is.na(qual2)
> gse.batchcorrect.filt <- gse.batchcorrect[!rem2, ]
> dim(gse.batchcorrect)
> dim(gse.batchcorrect.filt)
> design2 <- model.matrix(~0 + rnasource)
> colnames(design2) <- levels(rnasource)
> aw2 <- arrayWeights(gse.batchcorrect.filt, design2)
> fit2 <- lmFit(gse.batchcorrect.filt, design2, weights = aw2)
> contrasts2 <- makeContrasts(UHRR - Brain, levels = design2)
> contr.fit2 <- eBayes(contrasts.fit(fit2, contrasts2))
> topTable(contr.fit2, coef = 1)
> volcanoplot(contr.fit2, main = "UHRR - Brain")
> ids4 <- rownames(gse.batchcorrect.filt)
> chr2 <- mget(ids4, illuminaHumanv1CHR, ifnotfound = NA)
> chrloc2 <- mget(ids4, illuminaHumanv1CHRLOC, ifnotfound = NA)
> refseq2 <- mget(ids4, illuminaHumanv1REFSEQ, ifnotfound = NA)
> entrezid2 <- mget(ids4, illuminaHumanv1ENTREZID, ifnotfound = NA)
> symbols2 <- mget(ids4, illuminaHumanv1SYMBOL, ifnotfound = NA)
> genename2 <- mget(ids4, illuminaHumanv1GENENAME, ifnotfound = NA)
> anno2 <- data.frame(Ill_ID = ids4, Chr = as.character(chr2),
+     Loc = as.character(chrloc2), RefSeq = as.character(refseq2),
+     Symbol = as.character(symbols2), Name = as.character(genename2),
+     EntrezID = as.numeric(entrezid2))
> contr.fit2$genes <- anno2
> write.fit(contr.fit2, file = "maqcresultsv1.txt")

Interpretation: The batch correction procedure equalizes the mean expression level from each lab, which removes the differences that were evident in the earlier boxplot. The MDS plot shows that samples separate by RNA source, with the first dimension separating the pure samples (UHRR (A) and Brain Reference (B)) from each other, and the second dimension separating the pure samples (A and B) from the mixture samples (75% UHRR:25% Brain Reference (C) and 25% UHRR:75% Brain Reference (D)). After filtering out the probes with poor annotation, we are left with 25,797 probes. As we saw in the Asuragen MAQC dataset, there is a great deal of differential expression between the UHRR and Brain Reference samples. Having a greater number of replicate samples (15 each for UHRR and Brain Reference) leads to greater statistical significance for this comparison relative to the Asuragen dataset, which had just 3 replicates for each RNA source.
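How equalizing the mean level per lab works can be illustrated with a simplified base R sketch on simulated data. It centers each probe within each site and restores the overall probe mean. This is a stand-in for the idea, assuming a balanced design with no covariates to protect, and is not the linear-model code that removeBatchEffect runs.

```r
set.seed(3)
# Simulated log2 data: 50 probes x 6 arrays from 3 sites (2 arrays per site).
sites <- rep(1:3, each = 2)
expr <- matrix(rnorm(50 * 6, mean = 8, sd = 0.1), nrow = 50)
expr <- sweep(expr, 2, c(0, 0.5, -0.5)[sites], "+")   # add a site-specific shift

# For each probe, subtract the site mean and add back the overall mean.
site_means <- t(apply(expr, 1, function(y) ave(y, sites)))
corrected <- expr - site_means + rowMeans(expr)

# Average intensity per site before and after the correction.
rbind(before = tapply(colMeans(expr), sites, mean),
      after = tapply(colMeans(corrected), sites, mean))
```

If the biological groups were unevenly distributed across sites, a correction of this kind would also remove part of the biological signal, which is one reason to inspect the MDS plot after correcting.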

Combining data from different BeadChip versions

In this section we illustrate how to combine data from different BeadChip versions from the same species, using the GEO MAQC dataset (Human version 1) and the Asuragen MAQC dataset (Human version 2). The approach taken in this example can be used to combine data from different microarray platforms as well.

Use Case: Use Entrez Gene identifiers to match genes from different BeadChip versions. Make a data frame which includes fold-changes between UHRR and Brain Reference samples for both datasets for the genes which could be matched between platforms and plot the log-fold-changes. Does expression agree between platforms?

> z <- contr.fit[!is.na(contr.fit$genes$EntrezID), ]
> z <- z[order(z$genes$EntrezID), ]
> f <- factor(z$genes$EntrezID)
> sel.unique <- tapply(z$Amean, f, function(x) x == max(x))
> sel.unique <- unlist(sel.unique)
> contr.fit.unique <- z[sel.unique, ]
> z <- contr.fit2[!is.na(contr.fit2$genes$EntrezID), ]
> z <- z[order(z$genes$EntrezID), ]
> f <- factor(z$genes$EntrezID)
> sel.unique <- tapply(z$Amean, f, function(x) x == max(x))
> sel.unique <- unlist(sel.unique)
> contr.fit2.unique <- z[sel.unique, ]
> m <- match(contr.fit.unique$genes$EntrezID, contr.fit2.unique$genes$EntrezID)
> contr.fit.common <- contr.fit.unique[!is.na(m), ]
> contr.fit2.common <- contr.fit2.unique[m[!is.na(m)], ]
> lfc <- data.frame(lfc_version1 = contr.fit2.common$coef[, 1],
+     lfc_version2 = contr.fit.common$coef[, 1])
> dim(lfc)
> options(digits = 2)
> lfc[1:10, ]
> plot(lfc[, 1], lfc[, 2], xlab = "version 1", ylab = "version 2")
> abline(0, 1, col = 2)


Interpretation: Even though the probes were redesigned between versions 1 and 2 of the Human gene expression BeadChip, we see good concordance between the log-ratios from each dataset.

Pointer: A representative probe was selected for each gene on each platform before the two datasets could be combined. In this example, the probe with the largest average intensity among all the probes belonging to the same gene was selected from each version.
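The selection and matching logic can be followed on two tiny invented tables. The sketch below uses plain data frames instead of limma fit objects, and a hypothetical helper pick_top, so it shows the idea rather than the exact code used in the use case.

```r
# Invented per-probe summaries for two platforms.
v1 <- data.frame(EntrezID = c("10", "10", "20", "30"),
                 Amean = c(7.1, 9.4, 8.0, 6.5),
                 logFC = c(0.8, 1.1, -2.0, 0.3), stringsAsFactors = FALSE)
v2 <- data.frame(EntrezID = c("20", "20", "10", "40"),
                 Amean = c(8.2, 7.9, 9.0, 10.0),
                 logFC = c(-1.8, -1.2, 1.0, 0.1), stringsAsFactors = FALSE)

# Hypothetical helper: keep, for every gene, the probe with the largest average intensity.
pick_top <- function(d) d[d$Amean == ave(d$Amean, d$EntrezID, FUN = max), ]
v1.unique <- pick_top(v1)
v2.unique <- pick_top(v2)

# Align the two platforms on the genes they share.
m <- match(v1.unique$EntrezID, v2.unique$EntrezID)
lfc <- data.frame(EntrezID = v1.unique$EntrezID[!is.na(m)],
                  lfc_version1 = v1.unique$logFC[!is.na(m)],
                  lfc_version2 = v2.unique$logFC[m[!is.na(m)]])
lfc
```

Genes present on only one platform (30 and 40 in this toy example) drop out at the matching step, so the number of rows in the result is the number of genes shared between platforms.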

Use Case: (Advanced) Compare the version 1 and 2 results with those obtained from the Human HT-12 version 3 dataset analyzed earlier in this tutorial. You will first need to between-array normalize the HT-12 data, and may find it useful to filter poorly annotated probes and down-weight observations for less reliable arrays identified during the quality assessment process. Differential expression can be assessed using linear modelling and contrasts as previously shown. You will then need to annotate the probes using Entrez Gene IDs. How many probes can be matched between all 3 platforms? How well does the version 3 data agree with data from earlier BeadChip versions?

Pointer: Example code for this final use case has not been provided. The reader should be able to complete this exercise independently using knowledge gained from the previous examples given in the vignette.

There are many other analysis tools available from R or Bioconductor that could be used for downstream analysis on these data. For further details, see the relevant vignettes or help pages.

The version of R and the packages used to complete this tutorial are listed below. If you have further questions about using any of the Bioconductor packages used in this tutorial, please email the Bioconductor mailing list ([email protected]).

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] stats4 parallel stats graphics grDevices utils

[7] datasets methods base

other attached packages:

[1] limma_3.42.0 GenomicRanges_1.38.0


[3] GenomeInfoDb_1.22.0 Biostrings_2.54.0

[5] XVector_0.26.0 illuminaHumanv3.db_1.26.0

[7] illuminaHumanv1.db_1.26.0 illuminaHumanv2.db_1.26.0

[9] org.Hs.eg.db_3.10.0 AnnotationDbi_1.48.0

[11] IRanges_2.20.0 S4Vectors_0.24.0

[13] GEOquery_2.54.0 beadarray_2.36.0

[15] hexbin_1.27.3 Biobase_2.46.0

[17] BiocGenerics_0.32.0

loaded via a namespace (and not attached):

[1] base64_2.0 Rcpp_1.0.2

[3] lattice_0.20-38 tidyr_1.0.0

[5] assertthat_0.2.1 zeallot_0.1.0

[7] digest_0.6.22 R6_2.4.0

[9] plyr_1.8.4 backports_1.1.5

[11] RSQLite_2.1.2 ggplot2_3.2.1

[13] pillar_1.4.2 zlibbioc_1.32.0

[15] rlang_0.4.1 lazyeval_0.2.2

[17] curl_4.2 blob_1.2.0

[19] readr_1.3.1 stringr_1.4.0

[21] RCurl_1.95-4.12 bit_1.1-14

[23] munsell_0.5.0 compiler_3.6.1

[25] pkgconfig_2.0.3 askpass_1.1

[27] openssl_1.4.1 tidyselect_0.2.5

[29] tibble_2.1.3 GenomeInfoDbData_1.2.2

[31] crayon_1.3.4 dplyr_0.8.3

[33] bitops_1.0-6 grid_3.6.1

[35] gtable_0.3.0 lifecycle_0.1.0

[37] DBI_1.0.0 magrittr_1.5

[39] scales_1.0.0 stringi_1.4.3

[41] reshape2_1.4.3 xml2_1.2.2

[43] ellipsis_0.3.0 vctrs_0.2.0

[45] BeadDataPackR_1.38.0 tools_3.6.1

[47] illuminaio_0.28.0 bit64_0.9-7

[49] glue_1.3.1 purrr_0.3.3

[51] hms_0.5.2 colorspace_1.4-1

[53] memoise_1.1.0

Acknowledgements

We thank James Hadfield and Michelle Osbourne for generating the HT-12 data used in the first section and the attendees of the various courses we have conducted on this topic for their feedback, which has helped improve this document.

References

[1] MAQC Consortium. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol, 24(9):1151–61, September 2006.

[2] M. L. Smith and A. G. Lynch. BeadDataPackR: A Tool to Facilitate the Sharing of Raw Data from Illumina BeadArray Studies. Cancer Inform, 9:217–27, 2010.

[3] J. M. Cairns, M. J. Dunning, M. E. Ritchie, R. Russell, and A. G. Lynch. BASH: a tool for managing BeadArray spatial artefacts. Bioinformatics, 24(24):2921–2, 2008.

[4] M. Suarez-Farinas, A. Haider, and K. Wittkowski. ‘harshlighting’ small blemishes on microarrays. BMC Bioinformatics, 6(1):65, 2005.

[5] M. L. Smith, M. J. Dunning, S. Tavare, and A. G. Lynch. Identification and correction of previously unreported spatial phenomena using raw Illumina BeadArray data. BMC Bioinformatics, 11:208, 2010.

[6] W. Shi, C. A. de Graaf, S. A. Kinkel, A. H. Achtman, T. Baldwin, L. Schofield, H. S. Scott, D. J. Hilton, and G. K. Smyth. Estimating the proportion of microarray probes expressed in an RNA sample. Nucleic Acids Res, 38(7):2168–76, 2010.

[7] W. Shi, A. Oshlack, and G. K. Smyth. Optimizing the noise versus bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic Acids Res, 38(22):e204, 2010.

[8] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–64, 2003.

[9] M. E. Ritchie, J. Silver, A. Oshlack, M. Holmes, D. Diyagama, A. Holloway, and G. K. Smyth. A comparison of background correction methods for two-colour microarrays. Bioinformatics, 23(20):2700–7, 2007.

[10] N. L. Barbosa-Morais, M. J. Dunning, S. A. Samarajiwa, J. F. Darot, M. E. Ritchie, A. G. Lynch, and S. Tavare. A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data. Nucleic Acids Res, 38(3):e17, 2010.

[11] M. E. Ritchie, D. Diyagama, J. Neilson, R. van Laar, A. Dobrovic, A. Holloway, and G. K. Smyth. Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics, 19(7):261, 2006.

[12] S. Davis and P. S. Meltzer. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and Bioconductor. Bioinformatics, 23(14):1846–7, 2007.

[13] A. Kauffmann, T. F. Rayner, H. Parkinson, M. Kapushesky, M. Lukk, A. Brazma, and W. Huber. Importing ArrayExpress datasets into R/Bioconductor. Bioinformatics, 25(16):2092–4, 2009.
