University of Groningen Genetical genomics with Affymetrix gene expression arrays Alberts, Rudi IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record Publication date: 2007 Link to publication in University of Groningen/UMCG research database Citation for published version (APA): Alberts, R. (2007). Genetical genomics with Affymetrix gene expression arrays. Groningen: s.n. Copyright Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons). Take-down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum. Download date: 15-04-2020



Chapter 1: General introduction


Abstract

This chapter introduces the basic concepts needed to understand the rest of the thesis: complex traits, genetics (mouse crosses), transcriptomics (measuring gene expression using Affymetrix microarrays), and data pre-processing (normalization of data, some pitfalls).


Complex trait analysis is a challenging field in biological research. A trait (characteristic) of an individual develops under the influence of both its genes and environmental factors. A complex trait is a trait that has multiple determinants, which may be genetic, environmental, or both. Many diseases, such as obesity, hypertension and cancer, are complex traits. In complex trait analysis, one aims to reveal the determinants underlying the trait or, more loosely speaking, to find how many genes encode the trait, where the genes are located, what their effects are, and how they interact with each other and are affected by the environment.

DNA contains the genetic information in living cells. Genes are segments of DNA sequence, often encoding proteins. DNA (deoxyribonucleic acid) is made from nucleotides, each consisting of a sugar-phosphate molecule and a base. There are four types of bases: adenine, guanine, cytosine and thymine, corresponding to four nucleotides labeled A, G, C and T. A normal DNA molecule consists of two complementary strands of nucleotides. T in one strand pairs with A in the other, and G in one strand pairs with C in the other. The two strands twist around each other to form a double helix. Figure 1.1 shows what is called the central dogma in biology: how DNA is transcribed into RNA (ribonucleic acid), which subsequently is translated into protein. The following description is valid for eukaryotes (animals, plants, fungi and protists). In the first step, RNA transcription, the enzyme RNA polymerase makes a copy of the gene from the DNA into mRNA (messenger RNA). The backbone of RNA is formed of a slightly different sugar from that of DNA (ribose instead of deoxyribose) and one of the four bases is slightly different: uracil (U) instead of thymine (T). Next, the mRNA molecule must be further modified before it can be used. A so-called cap is added at the beginning of the molecule and a poly-A tail is added at the end. Regions that do not contain genetic code, called introns, are removed to finally form the mature mRNA. The mature mRNA leaves the nucleus and is translated into protein at the ribosomes. Proteins are built from amino acids. There are 20 types of amino acids. The translation of the genetic information from a 4-letter alphabet into the 20-letter alphabet of proteins is a complex process. The information in the mRNA sequence is used in groups of 3 nucleotides at a time, called codons. Almost every codon represents an amino acid (a few codons instead signal the end of translation), and an amino acid can be represented by multiple codons, since there are 4*4*4=64 codons and only 20 amino acids. Determined by the order of the nucleotides in the mRNA, amino acids are bound together during protein translation to form proteins. In prokaryotes (bacteria and archaea) the situation is simpler. The main differences with eukaryotes are that prokaryotic cells have no nucleus or other complex cell structures, contain only a single loop of DNA, and have much more compact DNA, since they lack introns and large intergenic regions.
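The codon arithmetic in the paragraph above can be checked with a small sketch. It is an illustration only and not part of the thesis: the table string is the standard genetic code written in T, C, A, G order, and the names `codon_table` and `translate` are hypothetical helpers introduced here.

```python
from itertools import product

BASES = "TCAG"  # DNA alphabet; in mRNA, T is replaced by U
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Enumerate all 4*4*4 codons and pair each with its amino acid (or stop signal).
codons = ["".join(c) for c in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

n_codons = len(codon_table)
stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
n_amino_acids = len(set(AMINO_ACIDS) - {"*"})


def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = codon_table[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Because 64 codons map onto 20 amino acids plus the stop signal, most amino acids are necessarily encoded by more than one codon.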

Proteins are essential parts of all living organisms. Many proteins are enzymes that catalyze biochemical reactions. Other proteins have mechanical or structural functions. Proteins are also important in cell signaling and immune responses. They play an important role in the development of complex traits. Biological systems can be studied at any level in Figure 1.1. This thesis focuses on the amounts of mRNA present.

Figure 1.1: In the nucleus, DNA is transcribed into primary mRNA and further processed into mature mRNA. At the ribosomes, mature mRNA is translated into protein. Figure copied from http://www.hbcprotocols.com.

A common way of studying mRNA levels of genes is to consider one gene at a time, e.g. by creating so-called knockout individuals in which the gene has been made inactive. Differences between the normal individuals and the knockout individuals might reveal information about the function of the gene. With microarray technology, the amount of RNA can be measured for thousands of genes simultaneously. To understand the processes behind the development of complex traits, it can be very informative to look at the differences in gene activity between normal and knockout individuals.

A more powerful approach is to multifactorially perturb the biological system. This is done in genetical genomics (Jansen 2003).


1.1 Genetic dissection of complex traits

Using microarray technology, how can we reveal the genetic basis of complex traits? A powerful method for doing this is genetical genomics, which was introduced by Jansen and Nap in 2001 (Jansen and Nap 2001). Genetical genomics is basically the application of classical QTL mapping to gene expression data. In QTL (quantitative trait locus) mapping, a quantitative trait, such as height, is measured in genetically related individuals. Often, recombinant inbred lines (RILs) are used. RILs are created by starting with the F1 of two homozygous parents and inbreeding the offspring for six or more successive generations. This produces (almost) homozygous RILs whose genomes are a mixture of the two parental genomes (Figure 1.2). To determine which part of the genome in the RILs is inherited from which parental line, molecular markers are used. These are specific places in the DNA where the sequence differs between the two parental lines and that can be identified by a simple assay. By comparing the quantitative trait values of each of the RILs with their molecular markers along the genome, genomic regions (QTLs) are found where markers and trait values correlate. The QTL regions probably contain gene(s) that regulate the quantitative trait. In genetical genomics, gene expression is taken as the quantitative trait. Hence, for each gene, regulatory regions or even regulating genes can be found. QTLs for gene expression traits are also called expression QTLs (eQTLs). Genetical genomics makes use of the naturally occurring genetic variation between individuals and uses the genetic mechanisms of segregation and recombination to produce offspring with many combinations of gene variants. In this way the effect of differences in expression of multiple genes can be studied at once, and combined effects may be revealed that remain hidden when looking at one gene at a time. See Chapter 2 for a more detailed explanation of genetical genomics.

Figure 1.2: The construction of recombinant inbred lines (RILs) from B6 and DBA parental lines. The bars represent chromosome pairs. Grey (white) indicates that this genomic interval is derived from B6 (DBA). The numbers 1-5 represent molecular markers. Starting from the homozygous parents, a heterozygous F1 is created. Crossing-overs cause a mixture of B6 and DBA genomes in the F2 generation. The two circles represent parts of the genome that are already fixed for one parental type. Further inbreeding provides homozygous RILs (BXD-101, BXD-102 and BXD-103). Figure copied from http://www.informatics.jax.org.
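The marker-trait comparison described above can be sketched as a simple single-marker scan. This is a minimal illustration, not the mapping method used in the thesis: it scores each marker by the Pearson correlation between its genotype (coded 0 for B6, 1 for DBA) and the trait, and all marker names, genotypes and trait values are made up for the example.

```python
from statistics import mean


def marker_trait_correlation(genotypes, trait):
    """Pearson correlation between a 0/1-coded marker and a quantitative trait."""
    x = [1.0 if g == "D" else 0.0 for g in genotypes]
    mx, my = mean(x), mean(trait)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, trait))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in trait)
    if sxx == 0 or syy == 0:
        return 0.0  # marker or trait does not vary: no information
    return sxy / (sxx * syy) ** 0.5


def scan_markers(marker_genotypes, trait):
    """Score every marker; a strong correlation suggests a QTL near that marker."""
    return {m: marker_trait_correlation(g, trait) for m, g in marker_genotypes.items()}


# Hypothetical data: one trait value (e.g. a log expression level) per RIL,
# and the parental origin ("B" or "D") of each RIL at three markers.
trait = [10.1, 9.8, 12.2, 12.0, 10.0, 11.9]
markers = {
    "M1": ["B", "D", "B", "D", "B", "D"],
    "M2": ["B", "B", "D", "D", "B", "D"],
    "M3": ["D", "B", "D", "B", "B", "D"],
}
scores = scan_markers(markers, trait)
best_marker = max(scores, key=lambda m: abs(scores[m]))
```

In genetical genomics the same scan would be repeated with the expression level of each gene as the trait, so that every gene gets its own eQTL profile.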

The genomic location of an eQTL can be compared with the location of the gene for which the eQTL was found. When the eQTL is located close to the gene under study (typically within 5 or 10 Mb), the eQTL is said to be cis-acting. When the eQTL is located farther away from the gene, the eQTL is said to be trans-acting.
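The cis/trans rule above amounts to a distance check. The sketch below assumes positions given in megabases and a 10 Mb window; the function name and the choice to call an eQTL on a different chromosome trans-acting are assumptions made for the illustration.

```python
def classify_eqtl(gene_chr, gene_pos_mb, qtl_chr, qtl_pos_mb, window_mb=10.0):
    """Label an eQTL as cis-acting if it lies within window_mb of its gene."""
    same_chromosome = gene_chr == qtl_chr
    if same_chromosome and abs(gene_pos_mb - qtl_pos_mb) <= window_mb:
        return "cis"
    return "trans"
```

For example, `classify_eqtl("1", 50.0, "1", 54.0)` falls inside the window, whereas an eQTL 30 Mb away on the same chromosome does not.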

Microarrays have become increasingly affordable. This makes it possible to create gene expression profiles of tens or even hundreds of individuals and to fully exploit the power of genetical genomics. Indeed, multiple genetical genomics studies have recently appeared; see the review by Rockman and Kruglyak (Rockman and Kruglyak 2006) and the references therein. Table 1.1 lists the majority of the studies in which Affymetrix arrays were used and indicates the numbers of cis and trans genes reported.

Paper                  Organism   No. cis genes   No. trans genes
Hubner et al. 2005     rat        97              18
Bystrykh et al. 2005   mouse      162             136
Chesler et al. 2005    mouse      92              9
Morley et al. 2004     human      27              110
Cheung et al. 2005     human      12              1

Table 1.1: The numbers of cis and trans genes reported in the main genetical genomics studies that used Affymetrix arrays.

Many research groups are working on the genetics of complex traits (http://www.complextrait.org). http://www.genenetwork.org is an example of a valuable repository of complex trait data and analysis tools. It can be used to help understand interactions among networks of gene variants, transcripts and 'classical' traits such as cancer susceptibility. It incorporates 'classical' trait data collected over more than 25 years, genetic maps, genome sequence and large microarray datasets. QTL analysis can be performed on both 'classical' traits and gene expression traits.

In each of the experiments in Table 1.1, the individuals used for gene expression profiling are expected to show genetic variation. One example of genetic variation is the single nucleotide polymorphism (SNP). This is a place in the DNA where individuals differ by one nucleotide, e.g. where one individual carries a G and another carries an A. Another type of genetic variation consists of insertions and deletions, where a piece of DNA is either added or missing. A third type is gene copy number variation, where different individuals have different numbers of copies of a gene in their DNA.

1.2 Affymetrix microarrays

This thesis focuses on Affymetrix GeneChip® short-oligonucleotide arrays (Lockhart et al. 1996). These arrays are small wafers whose area is divided into thousands of very small pieces (spots). Each spot contains so-called probes. Probes are small synthesized pieces of single-stranded DNA (Figure 1.3A). Probes are chosen to fit a unique part of the target RNA whose quantity has to be measured. They are complementary to this sequence, so that the RNA will automatically bind (hybridize) to the probe. The target RNA is labeled with a fluorescent label. Figure 1.3B shows how the target RNA binds to probes on a microarray. Finally, the RNA molecules that did not bind to any probe are washed away from the array and the amount of fluorescence for each probe is measured in a scanner. This amount represents the amount of RNA that was present for each gene.

Figure 1.3: (A) DNA probes on an Affymetrix microarray. (B) Fluorescently labeled RNA hybridizes to probes on the microarray. Courtesy of Affymetrix.

(A)
  ..ACTTGTGCCGTTCTCTGAGAGAAAAAGACTGA..   DNA
       TGTGCCGTTCTCTGAGAGAAAAAGA         PM probe 1
       TGTGCCGTTCTCAGAGAGAAAAAGA         MM probe 1
        GTGCCGTTCTCTCAGAGAAAAAGAC        MM probe 2
        GTGCCGTTCTCTGAGAGAAAAAGAC        PM probe 2

(B)
  Probe   Sequence
  1       CTCTGTGAATTCTCCTCCGAGAGAG
  2       TGTGCCGTTCTCTGAGAGAAAAAGA
  3       GTGCCGTTCTCTGAGAGAAAAAGAC
  4       GGGCTCTGAACCAGTATCCGGCATT
  5       TCTGAACCAGTATCCGGCATTATAT
  6       CTACAGAGATTTCTTTCCAGAGTAT
  7       TACAGAGATTTCTTTCCAGAGTATA
  8       CAGAGATTTCTTTCCAGAGTATATG
  9       TGACCCACAGAGCTACTGTCCCATG
  10      GCAGCTGCAGGGACAGATTGCCCTC
  11      TGCAGGGACAGATTGCCCTCCTGGT
  12      TTCCATGTCGCTAATCCCGTGATTA
  13      TCGCTAATCCCGTGATTATGTCTGC
  14      CGCTAATCCCGTGATTATGTCTGCA
  15      CCCGTGATTATGTCTGCAAACATGG
  16      GCAGAGCCAGGCAGGCTGTAAATAA

Figure 1.4: (A) Part of the target DNA sequence and two example PM and MM probe sequences. The PM probes perfectly match the target, and in the MM probes the middle nucleotide is replaced by its complement. (B) PM probe sequences of probe set 102934_s_at for gene Cdc25c.
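The relation between a PM probe and its MM partner described in Figure 1.4 can be written down directly. The sketch below is illustrative; `mismatch_probe` is a hypothetical helper name, and the two sequences are PM probes 1 and 2 from Figure 1.4A.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its middle base complemented."""
    mid = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


PM_PROBE_1 = "TGTGCCGTTCTCTGAGAGAAAAAGA"
PM_PROBE_2 = "GTGCCGTTCTCTGAGAGAAAAAGAC"
mm_probe_1 = mismatch_probe(PM_PROBE_1)
mm_probe_2 = mismatch_probe(PM_PROBE_2)
```

Applying the function twice returns the original PM sequence, since complementing a base is its own inverse.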

The Affymetrix technology uses multiple probes per gene to measure RNA abundances. For each gene one or more probe sets are designed, each containing between 11 and 20 pairs of perfect match (PM) and mismatch (MM) probes (Figure 1.4A). Figure 1.4B shows the sequences of the 16 PM probes in example probe set 102934_s_at. The PM and MM probes of a pair are always located next to each other on the microarray, but the PM/MM pairs themselves are randomly distributed over the area of the microarray. Figure 1.5A shows the relative probe positions on the RNA (top) and PM probe signals (bottom) for this probe set. It is striking that, although the probes are designed to measure the same RNA target, their signals can be quite different. This is nevertheless a representative example of how the signals in a probe set behave: in general, the probes in a probe set can give very different signals, but when comparing multiple samples (lines in the figure), the probe signals mostly follow the same trend.

Figure 1.5: (A) Relative probe positions on the RNA for probe set 102934_s_at (top) and PM probe signals (bottom). Each line represents a mouse sample. (B) Similar plots for the log-transformed PM signals.

The differences in signal between probes can be caused by many factors (summarized in Table 1.2). 1) The melting temperature of DNA molecules is defined as the temperature at which 50% of the molecules form a stable double helix and the other 50% are separated into single-stranded molecules. The melting temperature depends on the sequence length and nucleotide composition; in particular, it is related to the number of G and C nucleotides occurring in the sequence. The higher the melting temperature of a probe, the better it binds RNA molecules (Figure 1.6A). 2) RNA is amplified before it is hybridized to the microarray. This amplification starts at the end of the RNA and stops at random places, so probes located closer to the end of the RNA tend to show a higher signal. 3) When the sequence of a probe is based on an erroneous messenger sequence, it will probably not measure any target and will give a low signal. When the probe sequence also fits other RNAs, those RNAs will also bind to this probe (cross-hybridization) and the signal will become higher (Figure 1.6B). 4) Both RNA and probes can form secondary structure, e.g. by binding to themselves, which could hinder hybridization. The following two points are purely hypothetical: 5) When the number of probes per spot differs between spots on the array, differences in probe signals can arise. 6) When there is an error in a mask during chip production, wrong probes will appear on the chip, leading to lower signals.

No.   Factor
1     Probe melting temperature / CG content
2     Probe location on the RNA
3     Errors in probe design
4     Probe and RNA secondary structure
5     Probe density on array
6     Errors during array manufacturing

Table 1.2: Six factors leading to differences in signal for probes in one probe set.

Figure 1.6: (A) Probe melting temperature vs. probe intensity. Probes were divided into groups according to their melting temperature, calculated following (SantaLucia 1998). The probe intensities are plotted per group (black). The grey crosses are the mean intensity values per group, and increase for higher melting temperatures. The melting temperature explains only part of the variation in intensity. (B) Number of probe matches against a messenger sequence database (BLAST group) vs. log2 probe intensity. Probe sequences were compared with messenger sequences and probes were grouped according to the matching results. -2 and -1 represent probes with a best hit with 2 or 1 mismatch(es). 0 to 5 represent probes with 1 to 5 perfect matches. M (missing) probes were not found and A (anti-sense) probes had the wrong orientation. The better the probes match with one or multiple genes, the higher the signal.
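Factor 1 can be made concrete with a rough calculation. The sketch below is not the nearest-neighbour method of SantaLucia (1998) used for Figure 1.6A; it substitutes the much cruder Wallace rule of thumb (2 degrees per A/T, 4 degrees per G/C), which is intended for short oligonucleotides and is used here only to show how GC content shifts the estimate. The function names are hypothetical; the two sequences are probes 2 and 10 from Figure 1.4B.

```python
def gc_fraction(seq):
    """Fraction of bases in the sequence that are G or C."""
    return sum(1 for base in seq if base in "GC") / len(seq)


def wallace_tm(seq):
    """Rule-of-thumb melting temperature in degrees Celsius (Wallace rule).

    This is a coarse stand-in for a nearest-neighbour calculation, not the
    method used in the thesis.
    """
    at = sum(1 for base in seq if base in "AT")
    gc = sum(1 for base in seq if base in "GC")
    return 2 * at + 4 * gc


PROBE_2 = "TGTGCCGTTCTCTGAGAGAAAAAGA"
PROBE_10 = "GCAGCTGCAGGGACAGATTGCCCTC"
tm_probe_2 = wallace_tm(PROBE_2)
tm_probe_10 = wallace_tm(PROBE_10)
```

Probe 10 contains more G and C bases than probe 2 and therefore gets the higher estimate, in line with the trend described under factor 1.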

The log-transformed PM signals of Figure 1.5A are shown in Figure 1.5B. When comparing both figures, one sees that on the original scale the variation per probe can differ considerably, e.g. probe 6 shows little variation and probe 11 much. On log scale the variation between individuals is much more consistent over probes. Differences between individuals can be studied better because they are consistent over probes and are not diminished by probes with low variation. Microarray data are usually log transformed, since after the transformation the distribution of the data is much closer to a normal distribution, and many statistical tests assume the data to be normally distributed. Moreover, on log scale one obtains symmetry in up- and downregulation, e.g. on log2 scale an increase of 1 corresponds to 2-fold upregulation and a decrease of 1 corresponds to 2-fold downregulation.
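The symmetry argument can be verified numerically. In the sketch below the baseline intensity of 1000 is an arbitrary value chosen for illustration, and `log2_fold_change` is a hypothetical helper name.

```python
import math


def log2_fold_change(signal, baseline):
    """Difference between two signals on log2 scale."""
    return math.log2(signal / baseline)


baseline = 1000.0
doubled, halved = 2 * baseline, baseline / 2

# On the raw scale, doubling and halving give changes of different size.
raw_up = doubled - baseline
raw_down = halved - baseline

# On log2 scale, the same changes are +1 and -1.
log_up = log2_fold_change(doubled, baseline)
log_down = log2_fold_change(halved, baseline)
```

On the raw scale the two changes are +1000 and -500, whereas on log2 scale they have equal magnitude and opposite sign.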

1.3 Preprocessing Affymetrix array data

In recent years many research groups have developed algorithms for preprocessing Affymetrix array data. Usually, the following steps are performed: background correction, normalization and summarization. Background signal is signal that is measured irrespective of any true signal, and several methods have been proposed to correct for it. Usually multiple microarrays are used within one experiment. Because of differences introduced, for example, during the sample preparation process or the RNA amplification process, the overall signal levels can differ between arrays. Between-array normalization equates the overall signal of multiple arrays so that fair comparisons can be made. In the summarization step, the multiple probe signals per probe set are summarized into one expression value per probe set.

Below, I will discuss the Microarray Suite 5.0 software that Affymetrix provides for analyzing GeneChip® data (MAS 5.0; Affymetrix Microarray Suite User Guide, Version 5, www.affymetrix.com/support/technical/manuals.affx) and the software that I have used throughout this thesis: Robust Multi-array Average (RMA; Irizarry, Hobbs, Collin, Beazer-Barclay, Antonellis, Scherf and Speed 2003). I used RMA because it performs better than MAS 5.0 (Irizarry, Bolstad, Collin, Cope, Hobbs and Speed 2003): RMA has higher precision, especially for lower expression values, provides more consistent estimates of fold change, and provides higher specificity and sensitivity when fold change analysis is used to detect differential expression. Chapter 6 shows how RMA is performed in the statistical programming language R.

MAS 5.0 background correction. Each array is divided into equally spaced zones and an average background is assigned to the center of each zone. The distance from each spot to the center of each zone is calculated, and these distances are used as weights in calculating the background per spot from the background values of the zones. Finally, the background is subtracted from the signal.

MAS 5.0 summarization. The signal of a probe set is defined as the anti-log of a robust average (Tukey biweight) of the values $\log(PM_j - CT_j)$ for probe pairs $j = 1, \dots, J$, where $CT_j$ is $MM_j$ when $MM_j < PM_j$, but is adjusted to be less than $PM_j$ when $MM_j > PM_j$. The adjustment is based on the average difference between PM and MM or, if that measure is too small, on some fraction of PM.
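The robust average in the summarization step can be sketched as a one-step Tukey biweight. This is a hedged illustration rather than the MAS 5.0 implementation: the tuning constant `c=5.0`, the small `epsilon`, the use of base-2 logarithms and the helper names are assumptions, and the computation of $CT_j$ itself is left out (the CT values are taken as given and must be smaller than the PM values).

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    s = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + epsilon)  # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def probe_set_signal(pm, ct):
    """Anti-log of the biweight average of log2(PM_j - CT_j) over probe pairs."""
    logs = [math.log2(p - c_j) for p, c_j in zip(pm, ct)]
    return 2 ** tukey_biweight(logs)
```

Values far from the median receive weight zero, so a single aberrant probe pair has little influence on the probe set signal.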

MAS 5.0 normalization. All arrays are scaled so that they have the same

mean value.
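Scaling arrays to a common mean, as described for the normalization step, can be sketched as follows. The choice of the average of the array means as the default target and the function name are assumptions made for the illustration; the source only states that all arrays end up with the same mean value.

```python
def scale_to_common_mean(arrays, target=None):
    """Multiply each array by a constant so that all arrays share one mean."""
    means = [sum(a) / len(a) for a in arrays]
    if target is None:
        target = sum(means) / len(means)  # assumed default: average of array means
    return [[v * target / m for v in a] for a, m in zip(arrays, means)]


# Two hypothetical arrays measuring the same three probe sets; the second
# array is twice as bright overall.
scaled = scale_to_common_mean([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

Scaling changes only the overall level of each array; the relative differences between probe sets within an array are preserved.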

RMA background correction. The background correction is based on PM values only. The observed signal is modeled as the sum of a background component and a specific signal, where the background follows a normal distribution and the specific signal follows an exponential distribution.

RMA normalization. RMA uses a procedure called quantile normalization, which gives the signals of each array the same distribution. The algorithm is as follows:
1) Collect the signals of all p probes of n arrays in a matrix $X$ with n columns and p rows.
2) Sort each column of $X$ to give $X_{sort}$.
3) Take the mean across each row of $X_{sort}$ and assign this mean to each element in the row to get $X'_{sort}$.
4) Get the normalized data by rearranging each column of $X'_{sort}$ to have the same ordering as the original $X$.
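The four steps above translate almost line by line into code. The sketch below uses plain Python lists, takes `X` as a list of p rows with n values each, and breaks ties by original row order; the tie handling and the function name are assumptions, since the description does not specify them.

```python
def quantile_normalize(X):
    """Quantile-normalize a p x n matrix given as a list of rows (probes x arrays)."""
    p, n = len(X), len(X[0])
    # Step 1: view the data column by column (one column per array).
    columns = [[X[i][j] for i in range(p)] for j in range(n)]
    # Step 2: sort each column, remembering which row each value came from.
    orders = [sorted(range(p), key=col.__getitem__) for col in columns]
    sorted_cols = [[col[i] for i in order] for col, order in zip(columns, orders)]
    # Step 3: mean across each row of the sorted matrix.
    row_means = [sum(sc[r] for sc in sorted_cols) / n for r in range(p)]
    # Step 4: put the means back in the original ordering of each column.
    normalized = [[0.0] * n for _ in range(p)]
    for j, order in enumerate(orders):
        for rank, i in enumerate(order):
            normalized[i][j] = row_means[rank]
    return normalized


# Three probes measured on two hypothetical arrays.
normalized = quantile_normalize([[2.0, 4.0], [4.0, 8.0], [6.0, 6.0]])
```

After normalization every column contains exactly the same set of values, so the arrays share one distribution while each probe keeps its rank within its own array.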

RMA summarization. Expression measures per probe set are obtained via a linear additive model:

$$\log(PM_{ij}) = \mu_i + \alpha_j + \epsilon_{ij}, \quad i = 1, \dots, I \text{ and } j = 1, \dots, J$$

with $\alpha_j$ a probe affinity effect for probe $j$, $\mu_i$ the log-scale expression level for array $i$, and $\epsilon_{ij}$ an error term. Note that these algorithms calculate an average expression value per probe set. By doing this, valuable information may be lost.
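To show what fitting the additive model involves, the sketch below estimates $\mu_i$ and $\alpha_j$ by ordinary least squares under the constraint that the probe effects sum to zero. This is a simplified stand-in: RMA is usually described as fitting the model robustly (by median polish), which the text above does not detail, and the function name and example values are hypothetical.

```python
def fit_additive_model(Y):
    """Least-squares fit of Y[i][j] = mu[i] + alpha[j] with sum(alpha) = 0.

    Y holds log-scale PM values, one row per array i and one column per probe j.
    """
    n_arrays, n_probes = len(Y), len(Y[0])
    # Expression level per array: mean over the probes of the probe set.
    mu = [sum(row) / n_probes for row in Y]
    grand_mean = sum(mu) / n_arrays
    # Probe affinity: mean over arrays, relative to the grand mean.
    alpha = [
        sum(Y[i][j] for i in range(n_arrays)) / n_arrays - grand_mean
        for j in range(n_probes)
    ]
    return mu, alpha


# Two hypothetical arrays, three probes; values follow the model exactly.
mu_hat, alpha_hat = fit_additive_model([[7.0, 8.0, 9.0], [9.0, 10.0, 11.0]])
```

The estimated $\mu_i$ is the single expression value reported per array for the probe set, which is why probe-level detail is no longer visible after summarization.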

1.4 Processing batches

Large-scale microarray experiments are often split up into multiple phases (batches). This occurs, for example, because the sample preparation takes so much effort that it cannot be done at once, or because not all samples are available at the same time. A dataset of 30 mouse samples hybridized to MG-U74Av2 arrays (Bystrykh et al. 2005) was created in three processing batches: batches 1, 2 and 3, containing 10, 12 and 8 arrays respectively. Figure 1.7 shows boxplots of all PM signals for each of the 30 arrays, colored according to the batch in which the arrays were processed. As can be seen from the figure, the overall signal for all arrays in the third (rightmost) batch is lower than the overall signal of the arrays in the other batches. Although between-array normalization equates the overall signal of all 30 arrays, it does not remove the observed batch effect from the data, as can be seen for an example probe set in Figure 1.8. Figure 1.8A shows the PM signals after quantile normalization; each line (individual) is colored according to the batch in which it was processed. The batch effect is clearly visible. Figure 1.8B shows a similar plot for another probe set. There we see that the batch effect can be probe-specific: for probe 6 the individuals from the light grey batch have the lowest signal, while for probes 14-16 they have the highest signal. In Chapter 3 we will see how these effects can be corrected for.

Figure 1.7: Boxplots of the PM signals for 30 mouse microarrays. The three colors indicate the three batches in which the arrays were processed. [x-axis: BXD numbers, in plotted order: 6 8 11 12 31 32 33 38 40 42 1 2 5 9 14 16 18 19 21 28 34 39 15 22 24 25 27 29 30 36; y-axis: log2 raw PM intensity.]

Figure 1.8: PM signals of two example probe sets showing a reasonably consistent batch effect (A) and probe-specific batch effects (B). Individuals are colored according to the batches to which they belong.
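A batch effect of the kind shown in Figure 1.8 can be exposed by comparing per-batch means for a single probe. The sketch below is a simplified illustration, not the ANOVA model of Chapter 3: it computes the mean log signal per batch and subtracts it, which corresponds to removing a single 'batch' factor for one probe. The helper names and all values are made up.

```python
from collections import defaultdict


def batch_means(values, batches):
    """Mean signal per batch for one probe."""
    groups = defaultdict(list)
    for value, batch in zip(values, batches):
        groups[batch].append(value)
    return {batch: sum(vs) / len(vs) for batch, vs in groups.items()}


def center_by_batch(values, batches):
    """Subtract each batch's mean so that batches share a common level."""
    means = batch_means(values, batches)
    return [value - means[batch] for value, batch in zip(values, batches)]


# Hypothetical log2 signals of one probe for four arrays in two batches.
signals = [8.0, 10.0, 5.0, 7.0]
batches = [1, 1, 2, 2]
means = batch_means(signals, batches)
centered = center_by_batch(signals, batches)
```

Repeating this per probe would reveal probe-specific batch effects such as those in Figure 1.8B, where the sign of the batch difference changes from probe to probe.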

1.5 Outline of the thesis

Chapter 2 introduces the concepts of genetical genomics in more detail. It argues that the design and analysis of genetical genomics experiments is not just a matter of simply applying the existing computational tools for microarrays to this new type of experimental data. Optimal experimental design and analysis are described for two major microarray technologies: cDNA two-colour arrays and Affymetrix short-oligonucleotide arrays.

Chapter 3 focuses on a particular mouse genetical genomics dataset, in which gene expression was measured using Affymetrix arrays. Arrays were processed in three batches and this introduced a batch effect in the data. At places where batches and marker alleles were spuriously correlated, false QTLs were detected. By applying an ANOVA model with a factor 'batch', the batch effect was successfully eliminated. Since this ANOVA model is applied to the probe-level data, the interaction between probe effect and QTL effect could be examined. It appeared that among cis genes this interaction term is often significant, indicating that for cis genes the QTL effect is often not consistent over probes. We show a clear example where the probe-specific QTL effect is caused by two SNPs covering only part of the probes, and where a difference in signal is observed only in the probes that overlap the SNPs. In this case the gene is not differentially expressed; there is only a difference in hybridization, caused by the fact that, because of the SNPs, the RNA of one strain binds perfectly to the array while the RNA of the other strain has a mismatch with the probes on the array.

Chapter 4 introduces a verification protocol for the probe sequences of Affymetrix human, mouse and rat arrays. Since the time of probe design by Affymetrix, knowledge about messenger and genomic sequences has increased. Probe sets could have been designed on erroneous EST sequences and hence might not actually measure any occurring RNA. Furthermore, the definitions of genes change, since UniGene clusters are merged, redivided or deprecated over time. The verification protocol checks all probes on Affymetrix arrays against multiple messenger databases and the genome and updates the probe set definitions.

Chapter 5 shows that in multiple genetical genomics studies many false cis eQTLs are detected because of polymorphisms in the probe regions. When inbred populations are used and the microarrays are designed based on the sequence of one of the parental strains, polymorphisms between the parental strains cause an excess of cis eQTLs where the hybridization signal of the inbred lines carrying the allele of the strain used for array design is higher than that of inbred lines carrying the other allele. This is shown to occur in multiple studies and a statistical solution implemented in a downloadable application is provided.

Chapter 6 presents the R and Perl code of the most important computational protocols developed in this thesis.

Finally, Chapter 7 combines the general discussion, Dutch summary, acknowledgements (dankwoord), CV, publications and poster award.


1.6 Glossary

Term               Description

DNA                Abbreviation for deoxyribonucleic acid, the molecule that contains the genetic code for all life forms except for a few viruses. It consists of two long, twisted chains made up of nucleotides. Each nucleotide contains one base, one phosphate molecule, and the sugar molecule deoxyribose. The bases in DNA nucleotides are adenine, thymine, guanine, and cytosine.

RNA                Abbreviation for ribonucleic acid, the molecule that carries out DNA's instructions for making proteins. It consists of one long chain made up of nucleotides. Each nucleotide contains one base, one phosphate molecule, and the sugar molecule ribose. The bases in RNA nucleotides are adenine, uracil, guanine, and cytosine. There are three main types of RNA: messenger RNA, transfer RNA, and ribosomal RNA.

protein            A molecule or complex of molecules consisting of subunits called amino acids. Proteins are the cell's main building materials, and they do most of a cell's work.

nucleotide         A building block of DNA or RNA. It includes one base, one phosphate molecule, and one sugar molecule (deoxyribose in DNA, ribose in RNA).

oligonucleotide    A short sequence of nucleotides.

microarray         Microarrays consist of large numbers of molecules (often, but not always, DNA) distributed in rows in a very small space. Microarrays permit scientists to study gene expression by providing a snapshot of all the genes that are active in a cell at a particular time.

probe              Synthesized stretch of DNA to be attached to a microarray.

probe set          Collection of multiple probes, intended to measure the expression of one gene.

hybridize          To anneal nucleic acid strands from different sources.

PM probe           Perfect match probe, perfectly matching the target gene.

MM probe           Mismatch probe, equal to the PM probe except for the middle nucleotide, which is replaced by its complement.

QTL                Quantitative trait locus. Location on the genome where gene(s) are located that likely regulate some quantitative trait.

genetical genomics Method to identify QTLs for gene expression traits.

RIL                Recombinant inbred line. RILs are genetic populations whose genomes are mixtures of two parental founder genomes.

cis-acting QTL     QTL that colocates with its gene.

trans-acting QTL   QTL that does not colocate with its gene.

SNP                Single nucleotide polymorphism. A difference in the DNA or RNA of only one nucleotide.

CDF                Chip Definition File.


1.7 Websites

URL                    Description

www.affymetrix.com     Affymetrix website
www.bioconductor.org   Open source software for bioinformatics
www.r-project.org      The statistical programming language R
www.genenetwork.org    Large collection of genotype, phenotype and gene expression data for complex trait analysis and gene network reconstruction
www.rgenetics.org      Open source R software for the analysis of genetic data
www.complextrait.org   Website of the Complex Trait Consortium
