gene array analysis statistical genetics - class 10 gene array description normalization data...

Gene Array Analysis

Statistical genetics - Class 10

Gene array description

Normalization

Data Analysis

Multiple measurements

What is a gene array

Gene arrays are solid supports upon which a collection of gene-specific nucleic acids have been placed at defined locations, either by spotting or direct synthesis.

In array analysis, a nucleic acid-containing sample is labeled and then allowed to hybridize with the gene-specific targets on the array.

Based on the amount of probe hybridized to each target spot, information is gained about the specific nucleic acid composition of the sample.

The major advantage of gene arrays is that they can provide information on thousands of targets in a single experiment.

Nomenclature

Many terms exist for naming gene arrays, including: biochip, DNA chip, GeneChip® (a registered trademark of Affymetrix, Inc.), DNA array, microarray, and macroarray.

Microarray and macroarray may be used to differentiate between spot size or the number of spots on the support.

Glass Support

Experiment

A typical gene array experiment involves:
1. Isolating RNA from the samples to be compared
2. Converting the RNA samples to labeled cDNA via reverse transcription; this step may be combined with aRNA amplification
3. Hybridizing the labeled cDNA to identical membrane or glass slide arrays
4. Removing the unhybridized cDNA
5. Detecting and quantitating the hybridized cDNA
6. Comparing the quantitative data from the various samples

General Picture

Choosing Cell Populations

The goal of comparative cDNA hybridization is to compare gene transcription in two or more different kinds of cells. For example:

Tissue-specific Genes - Cells from two different tissues (say, cardiac muscle and prostate epithelium) are specialized for performing different functions in an organism. Although we can recognize cells from different tissues by their phenotypes, it is not known just what makes one cell function as smooth muscle, another as a neuron, and still another as prostate.

Ultimately, a cell's role is determined by the proteins it produces, which in turn depend on its expressed genes. Comparative hybridization experiments can reveal genes which are preferentially expressed in specific tissues.


Genetic disease is often caused by genes which are inappropriately transcribed -- either too much or too little -- or which are missing altogether.

Such defects are especially common in cancers, which can occur when regulatory genes are deleted, inactivated, or become constitutively active.

Unlike some genetic diseases (e.g. cystic fibrosis) in which a single defective gene is always responsible, cancers which appear clinically similar can be genetically heterogeneous.

For example, prostate cancer (prostatic adenocarcinoma) may be caused by several different, independent regulatory gene defects even in a single patient.


Cell Cycle Variations - Cells undergo DNA replication, mitosis, and eventually death. These activities require quite different gene products, such as DNA polymerases for genome replication or microtubule spindle proteins for mitosis. A cell's genes encode the "programs" for these activities, and gene transcription is required to execute those programs. Comparative hybridization can be used to distinguish genes that are expressed at different times in the cell cycle. In this way, the pathways responsible for controlling basic life processes can be uncovered.

mRNA Extraction

Genes which code for protein are transcribed into messenger RNA's (mRNA's) in the cell nucleus. The mRNA's in turn are translated into proteins by ribosomes in the cytoplasm. The transcription level of a gene is taken to be the amount of its corresponding mRNA present in the cell. Comparative hybridization experiments compare the amounts of many different mRNA's in two cell populations.

To prepare mRNA for use in a microarray assay, it must be purified from total cellular contents. mRNA accounts for only about 3% of all RNA in a cell.

Common mRNA isolation methods take advantage of the fact that most mRNA's have a poly-adenine (poly(A)) tail. These poly(A)+ mRNA's can be purified by capturing them using complementary oligodeoxythymidine (oligo(dT)) molecules bound to a solid support.

Reverse transcription

Captured mRNA's are still difficult to work with because they are prone to being destroyed.

The environment is full of RNA-digesting enzymes, so free RNA is quickly degraded. To prevent the experimental samples from being lost, they are reverse-transcribed back into more stable DNA form. The products of this reaction are called complementary DNA's (cDNA's) because their sequences are the complements of the original mRNA sequences.

A problem with cDNA production is that not all mRNA's are reverse-transcribed with the same efficiency. This fact leads to reverse transcription bias, which can change the relative amounts of different cDNA's measured by the microarray assay.

Reverse transcription bias is not a problem when comparing the same mRNA across two cell populations, unless it prevents that mRNA from being reverse-transcribed at all.

However, the bias does prohibit quantitative comparison between different mRNA's on one array.

Fluorescent labeling of cDNA's

In order to detect cDNA's bound to the microarray, we must label them with a reporter molecule that identifies their presence. The reporters currently used in comparative hybridization to microarrays are fluorescent dyes (fluors).

A differently-colored fluor is used for each sample so that we can tell the two samples apart on the array. The labeled cDNA samples are called probes because they are used to probe the collection of spots on the array.

Fluors do not show their colors unless stimulated with a specific frequency of light by a laser. Even then, the colors are not directly observed; rather, the wavelength of the emitted light is used to tune a detector which measures the fluorescence.

Normalization

The number of fluor molecules which label each cDNA depends on its length and possibly its sequence composition, both of which are often unknown.

This is one more reason that fluorescent intensities for different cDNA's cannot be quantitatively compared. However, identical cDNA's from the two probes are still comparable as long as the same number of label molecules are added to the same DNA sequence in each probe.

To equalize the total concentrations of the two cDNA probes before applying them to the array, the probe solutions are diluted to have the same overall fluorescent intensity.

This procedure makes two possibly unjustified assumptions:
1. that the total amount of mRNA in each cell type being tested is identical
2. that each fluor emits the same amount of light relative to its concentration.
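The equalization described here is done at the bench by dilution. As a rough computational analogue, applied to scanned intensities rather than to the probe solutions, one channel can be rescaled so that both channels have the same total intensity. The function name and the example intensities below are hypothetical, and the sketch inherits both assumptions listed above.

```python
def scale_to_equal_total(red, green):
    """Rescale the green channel so that both channels share the same total intensity.

    Returns the rescaled green intensities and the scale factor that was applied.
    """
    total_red = sum(red)
    total_green = sum(green)
    if total_green <= 0:
        raise ValueError("green channel must have positive total intensity")
    factor = total_red / total_green
    return [g * factor for g in green], factor


# Hypothetical spot intensities for four spots (arbitrary units).
red = [100.0, 400.0, 250.0, 50.0]
green = [50.0, 220.0, 110.0, 20.0]
green_scaled, factor = scale_to_equal_total(red, green)
print(factor, green_scaled)
```

A single global factor cannot correct intensity-dependent dye effects, which is one motivation for the regression-based normalization covered later.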

Hybridization to a DNA Microarray

The two cDNA probes are tested by hybridizing them to a DNA microarray.

The array holds hundreds or thousands of spots, each of which contains a different DNA sequence.

In this way, every spot on an array is an independent assay for the presence of a different cDNA. There is enough DNA on each spot that both probes can hybridize to it at once without interference.

Microarrays are made from a collection of purified DNA's. A drop of each type of DNA in solution is placed onto a specially-prepared glass microscope slide by an arraying machine. The arraying machine can quickly produce a regular grid of thousands of spots in a square about 2 cm on a side.

Scanning the Hybridized Array

Once the cDNA probes have been hybridized to the array and any loose probe has been washed off, the array must be scanned to determine how much of each probe is bound to each spot.

The probes are tagged with fluorescent reporter molecules which emit detectable light when stimulated by a laser.

The emitted light is captured by a detector, usually a charge-coupled device (CCD).

Spots with more bound probe will have more reporters and will therefore fluoresce more intensely.

The scanner also records light from a few molecules that hybridized either to the wrong spot or nonspecifically to the glass slide. This extra light becomes the background of the scanned array image.

Affymetrix arrays

• 10^7 copies per oligo in a 24 x 24 µm square

• Use 20 pairs of different 25-mers per gene

• Perfect match and mismatch

Values for genes on one array

Two color arrays: ratio of “red” to “green” intensities. Usually expressed as log(R/G).

One-color arrays (Affymetrix): “signal” – just a relative measure of expression of the gene.

Either way we have a number that represents the expression level of the gene.

To keep things simple, for two-color arrays a common reference sample R is often used for all arrays.

A/R, B/R, C/R, D/R, E/R, etc. Equivalent for one-color arrays: A, B, C, D, E.
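A small worked example may help show why a common reference is convenient: on the log scale, the reference cancels when two samples are compared. The intensities below are made up for illustration, and base 2 is an assumption (the slide only says "log").

```python
import math


def log_ratio(sample, reference):
    """log2 of sample intensity over reference intensity for one gene."""
    return math.log2(sample / reference)


# Hypothetical intensities for one gene in samples A and B and in reference R.
a, b, ref = 800.0, 200.0, 400.0

m_a = log_ratio(a, ref)   # log2(A/R)
m_b = log_ratio(b, ref)   # log2(B/R)

# The reference cancels: log2(A/R) - log2(B/R) = log2(A/B).
a_vs_b = m_a - m_b
print(m_a, m_b, a_vs_b)
```

Here A is two-fold above the reference and B is two-fold below it, so A is four-fold above B, which appears as a difference of 2 on the log2 scale.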

Data Analysis

Normalization

Detection of outliers

Clustering

Multiple measurements

False color images of spotted array

Overlay of two scans of the slide

Compares the two samples

Green = less relative expression

Red = more relative expression

Yellow = equal expression

Dimmer colors = lower expression levels

Normalizing two-color arrays

Due to imbalances in dye labeling, the signals for the two colors are rarely “balanced”.

There are many other sources of non-biological systematic error.

Normalization

[Figure: scatter plots of Cy5 signal (log2) versus Cy3 signal (log2), before and after normalization]

Normalization by iterative linear regression

fit a line (y = mx + b) to the data set

set aside outliers (residuals > 2 x s.e.)

repeat until r^2 changes by < 0.001

then apply slope and intercept to the original dataset

D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
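The iterative procedure on this slide can be sketched in plain Python. Several details are assumptions rather than part of the slide: "s.e." is taken to be the residual standard error of the current fit, x and y are log2 Cy3 and log2 Cy5 signals, and "apply slope and intercept" is read as mapping each y to (y - b) / m so that normalized points fall around the line y = x. The function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = m*x + b. Returns (m, b, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return m, b, r2


def iterative_regression(x, y, tol=0.001, max_iter=50):
    """Fit, set aside points with |residual| > 2 * s.e., refit until r^2 settles."""
    keep = list(range(len(x)))
    prev_r2 = None
    for _ in range(max_iter):
        xs = [x[i] for i in keep]
        ys = [y[i] for i in keep]
        m, b, r2 = fit_line(xs, ys)
        if prev_r2 is not None and abs(r2 - prev_r2) < tol:
            break
        prev_r2 = r2
        residuals = [yi - (m * xi + b) for xi, yi in zip(xs, ys)]
        # Residual standard error with n - 2 degrees of freedom (assumed meaning of "s.e.").
        se = math.sqrt(sum(r * r for r in residuals) / (len(keep) - 2))
        inliers = [i for i, r in zip(keep, residuals) if abs(r) <= 2 * se]
        if len(inliers) < 3:
            break
        keep = inliers
    return m, b, keep


def normalize(y, m, b):
    """Apply the fitted slope and intercept to every point of the original data set."""
    return [(yi - b) / m for yi in y]


# Synthetic example: a dye imbalance (slope 0.8, offset 1.0) plus two outlying spots.
cy3 = [6.0 + 0.25 * i for i in range(30)]
cy5 = [0.8 * xi + 1.0 + 0.05 * (-1) ** i for i, xi in enumerate(cy3)]
cy5[5] += 3.0
cy5[20] -= 3.0
slope, intercept, kept = iterative_regression(cy3, cy5)
cy5_normalized = normalize(cy5, slope, intercept)
```

The outliers are excluded only while estimating the line. The correction itself is applied to all spots, so genuinely changed genes remain in the data set.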

Normalization (Linear)

[Figure: Cy5 signal (log2) versus Cy3 signal (log2); ratio {log2 (Cy5 / Cy3)} versus average signal {log2 (Cy3 + Cy5)/2}, with Loess function fit line]

Normalization (Curvilinear)

G Tseng et al., NAR 2001

Image Analysis

2 images per array

Super-imposing

Grid on image

Clone Id    Ratio
1           1.5
2           0.8
…           …

Gene Ratios

Gene expression levels determined by intrinsic properties of each gene

[Figure: Gene A and Gene B placed on an expression-level scale running from low to high]

Statistical Analysis

Differences in ratios: due to random variation, or meaningful changes?

Hypothesis testing, with H0: no systematic differences between ratios

Most Basic Statistical Analysis

Assumptions:

‘red’ and ‘green’ intensities at a given gene ~ i.i.d. normal with common variance

constant coefficient of variation over the whole gene set


with $T_k = R_k / G_k$,

$$f_{T_k}(t) = \frac{(1+t)\sqrt{1+t^2}}{c\,(1+t^2)^2\sqrt{2\pi}} \exp\left\{-\frac{(t-1)^2}{2c^2(1+t^2)}\right\}$$

with c: coefficient of variation, estimated from data

According to Chen et al. 1997 (J Biomedical Optics, 2(4):364)
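To make the ratio density concrete, the sketch below evaluates it numerically. It assumes the density has the form f(t) = (1+t) sqrt(1+t^2) / (c (1+t^2)^2 sqrt(2 pi)) * exp(-(t-1)^2 / (2 c^2 (1+t^2))) given by Chen et al. The function name and the value c = 0.2 are chosen only for illustration.

```python
import math


def ratio_density(t, c):
    """Null density of the ratio T = R/G for coefficient of variation c.

    Assumed form (Chen et al. 1997):
    f(t) = (1+t) sqrt(1+t^2) / (c (1+t^2)^2 sqrt(2 pi)) * exp(-(t-1)^2 / (2 c^2 (1+t^2)))
    """
    one_plus_t2 = 1.0 + t * t
    front = (1.0 + t) * math.sqrt(one_plus_t2) / (
        c * one_plus_t2 ** 2 * math.sqrt(2.0 * math.pi)
    )
    return front * math.exp(-((t - 1.0) ** 2) / (2.0 * c * c * one_plus_t2))


c = 0.2  # illustrative coefficient of variation; in practice estimated from data
for t in (0.5, 1.0, 2.0):
    print(t, ratio_density(t, c))
```

A useful property of this form is f(1/t) = t^2 f(t), which means a ratio of 2 and a ratio of 1/2 are equally extreme under the null hypothesis. That is why the two rejection regions can be defined symmetrically on the ratio scale.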


Classification with hypothesis testing

[Figure: ratio density with two rejection tails of area α/2 each, labeled under-expressed and over-expressed]

3 classes of genes

Average Difference

Fold Change Graphs

How many times did the expression of this gene change in the treated tissue versus the control?

comparison analysis requires experiment vs control

does not apply to absolute analysis

parameter value in one vs another

Avg diff (perfect match vs mismatch)

Fold Change of Average Difference
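As a minimal sketch of these two quantities, the average difference of a gene is taken here as the mean of (perfect match - mismatch) over its probe pairs, and the fold change as the plain ratio of treated to control average differences. Both definitions are simplifying assumptions; the vendor software may trim outlying probe pairs and define fold change differently. All names and intensities are hypothetical.

```python
def average_difference(pm, mm):
    """Mean of (perfect match - mismatch) over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


def fold_change(treated_avg_diff, control_avg_diff):
    """Plain ratio of average differences (an assumed, simplified definition)."""
    if treated_avg_diff <= 0 or control_avg_diff <= 0:
        raise ValueError("fold change is only meaningful for positive average differences")
    return treated_avg_diff / control_avg_diff


# Hypothetical intensities for four probe pairs of one gene.
mismatch = [20.0, 50.0, 30.0, 60.0]
control_pm = [120.0, 150.0, 130.0, 160.0]
treated_pm = [320.0, 350.0, 330.0, 360.0]

control = average_difference(control_pm, mismatch)
treated = average_difference(treated_pm, mismatch)
print(control, treated, fold_change(treated, control))
```

The guard against non-positive values reflects a practical problem: when mismatch signal exceeds perfect-match signal, the average difference is negative and a ratio has no sensible interpretation.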

Noise and Repeats

>90% 2 to 3 fold

Multiplicative noise

Repeat experiments

Log scale

dist(4,2)=dist(2,1)

log – log plot
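The identity dist(4,2) = dist(2,1) holds on a log scale because multiplicative changes become additive. A short check, with base 2 chosen as an assumption:

```python
import math


def log_distance(a, b):
    """Distance between two expression values on the log2 scale."""
    return abs(math.log2(a) - math.log2(b))


# A two-fold change is the same distance wherever it occurs on the scale.
print(log_distance(4, 2), log_distance(2, 1))

# On the linear scale the same two-fold changes look different.
print(abs(4 - 2), abs(2 - 1))
```

This is the reason multiplicative noise is usually analyzed after a log transform: a two-fold error has the same size for weakly and strongly expressed genes.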

Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.

Goal B: Divide conditions to groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.

Unsupervised Analysis

Clustering Methods

Iteration = 3

•Start with random position of K centroids.

•Iterate until centroids are stable

•Assign points to centroids

•Move centroids to center of assigned points

Centroid Methods - K-means
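The K-means steps listed on this slide can be sketched in plain Python. Two choices are assumptions: the random start picks K of the data points as initial centroids, and distance is squared Euclidean. Function names and the toy data are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Start with random positions: k distinct data points (an assumed choice).
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the center of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Stop once the centroids are stable.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Toy data: two well-separated groups of expression profiles.
profiles = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, labels = k_means(profiles, k=2)
print(sorted(centroids), labels)
```

K must be chosen in advance, and different random starts can give different answers on less clearly separated data, so in practice the algorithm is often run several times.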

Time course experiment

Application of K-means to time course experiments

Agglomerative Hierarchical Clustering

Results depend on distance update method

Single linkage: elongated clusters

Complete linkage: sphere-like clusters

Greedy iterative process

Not robust against noise

No inherent measure to choose the clusters
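A naive sketch of agglomerative clustering shows how the distance update method changes the result. It uses Euclidean distance and stops at a requested number of clusters; both choices are assumptions, since the slide does not specify how clusters are selected. Names and data are hypothetical, and the brute-force search is written for clarity rather than speed.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def cluster_distance(c1, c2, points, linkage):
    """Single linkage uses the closest pair, complete linkage the farthest pair."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return min(dists) if linkage == "single" else max(dists)


def agglomerative(points, n_clusters, linkage="single"):
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Greedy step: merge the closest pair; a merge is never undone.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Points spaced along a line with slowly growing gaps.
chain = [(0.0,), (2.0,), (4.1,), (6.3,), (10.0,)]
print(agglomerative(chain, 2, "single"))    # [[0, 1, 2, 3], [4]]
print(agglomerative(chain, 2, "complete"))  # [[0, 1], [2, 3, 4]]
```

Single linkage follows the chain of nearest neighbors and produces one elongated cluster, while complete linkage prefers compact groups. This is the contrast described on the slide.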

Gene Expression Data

Cluster genes and conditions

2 independent clusterings:

Genes are represented as vectors of expression in all conditions

Conditions are represented as vectors of expression of all genes

[Figure: heat map of colon cancer data (normalized genes); axes: Genes (ticks to 2000) and Experiments (ticks to 60)]

1. Identify tissue classes (tumor/normal)

First clustering - Experiments

2. Find Differentiating And Correlated Genes

Second Clustering - Genes

[Figure labels: Ribosomal proteins, Cytochrome C, HLA2, metabolism]

Two-way Clustering

Coupled Two-way Clustering (CTWC)

Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.

New Goal: Use subsets of genes to study subsets of samples (and vice versa).

A non-trivial task: the number of subsets is exponential.

CTWC is a heuristic to solve this problem.

CTWC of Colon Cancer Data

[Figure: two panels, (A) and (B), with clusters labeled A and B]

Multiple Testing Problem

Simultaneously test m null hypotheses, one for each gene j

Hj: no association between expression measure of gene j and the response

Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue

Increased chance of false positives

Hypothesis Truth Vs. Decision

Truth \ Decision    # not rejected         # rejected            totals
# true H            U                      V (Type I error)      m0
# non-true H        T (Type II error)      S                     m1
totals              m - R                  R                     m

Strong Vs. Weak Control

All probabilities are conditional on which hypotheses are true

Strong control refers to control of the Type I error rate under any combination of true and false nulls

Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)

In general, weak control without other safeguards is unsatisfactory

Adjusted p-values (p*)

Test level (e.g. 0.05) does not need to be determined in advance

Some procedures most easily described in terms of their adjusted p-values

Usually easily estimated using resampling

Procedures can be readily compared based on the corresponding adjusted p-values

A Little Notation

For hypothesis $H_j$, $j = 1, \ldots, m$:

observed test statistic: $t_j$

observed unadjusted p-value: $p_j$

Ordering of observed (absolute) $t_j$: $\{r_j\}$ such that $|t_{r_1}| \ge |t_{r_2}| \ge \ldots \ge |t_{r_m}|$

Ordering of observed $p_j$: $\{r_j\}$ such that $p_{r_1} \le p_{r_2} \le \ldots \le p_{r_m}$

Denote corresponding RVs by upper case letters (T, P)

Control of the type I errors

Bonferroni single-step adjusted p-values

$p_j^* = \min(m p_j, 1)$

Sidak single-step (SS) adjusted p-values

$p_j^* = 1 - (1 - p_j)^m$

Sidak free step-down (SD) adjusted p-values

$p_{(j)}^* = 1 - (1 - p_{(j)})^{m - j + 1}$
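The three adjustments on this slide are easy to compute directly. In the sketch below the function names and example p-values are hypothetical. The step-down version also takes a running maximum over the ordered p-values so that the adjusted values stay monotone; this is the usual convention for step-down procedures but an addition to the formula as written on the slide.

```python
def bonferroni(pvalues):
    """Single-step Bonferroni: p* = min(m * p, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def sidak_single_step(pvalues):
    """Single-step Sidak: p* = 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def sidak_step_down(pvalues):
    """Step-down Sidak: the j-th smallest p-value uses exponent m - j + 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):  # rank 0 is the smallest p-value
        value = 1.0 - (1.0 - pvalues[j]) ** (m - rank)
        running_max = max(running_max, value)  # monotonicity (assumed convention)
        adjusted[j] = running_max
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(pvals))
print(sidak_single_step(pvals))
print(sidak_step_down(pvals))
```

For these values the step-down adjustment is never larger than the single-step one, which illustrates why step-down procedures are less conservative.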


Holm (1979) step-down adjusted p-values

$p_{r_j}^* = \max_{k = 1, \ldots, j} \{\min((m - k + 1)\, p_{r_k}, 1)\}$

Intuitive explanation: once H(1) is rejected by Bonferroni, there are only m-1 remaining hypotheses that might still be true (then another Bonferroni, etc.)

Hochberg (1988) step-up adjusted p-values (Simes inequality)

$p_{r_j}^* = \min_{k = j, \ldots, m} \{\min((m - k + 1)\, p_{r_k}, 1)\}$
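Holm and Hochberg use the same multipliers (m - k + 1) and differ only in the direction of the sweep: Holm takes a running maximum from the smallest p-value upward, Hochberg a running minimum from the largest p-value downward. A sketch with hypothetical function names and example p-values:

```python
def holm(pvalues):
    """Step-down: running maximum of min((m - k + 1) * p_(k), 1), smallest p first."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):  # rank = k - 1
        running_max = max(running_max, min((m - rank) * pvalues[j], 1.0))
        adjusted[j] = running_max
    return adjusted


def hochberg(pvalues):
    """Step-up: running minimum of min((m - k + 1) * p_(k), 1), largest p first."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        j = order[rank]
        running_min = min(running_min, min((m - rank) * pvalues[j], 1.0))
        adjusted[j] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
print(holm(pvals))      # approximately [0.04, 0.09, 0.09, 0.5]
print(hochberg(pvals))  # approximately [0.04, 0.08, 0.08, 0.5]
```

Hochberg's adjusted p-values are never larger than Holm's, but its validity rests on the Simes inequality, which requires conditions on the dependence between test statistics that Holm's procedure does not need.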


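Westfall & Young (1993) step-down minP adjusted p-values

$p_{r_j}^* = \max_{k = 1, \ldots, j} \{\Pr(\min_{l \in \{r_k, \ldots, r_m\}} P_l \le p_{r_k} \mid H_0^C)\}$

Westfall & Young (1993) step-down maxT adjusted p-values

$p_{r_j}^* = \max_{k = 1, \ldots, j} \{\Pr(\max_{l \in \{r_k, \ldots, r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C)\}$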

Westfall & Young (1993) Adjusted p-values

Step-down procedures: successively smaller adjustments at each step

Take into account the joint distribution of the test statistics

Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values

Can be estimated by resampling but computer-intensive (especially for minP)
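A resampling estimate of the step-down maxT adjusted p-values can be sketched as follows. Several choices are assumptions made for illustration: the test statistic is a two-sample Welch t-statistic, the complete null hypothesis is simulated by randomly shuffling the sample labels, and the data are synthetic. Function names are hypothetical.

```python
import math
import random


def t_statistic(values, labels):
    """Welch two-sample t-statistic for one gene; labels are 0 or 1."""
    g0 = [v for v, l in zip(values, labels) if l == 0]
    g1 = [v for v, l in zip(values, labels) if l == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    v0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
    v1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
    se = math.sqrt(v0 / len(g0) + v1 / len(g1))
    if se == 0:
        return 0.0 if m1 == m0 else math.inf
    return (m1 - m0) / se


def max_t_adjusted(data, labels, n_perm=1000, seed=0):
    """Step-down maxT adjusted p-values estimated by label permutation."""
    rng = random.Random(seed)
    m = len(data)
    observed = [abs(t_statistic(row, labels)) for row in data]
    order = sorted(range(m), key=lambda j: -observed[j])  # |t| in decreasing order
    counts = [0] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        permuted = [abs(t_statistic(row, perm)) for row in data]
        # Successive maxima over the genes r_k, ..., r_m, built from the bottom up.
        running_max = 0.0
        for k in range(m - 1, -1, -1):
            running_max = max(running_max, permuted[order[k]])
            if running_max >= observed[order[k]]:
                counts[k] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for k in range(m):  # enforce monotonicity (the max over k = 1..j)
        previous = max(previous, counts[k] / n_perm)
        adjusted[order[k]] = previous
    return adjusted


# Synthetic data: gene 0 differs between groups, genes 1 and 2 do not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
    [2.0, 2.5, 1.8, 2.2, 2.1, 2.4, 1.9, 2.3],
    [3.1, 2.9, 3.3, 3.0, 3.2, 2.8, 3.1, 3.0],
]
print(max_t_adjusted(data, labels))
```

Because every permutation recomputes the statistic for every gene, the cost grows with the number of genes times the number of permutations. With only four samples per group there are just 70 distinct label splits, so the smallest attainable adjusted p-value is limited by the design, not by the number of permutations.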
