Post on 20-Dec-2015
Gene Array Analysis
Statistical genetics - Class 10
Gene array description
Normalization
Data Analysis
Multiple measurements
What is a gene array
Gene arrays are solid supports upon which a collection of gene-specific nucleic acids has been placed at defined locations, either by spotting or direct synthesis.
In array analysis, a nucleic acid-containing sample is labeled and then allowed to hybridize with the gene-specific targets on the array.
Based on the amount of probe hybridized to each target spot, information is gained about the specific nucleic acid composition of the sample.
The major advantage of gene arrays is that they can provide information on thousands of targets in a single experiment.
Nomenclature
Many terms exist for naming gene arrays, including: biochip, DNA chip, GeneChip (a registered trademark of Affymetrix, Inc.), DNA array, microarray, and macroarray.
Microarray and macroarray may be used to differentiate between spot size or the number of spots on the support.
Glass Support
Experiment
A typical gene array experiment involves:
1. Isolating RNA from the samples to be compared
2. Converting the RNA samples to labeled cDNA via reverse transcription; this step may be combined with aRNA amplification
3. Hybridizing the labeled cDNA to identical membrane or glass slide arrays
4. Removing the unhybridized cDNA
5. Detecting and quantitating the hybridized cDNA
6. Comparing the quantitative data from the various samples
General Picture
Choosing Cell Populations
The goal of comparative cDNA hybridization is to compare gene transcription in two or more different kinds of cells. For example:
Tissue-specific Genes - Cells from two different tissues (say, cardiac muscle and prostate epithelium) are specialized for performing different functions in an organism. Although we can recognize cells from different tissues by their phenotypes, it is not known just what makes one cell function as smooth muscle, another as a neuron, and still another as prostate.
Ultimately, a cell's role is determined by the proteins it produces, which in turn depend on its expressed genes. Comparative hybridization experiments can reveal genes which are preferentially expressed in specific tissues.
Choosing Cell Populations
Genetic disease is often caused by genes which are inappropriately transcribed -- either too much or too little -- or which are missing altogether.
Such defects are especially common in cancers, which can occur when regulatory genes are deleted, inactivated, or become constitutively active.
Unlike some genetic diseases (e.g. cystic fibrosis) in which a single defective gene is always responsible, cancers which appear clinically similar can be genetically heterogeneous.
For example, prostate cancer (prostatic adenocarcinoma) may be caused by several different, independent regulatory gene defects even in a single patient.
Choosing Cell Populations
Cell Cycle Variations - Cells undergo DNA replication, mitosis, and eventually death. These activities require quite different gene products, such as DNA polymerases for genome replication or microtubule spindle proteins for mitosis. A cell's genes encode the "programs" for these activities, and gene transcription is required to execute those programs. Comparative hybridization can be used to distinguish genes that are expressed at different times in the cell cycle. In this way, the pathways responsible for controlling basic life processes can be uncovered.
mRNA Extraction
Genes which code for protein are transcribed into messenger RNA's (mRNA's) in the cell nucleus. The mRNA's in turn are translated into proteins by ribosomes in the cytoplasm. The transcription level of a gene is taken to be the amount of its corresponding mRNA present in the cell. Comparative hybridization experiments compare the amounts of many different mRNA's in two cell populations.
To prepare mRNA for use in a microarray assay, it must be purified from total cellular contents. mRNA accounts for only about 3% of all RNA in a cell.
Common mRNA isolation methods take advantage of the fact that most mRNA's have a poly-adenine (poly(A)) tail. These poly(A)+ mRNA's can be purified by capturing them using complementary oligodeoxythymidine (oligo(dT)) molecules bound to a solid support.
Reverse transcription
Captured mRNA's are still difficult to work with because they are prone to being destroyed.
The environment is full of RNA-digesting enzymes, so free RNA is quickly degraded. To prevent the experimental samples from being lost, they are reverse-transcribed back into more stable DNA form. The products of this reaction are called complementary DNA's (cDNA's) because their sequences are the complements of the original mRNA sequences.
A problem with cDNA production is that not all mRNA's are reverse-transcribed with the same efficiency. This fact leads to reverse transcription bias, which can change the relative amounts of different cDNA's measured by the microarray assay.
Reverse transcription bias is not a problem when comparing the same mRNA across two cell populations unless it prevents the mRNA from being reverse-transcribed at all.
However, the bias does prohibit quantitative comparison between different mRNA's on one array.
Fluorescent labeling of cDNA's
In order to detect cDNA's bound to the microarray, we must label them with a reporter molecule that identifies their presence. The reporters currently used in comparative hybridization to microarrays are fluorescent dyes (fluors).
A differently-colored fluor is used for each sample so that we can tell the two samples apart on the array. The labeled cDNA samples are called probes because they are used to probe the collection of spots on the array.
Fluors do not show their colors unless stimulated with a specific frequency of light by a laser. Even then, the colors are not directly observed; rather, the wavelength of the emitted light is used to tune a detector which measures the fluorescence.
Normalization
The number of fluor molecules which label each cDNA depends on its length and possibly its sequence composition, both of which are often unknown.
This is one more reason that fluorescent intensities for different cDNA's cannot be quantitatively compared. However, identical cDNA's from the two probes are still comparable as long as the same number of label molecules are added to the same DNA sequence in each probe.
To equalize the total concentrations of the two cDNA probes before applying them to the array, the probe solutions are diluted to have the same overall fluorescent intensity.
This procedure makes two possibly unjustified assumptions:
1. that the total amount of mRNA in each cell type being tested is identical
2. that each fluor emits the same amount of light relative to its concentration.
Hybridization to a DNA Microarray
The two cDNA probes are tested by hybridizing them to a DNA microarray.
The array holds hundreds or thousands of spots, each of which contains a different DNA sequence.
In this way, every spot on an array is an independent assay for the presence of a different cDNA. There is enough DNA on each spot that both probes can hybridize to it at once without interference.
Microarrays are made from a collection of purified DNA's. A drop of each type of DNA in solution is placed onto a specially-prepared glass microscope slide by an arraying machine. The arraying machine can quickly produce a regular grid of thousands of spots in a square about 2 cm on a side.
Scanning the Hybridized Array
Once the cDNA probes have been hybridized to the array and any loose probe has been washed off, the array must be scanned to determine how much of each probe is bound to each spot.
The probes are tagged with fluorescent reporter molecules which emit detectable light when stimulated by a laser.
The emitted light is captured by a detector, usually a charge-coupled device (CCD).
Spots with more bound probe will have more reporters and will therefore fluoresce more intensely.
The scanner also records light from a few molecules that hybridized either to the wrong spot or nonspecifically to the glass slide. This extra light becomes the background of the scanned array image.
Affymetrix arrays
• 10^7 copies per oligo in a 24 x 24 µm square
• Use 20 pairs of different 25-mers per gene
• Perfect match and mismatch
Values for genes on one array
Two color arrays: ratio of “red” to “green” intensities. Usually expressed as log(R/G).
One-color arrays (Affymetrix): “signal” – just a relative measure of expression of the gene.
Either way we have a number that represents the expression level of the gene.
To keep things simple, for two-color arrays a common reference sample R is often used for all arrays.
A/R B/R C/R D/R E/R etc. Equivalent for one color arrays: A B C D E
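A short sketch of why a common reference keeps things simple: on the log scale, the reference cancels when two samples are compared. The base-2 logarithm is an assumption here (the slides write only log(R/G), though the plot axes elsewhere use log2), and the function name and intensities are made up for illustration.

```python
import math

def log_ratio(sample, reference):
    """Log2 ratio of a sample intensity to a reference intensity."""
    return math.log2(sample / reference)

# Hypothetical intensities for samples A and B, each hybridized
# against the same reference sample R on its own array.
a_vs_r = log_ratio(400.0, 100.0)   # log2(A/R)
b_vs_r = log_ratio(200.0, 100.0)   # log2(B/R)

# The reference cancels: log2(A/R) - log2(B/R) = log2(A/B).
a_vs_b = a_vs_r - b_vs_r
print(a_vs_r, b_vs_r, a_vs_b)
```

A two-fold increase and a two-fold decrease come out as +1 and -1, which is one reason the log ratio is preferred over the raw ratio.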
Data Analysis
Normalization
Detection of outliers
Clustering
Multiple measurements
False color images of spotted array
Overlay of two scans of the slide
Compares the two samples
Green = less relative expression
Red = more relative expression
Yellow = equal expression
Dimmer colors = lower expression levels.
Normalizing two-color arrays
Due to imbalances in dye labeling, the signals for the two colors are rarely “balanced”.
There are many other sources of non-biological systematic error.
[Figure: array images before and after normalization]
Normalization
[Figure: Cy5 signal (log2) plotted against Cy3 signal (log2)]
Normalization by iterative linear regression
• fit a line (y=mx+b) to the data set
• set aside outliers (residuals > 2 x s.e.)
• repeat until r^2 changes by < 0.001
• then apply slope and intercept to the original dataset
D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
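The iterative regression steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not the cited authors' implementation: "s.e." is read as the residual standard error, "apply slope and intercept" is read as rescaling the Cy5 values so that the fitted line becomes y = x, and all function names are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = m*x + b; returns (m, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return m, b, r2

def iterative_regression(cy3_log, cy5_log, tol=0.001, max_iter=50):
    """Refit after setting aside outliers until r^2 changes by less than tol."""
    x, y = list(cy3_log), list(cy5_log)
    m, b, r2 = fit_line(x, y)
    for _ in range(max_iter):
        residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
        # Assumption: "s.e." is the residual standard error of the fit.
        se = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
        kept = [(xi, yi) for xi, yi, r in zip(x, y, residuals) if abs(r) <= 2 * se]
        if len(kept) < 3 or len(kept) == len(x):
            break  # nothing left to set aside, or too few points to refit
        x = [p[0] for p in kept]
        y = [p[1] for p in kept]
        new_m, new_b, new_r2 = fit_line(x, y)
        converged = abs(new_r2 - r2) < tol
        m, b, r2 = new_m, new_b, new_r2
        if converged:
            break
    return m, b

def normalize_cy5(cy3_log, cy5_log):
    """Apply the final slope and intercept to every original Cy5 value."""
    m, b = iterative_regression(cy3_log, cy5_log)
    return [(yi - b) / m for yi in cy5_log]
```

The outliers are excluded only while estimating the line; the final correction is applied to all spots, including the ones that were set aside.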
Normalization (Linear)
[Figures: two plots of Cy5 signal (log2) against Cy3 signal (log2)]
Normalization (Curvilinear)
[Figure: ratio {log2 (Cy5 / Cy3)} plotted against average signal {log2 (Cy3 + Cy5)/2}, with a loess function fit line and a reference line at 0]
G Tseng et al., NAR 2001
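A rough sketch of the curvilinear idea: fit a smooth curve to the log ratio as a function of average signal, then subtract it so the ratios center on 0. The smoother below is a simplified local linear regression with tricube weights, written for illustration only. It is not the procedure of the cited paper, it omits the robustness iterations of a full loess, and the function names and the `frac` parameter are assumptions.

```python
import math

def local_linear_fit(x, y, frac=0.4):
    """Tricube-weighted local linear regression, evaluated at each x."""
    n = len(x)
    k = max(3, math.ceil(frac * n))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        bandwidth = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [(1.0 - min(abs(xi - x0) / bandwidth, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted

def curve_normalize(cy3, cy5, frac=0.4):
    """Return log2(Cy5/Cy3) with the intensity-dependent trend removed."""
    average = [math.log2((g + r) / 2.0) for g, r in zip(cy3, cy5)]
    ratio = [math.log2(r / g) for g, r in zip(cy3, cy5)]
    trend = local_linear_fit(average, ratio, frac)
    return [m - t for m, t in zip(ratio, trend)]
```

Unlike the linear fit, this correction can change with intensity, which matters when the dye imbalance is stronger for dim spots than for bright ones.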
Image Analysis
2 images per array
Super-imposing
Grid on image
Clone Id    Ratio
1           1.5
2           0.8
…           …
Gene Ratios
Gene expression levels determined by intrinsic properties of each gene
[Figure: Gene A and Gene B placed on an axis running from low to high expression level]
Statistical Analysis
Are differences in ratios due to random variation or to meaningful changes?
Hypothesis testing, with H0: no systematic differences between ratios
Most Basic Statistical Analysis
Assumptions:
• 'red' and 'green' intensities at a given gene are independent and identically normally distributed (i.i.N.d.) with common variance
• constant coefficient of variation over the whole gene set
Statistical Analysis
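with Tk = Rk / Gk, the density of the ratio is
$$f_{T_k}(t) = \frac{(1+t)\sqrt{1+t^2}}{c\,(1+t^2)^2\sqrt{2\pi}} \exp\left[-\frac{(t-1)^2}{2c^2(1+t^2)}\right]$$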
with c: coefficient of variation, estimated from data
According to Chen et al. 1997 (J Biomedical Optics, 2(4):364)
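To make the ratio density concrete, the sketch below evaluates it and finds the two cut-offs that leave probability alpha/2 in each tail, using simple numerical integration and bisection. The value c = 0.2, the integration settings, and all function names are illustrative assumptions; in practice c is estimated from the data.

```python
import math

def ratio_density(t, c):
    """Density of T = R/G under H0, for coefficient of variation c."""
    s = 1.0 + t * t
    prefactor = (1.0 + t) * math.sqrt(s) / (c * s * s * math.sqrt(2.0 * math.pi))
    return prefactor * math.exp(-((t - 1.0) ** 2) / (2.0 * c * c * s))

def ratio_cdf(t, c, steps=2000):
    """Trapezoid-rule integral of the density from 0 to t."""
    if t <= 0.0:
        return 0.0
    h = t / steps
    total = 0.5 * (ratio_density(0.0, c) + ratio_density(t, c))
    for i in range(1, steps):
        total += ratio_density(i * h, c)
    return total * h

def ratio_quantile(p, c, upper=50.0):
    """Bisection search for the ratio t with CDF(t) = p."""
    lo, hi = 0.0, upper
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if ratio_cdf(mid, c) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ratio_thresholds(c, alpha=0.05):
    """Lower and upper cut-offs leaving alpha/2 in each tail."""
    return ratio_quantile(alpha / 2.0, c), ratio_quantile(1.0 - alpha / 2.0, c)

def classify(ratio, low, high):
    """Assign a gene to one of three classes from its ratio."""
    if ratio < low:
        return "under-expressed"
    if ratio > high:
        return "over-expressed"
    return "unchanged"

low, high = ratio_thresholds(c=0.2, alpha=0.05)
print(low, high, classify(2.5, low, high))
```

The two cut-offs are reciprocals of each other, because the density treats a ratio t and its inverse 1/t symmetrically.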
Statistical Analysis
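Classification with hypothesis testing
[Figure: density of the ratio under H0 with a rejection region of area α/2 in each tail; lower tail = under-expressed, upper tail = over-expressed]
3 classes of genes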
Average Difference
Fold Change Graphs
How many times did the expression of this gene change in the treated tissue versus the control?
• comparison analysis requires experiment vs control
• does not apply to absolute analysis
• parameter value in one vs another
• Avg diff (perfect match vs mismatch)
Fold Change of Average Difference
Noise and Repeats
>90% 2 to 3 fold
Multiplicative noise
Repeat experiments
Log scale: dist(4,2)=dist(2,1)
log – log plot
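A tiny illustration of the log-scale point: when noise is multiplicative, a two-fold change should count the same whether it is from 1 to 2 or from 2 to 4. The function names and the choice of base 2 are assumptions made for the example.

```python
import math

def linear_dist(a, b):
    """Distance between two expression values on the raw scale."""
    return abs(a - b)

def log_dist(a, b):
    """Distance between two expression values on the log2 scale."""
    return abs(math.log2(a) - math.log2(b))

# On the raw scale the two-fold steps look different: 2 versus 1.
print(linear_dist(4, 2), linear_dist(2, 1))
# On the log scale both are exactly one doubling apart.
print(log_dist(4, 2), log_dist(2, 1))
```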
Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.
Goal B: Divide conditions to groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.
Unsupervised Analysis
Clustering Methods
[Figure: K-means clustering, iteration = 3]
• Start with random positions of K centroids.
• Iterate until centroids are stable:
  • Assign points to centroids
  • Move centroids to the center of the assigned points
Centroid Methods - K-means
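The K-means loop can be sketched in a few lines of plain Python. Two choices here are assumptions rather than something the slides specify: the random starting centroids are drawn from the data points themselves, and "stable" is detected by the assignments no longer changing. The function name and parameters are hypothetical.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (sequences of floats) into k groups."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments, and therefore centroids, are stable
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

profiles = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(profiles, k=2)
print(labels)
```

For expression data each point would be a gene's profile across conditions, for example its values at successive time points.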
Time course experiment
Application of K-means to time course experiments
Agglomerative Hierarchical Clustering
Results depend on distance update method
• Single linkage: elongated clusters
• Complete linkage: sphere-like clusters
Greedy iterative process
Not robust against noise
No inherent measure to choose the clusters
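The effect of the distance update method can be seen in a small sketch. The naive implementation below is for illustration only (it recomputes all pairwise distances at every step), and the function names and example points are made up.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, n_clusters, linkage="single"):
    """Greedy bottom-up clustering; returns clusters as lists of point indices."""
    # Single linkage uses the closest pair between clusters, complete the farthest.
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Points along a line with slowly growing gaps.
chain = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
single = agglomerative(chain, 2, linkage="single")
complete = agglomerative(chain, 2, linkage="complete")
print(sorted(len(c) for c in single), sorted(len(c) for c in complete))
```

On this chain, single linkage keeps absorbing the next neighbour and ends with one elongated cluster of five points plus a singleton, while complete linkage produces a more compact split of four and two.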
Gene Expression Data
Cluster genes and conditions
2 independent clusterings:
• Genes are represented as vectors of expression in all conditions
• Conditions are represented as vectors of expression of all genes
[Figure: heat map of the colon cancer data (normalized genes); x-axis: Experiments, y-axis: Genes]
1. Identify tissue classes (tumor/normal)
First clustering - Experiments
2. Find Differentiating And Correlated Genes
Second Clustering - Genes
[Figure labels: Ribosomal proteins, Cytochrome C, HLA2, metabolism]
Two-way Clustering
Coupled Two-way Clustering (CTWC)
Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.
New Goal: Use subsets of genes to study subsets of samples (and vice versa).
A non-trivial task: exponential number of subsets.
CTWC is a heuristic to solve this problem.
CTWC of Colon Cancer Data
[Figure: CTWC results for the colon cancer data, with regions labeled A and B and panels (A) and (B)]
Multiple Testing Problem
Simultaneously test m null hypotheses, one for each gene j
Hj: no association between expression measure of gene j and the response
Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue
Increased chance of false positives
Hypothesis Truth Vs. Decision
Truth \ Decision    # not rejected       # rejected          totals
# true H            U                    V (Type I error)    m0
# non-true H        T (Type II error)    S                   m1
totals              m - R                R                   m
Strong Vs. Weak Control
All probabilities are conditional on which hypotheses are true
Strong control refers to control of the Type I error rate under any combination of true and false nulls
Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)
In general, weak control without other safeguards is unsatisfactory
Adjusted p-values (p*)
Test level (e.g. 0.05) does not need to be determined in advance
Some procedures most easily described in terms of their adjusted p-values
Usually easily estimated using resampling
Procedures can be readily compared based on the corresponding adjusted p-values
A Little Notation
For hypothesis Hj, j = 1, …, m
observed test statistic: tj
observed unadjusted p-value: pj
Ordering of observed (absolute) tj: {rj} such that $|t_{r_1}| \ge |t_{r_2}| \ge \dots \ge |t_{r_m}|$
Ordering of observed pj: {rj} such that $p_{r_1} \le p_{r_2} \le \dots \le p_{r_m}$
Denote corresponding RVs by upper case letters (T, P)
Control of the type I errors
Bonferroni single-step adjusted p-values
$p_j^* = \min(m\,p_j,\ 1)$
Sidak single-step (SS) adjusted p-values
$p_j^* = 1 - (1 - p_j)^m$
Sidak free step-down (SD) adjusted p-values
$p_{r_j}^* = \max_{k=1,\dots,j} \{ 1 - (1 - p_{r_k})^{m-k+1} \}$
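The three adjustments can be written directly from the formulas. This is a minimal sketch with hypothetical function names; the step-down version takes a running maximum over the ordered p-values so that the adjusted values never decrease along the ordering.

```python
def bonferroni(pvals):
    """Single-step Bonferroni adjusted p-values: min(m * p, 1)."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]

def sidak_single_step(pvals):
    """Single-step Sidak adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]

def sidak_step_down(pvals):
    """Step-down Sidak adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])  # indices r_1, ..., r_m
    adjusted = [0.0] * m
    running_max = 0.0
    for k, j in enumerate(order):
        # k is 0-based, so the exponent m - k equals m - k + 1 for 1-based k.
        running_max = max(running_max, 1.0 - (1.0 - pvals[j]) ** (m - k))
        adjusted[j] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]
print(bonferroni(raw))
print(sidak_single_step(raw))
print(sidak_step_down(raw))
```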
Control of the type I errors
Holm (1979) step-down adjusted p-values
$p_{r_j}^* = \max_{k=1,\dots,j} \{ \min((m-k+1)\,p_{r_k},\ 1) \}$
Intuitive explanation: once H(1) is rejected by Bonferroni, there are only m-1 remaining hypotheses that might still be true (then another Bonferroni, etc.)
Hochberg (1988) step-up adjusted p-values (Simes inequality)
$p_{r_j}^* = \min_{k=j,\dots,m} \{ \min((m-k+1)\,p_{r_k},\ 1) \}$
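Holm and Hochberg use the same per-rank multipliers and differ only in the direction of the running extreme: Holm takes a running maximum from the smallest p-value upward, Hochberg a running minimum from the largest downward. The sketch below uses hypothetical function names and returns results in the input order.

```python
def holm(pvals):
    """Holm step-down adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for k, j in enumerate(order):
        # 0-based k: the multiplier m - k equals m - k + 1 for 1-based k.
        running_max = max(running_max, min((m - k) * pvals[j], 1.0))
        adjusted[j] = running_max
    return adjusted

def hochberg(pvals):
    """Hochberg step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    adjusted = [0.0] * m
    running_min = 1.0
    for k in range(m - 1, -1, -1):  # start from the largest p-value
        j = order[k]
        running_min = min(running_min, min((m - k) * pvals[j], 1.0))
        adjusted[j] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03]
print(holm(raw))
print(hochberg(raw))
```

On the same input the Hochberg values are never larger than the Holm values, which reflects the step-up procedure being the less conservative of the two.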
Control of the type I errors
Westfall & Young (1993) step-down minP adjusted p-values
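$p_{r_j}^* = \max_{k=1,\dots,j} \{ \Pr( \min_{l \in \{r_k,\dots,r_m\}} P_l \le p_{r_k} \mid H_0^C ) \}$
Westfall & Young (1993) step-down maxT adjusted p-values
$p_{r_j}^* = \max_{k=1,\dots,j} \{ \Pr( \max_{l \in \{r_k,\dots,r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C ) \}$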
Westfall & Young (1993) Adjusted p-values
Step-down procedures: successively smaller adjustments at each step
Take into account the joint distribution of the test statistics
Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values
Can be estimated by resampling but computer-intensive (especially for minP)
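A resampling estimate of the step-down maxT adjusted p-values can be sketched as follows. Several choices are assumptions for illustration and not part of the slides: a two-group design, a Welch t-statistic, random label permutations to approximate the complete null, and all function and parameter names.

```python
import math
import random
import statistics

def welch_t(x, y):
    """Welch two-sample t-statistic (0.0 if both groups have zero variance)."""
    denom = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / denom if denom > 0 else 0.0

def abs_t_stats(data, labels):
    """Absolute t-statistic of every gene for a given 0/1 labeling of samples."""
    stats = []
    for row in data:
        x = [v for v, g in zip(row, labels) if g == 0]
        y = [v for v, g in zip(row, labels) if g == 1]
        stats.append(abs(welch_t(x, y)))
    return stats

def maxt_adjusted(data, labels, n_perm=500, seed=0):
    """Step-down maxT adjusted p-values estimated by permuting sample labels."""
    rng = random.Random(seed)
    m = len(data)
    observed = abs_t_stats(data, labels)
    order = sorted(range(m), key=lambda j: -observed[j])  # largest |t| first
    counts = [0] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        t_perm = abs_t_stats(data, perm)
        running = 0.0
        # Successive maxima over the genes ranked k, ..., m.
        for k in range(m - 1, -1, -1):
            running = max(running, t_perm[order[k]])
            if running >= observed[order[k]]:
                counts[k] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for k in range(m):
        previous = max(previous, counts[k] / n_perm)  # enforce monotonicity
        adjusted[order[k]] = previous
    return adjusted

# Hypothetical data: 3 genes x 8 samples; only the first gene differs by group.
expr = [
    [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0],
    [2.0, 2.3, 1.8, 2.1, 2.2, 1.9, 2.0, 2.4],
    [3.1, 2.9, 3.0, 3.2, 3.0, 3.1, 2.8, 3.2],
]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
adj = maxt_adjusted(expr, groups)
print(adj)
```

Because whole sample labels are permuted, every gene is recomputed under the same relabeling, which is how the joint distribution of the test statistics enters the adjustment.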