![Page 1: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/1.jpg)
Statistical Methods for Microarrays
Christina Kendziorski
Landon Sego
Department of Biostatistics and Medical Informatics
University of Wisconsin-Madison
BASIC BIOLOGY
Introduction to Basic Biology and Microarray Experiments
• What is a DNA microarray measuring?
Gene expression.
• The novelty of a microarray is that it quantifies the abundance of thousands of genes simultaneously—which gives biologists a global perspective.
• The biological processes that give rise to microarray data can be viewed as information transfer processes.
• Data collection for microarray experiments is not a trivial task and requires imaging technology and image processing tools.
Nguyen, et al. 2002
Review of DNA molecule
Nguyen, et al. 2002
The Central Dogma of Molecular Biology
Terminology
• Amino acid: The basic building block of proteins (or polypeptides)
• mRNA: Messenger RNA is an RNA strand complementary to a DNA template
• Transcription: The process where the DNA template is copied/transcribed to mRNA
• Gene expression: A gene is expressed if its DNA has been transcribed to RNA; gene expression is the level of transcription of the DNA of the gene
• RT: Reverse transcription is an experimental procedure to synthesize a DNA strand (cDNA) which is complementary to an mRNA template
Nguyen, et al. 2002
Terminology
• cDNA/cRNA: Complementary DNA is synthesized from mRNA during RT; similarly, in the context of oligo arrays, complementary RNA is RNA synthesized during in vitro transcription
• dNTP: Deoxyribonucleoside triphosphate; denotes any of dATP, dCTP, dGTP, or dTTP (labeling reactions may also use dUTP); molecular building blocks for making DNA in RT, PCR, or in vitro replication. Free dNTPs in solution (which are not yet incorporated into the nucleic acid strand) have three phosphates, which provide the necessary energy for cDNA synthesis
Nguyen, et al. 2002
Terminology
• Primer: A short, single strand of RNA or DNA that can initiate chain growth from a template
• Oligo(dT): Primer with sequence TTTT… used to initiate cDNA synthesis during RT
• Reverse transcriptase: An enzyme that catalyzes the synthesis of cDNA during RT
• Poly(A) tail: A sequence of A's (AAA…) at the 3' end of mRNA; oligo(dT) is used in RT to recognize mRNA by its poly(A) tail
• Target cDNAs: Mixture of cDNAs obtained from the experimental and reference mRNAs
Nguyen, et al. 2002
Terminology
• Probe cDNAs: Immobilized cDNA printed on the array
• Hybridization: Process of bringing the target and probe into contact for binding in microarrays; also refers to the binding of two DNA strands generally
• PCR: Polymerase chain reaction is a procedure to amplify a segment of DNA (mass-replication of a segment of DNA)
• Oligonucleotide: A short fragment of DNA (usually in single-stranded form) which is often chemically synthesized. It can be used as a probe or primer. 'Oligo' is Greek for 'few'.
Nguyen, et al. 2002
The Central Dogma of Molecular Biology
Basic model for gene expression
• Two different levels of gene expression:
Transcription level—where RNA is made from DNA.
Translation level—where protein is made from mRNA.
• Microarrays measure gene expression at the transcription level.
Nguyen, et al. 2002
DNA → (transcription) → mRNA → (translation) → amino acid → protein → cell phenotype → organism phenotype
Gene expression and abundance
• Quantification of mRNA abundances corresponds to quantification of the amount of gene expression
• A gene is expressed if its DNA has been transcribed to RNA
• A “high level of expression” would imply transcription has occurred many times and there are many copies of mRNA in the tissue.
• A “low level of expression” implies fewer copies of mRNA
Nguyen, et al. 2002
DNA transcription
• DNA transcription is the information transfer process directly relevant to DNA microarray experiments because quantification of the type and amount of this copied information is the goal of the microarray experiment.
• Transcription occurs in 3 stages: initiation, elongation, and termination.
• After transcription, the mRNA is further processed by removing non-coding segments, called introns.
Nguyen, et al. 2002
DNA transcription - Initiation
http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/Transcription.02.GIF
• Promoter regions on the DNA chain provide the signal for the initiation of transcription. Promoter regions recruit an enzyme (protein) called RNA polymerase II to the transcription initiation site.
DNA transcription - Elongation
http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/Transcription.02.GIF
• During elongation, the RNA polymerase moves along the DNA and extends the RNA chain by adding free nucleotides with base A, G, C, or U to match the T, C, G, or A nucleotides of the DNA template strand, respectively.
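The base-pairing rule in the elongation step can be written down as a small lookup table. The following is a minimal sketch, assuming the template strand is given as a plain string read in the direction the polymerase travels; the function name `transcribe` and the example sequence are illustrative and do not come from the slides.

```python
# Pairing rule from the slide: template T, C, G, A -> RNA A, G, C, U.
PAIRING = {"T": "A", "C": "G", "G": "C", "A": "U"}


def transcribe(template: str) -> str:
    """Return the RNA chain built base by base along a DNA template strand."""
    return "".join(PAIRING[base] for base in template.upper())


# Hypothetical six-base template, used only to show the mapping.
rna = transcribe("TACGGT")
print(rna)  # AUGCCA
```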
DNA transcription – Termination and processing
• When the RNA polymerase reaches the template strand signal for termination, the newly synthesized RNA is released from the DNA template.
• Before the message is transported to the cytoplasm, some important posttranscriptional processing occurs.
• For example, a sequence of A’s is added to the RNA strand at the 3' end. This sequence of A’s is called the poly(A) tail.
• Non-coding regions of the mRNA (called introns) are removed in a process called splicing.
Nguyen, et al. 2002
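The two processing steps named above, splicing and addition of the poly(A) tail, can be sketched on a string. This is a toy illustration only: the intron coordinates, the tail length, and the function name `process_transcript` are assumptions made for the example, and the order in which the two steps are applied here is chosen for convenience.

```python
def process_transcript(pre_mrna, introns, tail_length=20):
    """Remove intron intervals (0-based, half-open) and append a poly(A) tail."""
    exons = []
    cursor = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[cursor:start])  # keep the exon before this intron
        cursor = end                          # skip over the intron
    exons.append(pre_mrna[cursor:])           # keep the final exon
    return "".join(exons) + "A" * tail_length


# Hypothetical transcript with one intron occupying positions 3..8.
mature = process_transcript("AUGGUAAGUCCAUAG", introns=[(3, 9)], tail_length=5)
print(mature)  # AUGCCAUAGAAAAA
```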
DNA transcription and RNA processing
Nguyen, et al. 2002
One gene ≠ One protein
• The relationship between protein and mRNA is not one to one, so the simplified model shown on the previous slide is only an approximation.
• Exon: Coding region of DNA
• Intron: Non-coding region
[Figure: template DNA strand with Gene 1 and Gene 2 marked]
Success Story: KLK3 (PSA Gene)
1 gaattccacattgtttgctgcacgttggattttgaaatgctagggaactttgggagactc
61 atatttctgggctagaggatctgtggaccacaagatctttttatgatgacagtagcaatg
...
421 gagctacaagggcctggtgcatccagggtgatctagtaattgc agaacagcaagtgct ag
481 ctctccctccccttccacagctctgggtgtgggagggggttgtccagcctccagcagcat
541 ggggagggccttggtcagcctctgggtgccagcagggcaggggcggagtcctggggaatg
601 aaggttttatagggctcctgggggaggctccccagccccaagcttaccacctgcacccgg
661 agagctgtgtcaccatgtgggtcccggttgtcttcctcaccctgtccgtgacgtggattg
721 gtgagaggggccatggttggggggatgcaggagagggagccagccctgactgtcaagctg
781 aggctctttcccccccaacccagcaccccagcccagacagggagctgggctcttttctgt
...
6301 cctagagaaggctgtgagccaaggagggagggtcttcctttggcatgggatggggatgaa
6361 gtaaggagagggactggaccccctggaagctgattcactatggggggaggtgtattgaag
6421 tcctccagacaaccctcagatttgatgatttcctagtagaactcacagaaataaagagct
6481 cttatactgt
Success Story
Dhanasekaran et al., Nature, 2001
GETTING EXPRESSION MEASUREMENTS:
cDNA ARRAYS and AFFY CHIPS
cDNA Microarray Experimental Procedure: Overview
• Given a biological sample of cells, there are two possibilities: a set of genes is either expressed in the cells or it is not.
• cDNA arrays are designed to measure gene expression in the cells of an experimental sample relative to a reference (control) sample.
• In current practice, cDNAs from the experimental and reference samples are labeled with different fluorescent dyes, mixed, and hybridized onto the array.
• The measured fluorescence intensity for each sample is assumed to be proportional to transcript abundance (of course, this is conditional on factors such as spot characteristics, hybridization efficiency, level of dye incorporation, etc.).
Nguyen, et al. 2002
cDNA Array Data
[Figure: tissue samples at nominal levels 1 and 2 → mRNA → labelled cDNA → microarray → data x1, x2]
cDNA Microarray Experimental Procedure
1. Fabrication of array: preparing glass slide, selecting probe DNA sequences, and depositing (printing) the probe cDNA onto the slide.
2. Sample preparation: Isolating total RNA (mRNA and other RNAs) from experimental and reference samples of interest.
3. cDNA synthesis and labeling: making cDNAs from the experimental and reference samples and labeling each sample with a fluorescent dye.
4. Hybridization: applying experimental and reference cDNA mixture to the array, letting the target and probe cDNA bind, then washing off the excess.
5. Data collection: measurement of fluorescent intensities using a confocal microscope.
Nguyen, et al. 2002
Fabrication of the array
• What cDNA sequences (probes) should be printed on the array?
• Ideally, all genes would be printed. But in most cases we don’t know the sequences of all genes.
• cDNA libraries (GenBank, UniGene etc.) can be used to select the cDNA sequences.
• Sometimes cDNAs may be spotted (printed) onto the array without knowing what gene the cDNA corresponds to.
• Sometimes a piece of an expressed gene, known as an expressed sequence tag (EST), is spotted onto the array.
Nguyen, et al. 2002
Fabrication of the array
The MGuide, Version 2.0: the Brown Lab's complete guide to microarraying for the molecular biologist.
http://cmgm.stanford.edu/pbrown/mguide
Fabrication of the array
• The array itself is a pre-treated glass slide to which the cDNA probes will be attached.
• The selected cDNA sequences are amplified (mass-replicated) using PCR.
• After amplification, the solution containing the amplified cDNA probes is deposited on the array using a set of microspotting pins.
• Ideally, the amount of solution deposited by the pins should be uniform, but this is not completely achieved in practice:
  - Glass slides and/or their treatment are not uniform
  - There are pin effects
  - Spots are not uniform
Nguyen, et al. 2002
Fabrication of the array
• The drops of solution containing the cDNA probes form the spots on the array; each spot corresponds to a gene, an EST, or other sequence.
• As a product of PCR, the cDNA probes that are spotted onto the array are double stranded.
• When the target cDNA is applied to the array, the double stranded probes are denatured (separated) in a heating process to allow the target cDNA to bond to the probe strands.
Nguyen, et al. 2002
Sample preparation
• mRNA is extracted from tissue samples. For example:
  - one sample from tumor tissue (experimental sample)
  - one sample from normal tissue (reference sample)
• During the sample preparation process, operator differences and heterogeneous tissue can contribute significantly to the variability.
• However, these factors are not normally considered in microarray studies.
Nguyen, et al. 2002
Synthesis, labeling, and hybridization
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
Simplified summary of cDNA synthesis and labeling
• RNA is isolated from experimental and reference cell pools.
• Free nucleotides (dNTPs), oligo(dT), and reverse transcriptase are added to the solution of total RNA to initiate cDNA synthesis.
• Fluorescent dye molecules (labels) are incorporated into the cDNA.
• Typically, the cDNA from the experimental sample is dyed red (Cy5) and cDNA from the reference sample is dyed green (Cy3).
Nguyen, et al. 2002
Dye incorporation
• There are several methods for adding the labels to the cDNA:
  - Direct incorporation labeling method
  - Amino-modified (amino-allyl) nucleotide method
  - Primer tagging method
• The gene expression measurement is affected by the labeling method used.
Nguyen, et al. 2002
Direct Incorporation Method
• RNA is isolated from experimental and reference cell pools.
• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:
  1. Oligo(dT)
  2. Reverse transcriptase
  3. Free nucleotides: dATP, dCTP, dGTP, and dTTP
  4. Labeled uracil nucleotides: dUTPs with a dye molecule attached. Cy5-dUTP (red) is added to the experimental solution and Cy3-dUTP (green) is added to the reference solution.
• Experimental and reference solutions are mixed and hybridized onto the array.
Nguyen, et al. 2002
Amino Modified Nucleotide Method
• RNA is isolated from experimental and reference cell pools.
• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:
  1. Oligo(dT)
  2. Reverse transcriptase
  3. Free nucleotides: dATP, dCTP, dGTP, and dTTP
  4. Modified amino-allyl dUTPs are added to both experimental and reference solutions.
• After cDNA synthesis, Cy5 and Cy3 are added to the experimental and reference solutions, respectively. The dye couples with the amino-modified dUTP nucleotides.
• Experimental and reference solutions are mixed and hybridized onto the array.
Nguyen, et al. 2002
Primer Tagging Method
• RNA is isolated from experimental and reference cell pools.
• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:
  1. Reverse transcriptase
  2. Free (unlabeled and unmodified) nucleotides: dATP, dCTP, dGTP and dTTP
  3. Oligo(dT) primer with capture sequence TTTT---- for experimental sample and capture sequence TTTT+++ for reference sample.
• After cDNA synthesis, experimental and reference solutions are mixed and hybridized onto the array.
• After washing, the array is incubated with Cy5- and Cy3-labeled molecules called dendrimers. The dendrimers attach to the corresponding capture sequences.
Nguyen, et al. 2002
Comparing the dye incorporation methods
• In all three methods, the red and green dyes may incorporate unequally. Likewise, spectral overlap can occur between the fluorescence of the two dyes.
• In the direct incorporation method, the Cy3-dUTP and Cy5-dUTP molecules exhibit some steric hindrance, which contributes to inefficient and nonuniform incorporation of the dye into the cDNA.
• In the amino-modified method, the amino-allyl modification is smaller and causes less steric hindrance, so the amino-modified dUTPs are incorporated into the cDNAs more uniformly and with higher frequency than in the direct incorporation method.
Nguyen, et al. 2002
Comparing the dye incorporation methods
• In both the direct incorporation and amino-modified methods, the abundance of labeled uracil nucleotides is influenced by the composition as well as the length of the cDNA strand.
• The resulting fluorescence intensity depends on the abundance of uracil nucleotides that were incorporated into the cDNA strand.
• The primer tagging method attempts to correct this problem by attaching one dendrimer to each cDNA strand. Each dendrimer contains approximately 250 fluorescent Cy5 or Cy3 molecules. Hence there is approximately one intensity signal per cDNA molecule.
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure: Hybridization
• The solutions containing the experimental and reference labeled cDNAs are mixed and applied to the array, which contains the probe cDNAs in each spot.
• Target and probe sequences bind by base pairing (hybridization). Note that binding can occur between sequences that are similar but not identical (cross-hybridization).
• After sufficient time is allowed for hybridization, the array goes through a series of washes to eliminate all unbound target cDNAs and solution.
• The washing procedure must be stringent enough to remove all extraneous material but at the same time not remove the bound cDNAs—the signals of interest.
Nguyen, et al. 2002
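Binding by base pairing, and the cross-hybridization of similar but not identical sequences, can be illustrated with a toy model. The sketch below assumes equal-length strands and treats binding as a simple mismatch count against the probe's reverse complement; real hybridization depends on thermodynamics and wash stringency, which this does not model. The names `mismatches` and `binds`, the `max_mismatches` tolerance, and the sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the strand that pairs perfectly with seq."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def mismatches(probe, target):
    """Count positions where the target fails to pair with the probe."""
    if len(probe) != len(target):
        raise ValueError("toy model compares equal-length strands only")
    expected = reverse_complement(probe)
    return sum(1 for a, b in zip(expected, target.upper()) if a != b)


def binds(probe, target, max_mismatches=0):
    """Toy rule: bind when the number of mismatches is within the tolerance."""
    return mismatches(probe, target) <= max_mismatches


probe = "AAGCTTGC"
print(binds(probe, "GCAAGCTT"))                    # perfect complement
print(binds(probe, "GCAAGCTA"))                    # one mismatch, strict rule
print(binds(probe, "GCAAGCTA", max_mismatches=1))  # tolerated: cross-hybridization
```

Raising `max_mismatches` plays the role of a less stringent wash in this toy setting: more near-matching targets stay bound.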
cDNA Microarray Experimental Procedure: Concepts
• Consider one spot on the array. This spot contains cDNA probes for a gene of interest, say gene A.
• If there are target cDNAs in the mixed solution complementary to the probe cDNAs of gene A, they should bind together by base pairing (hybridization).
• If, for example, gene A is expressed in both the experimental and reference samples, we expect cDNA from both samples to bind with the probe cDNA, and this spot will then show both red and green fluorescence.
Nguyen, et al. 2002
![Page 40: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/40.jpg)
cDNA Microarray Experimental Procedure: Data Collection
• After the slides are prepared and the hybridization step is complete, the expression level of each gene is measured.
• The expression levels of a gene in the experimental or reference cells are measured by the spot intensities of the fluorescent dyes. We assume that a spot of high fluorescence indicates high expression of the corresponding gene.
• The array is scanned using a confocal laser microscope. Images of each spot on the array are produced, processed, and analyzed to measure the expression of each gene.
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure – Image quality
• There are a number of factors that influence the image quality, such as noise fluorescence (fluorescence from non-dye sources), pollution of the fluorescent signal, photo-bleaching, etc.
• Raw data consists of two images, one image obtained from the red channel and one from the green channel.
• Which pixels in the target area represent signal, and which represent background? Pixels must be categorized as one or the other.
• A measurement is made of the intensity of the fluorescence for the spot and the intensity of the noise fluorescence from the background.
Nguyen, et al. 2002
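The pixel categorization step can be sketched as follows. This is a minimal illustration that assumes a fixed intensity threshold separates signal from background, and that the spot is summarized by the mean of its signal pixels and the background by the median of the rest; the slides do not specify a segmentation method, so the threshold, the summaries, and the pixel values are all assumptions.

```python
from statistics import mean, median


def segment_spot(pixels, threshold):
    """Split the pixel intensities of a target area into signal and background."""
    signal = [p for row in pixels for p in row if p > threshold]
    background = [p for row in pixels for p in row if p <= threshold]
    return signal, background


# Hypothetical 4x4 target area from one channel, with a bright 2x2 spot.
patch = [
    [12, 15, 14, 13],
    [14, 210, 198, 15],
    [13, 205, 220, 12],
    [15, 14, 13, 14],
]
signal, background = segment_spot(patch, threshold=100)
spot_intensity = mean(signal)                # 208.25
background_intensity = median(background)    # 14.0
print(spot_intensity, background_intensity)
```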
cDNA Microarray – Quantifying gene expression
Cy5 (red) signal intensities: $I^R_{nm} = \left[ x^R_{ij} \right]$
Cy5 (red) background intensities: $B^R_{nm} = \left[ b^R_{ij} \right]$
for i = 1,…,n samples, j = 1,…,m genes (in the slide's layout, columns are indexed 1,…,n and rows 1,…,m)
Notation for Cy3 (green) intensities is the same with the exception of a G superscript.
cDNA Microarray – Quantifying gene expression
• Many analyses are based on background corrected intensities using the measurements
$r_{ij} = x^R_{ij} - b^R_{ij}$ and $g_{ij} = x^G_{ij} - b^G_{ij}$
• Intensity ratios are also used:
$r_{ij} / g_{ij}$ (represents the abundance of gene expression in the experimental sample relative to the reference sample)
• Negative intensities (where the background is stronger than the signal) can occur; these issues will be discussed later.
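The background correction and ratio can be worked through on a small example. The sketch below takes red and green signal and background values as nested lists, subtracts background from signal to get the corrected intensities r and g, and forms the ratio r/g. Returning `None` whenever a corrected intensity is not positive is an assumption made only for this illustration, since the slides defer the treatment of negative intensities; all numbers are made up.

```python
def background_correct(x, b):
    """Elementwise signal minus background for same-shaped nested lists."""
    return [[xi - bi for xi, bi in zip(x_row, b_row)] for x_row, b_row in zip(x, b)]


def intensity_ratios(r, g):
    """Elementwise r/g; None where either corrected intensity is not positive."""
    return [
        [ri / gi if ri > 0 and gi > 0 else None for ri, gi in zip(r_row, g_row)]
        for r_row, g_row in zip(r, g)
    ]


# Hypothetical intensities for a 2x2 block of spots.
x_red = [[500, 120], [300, 40]]
b_red = [[100, 20], [50, 60]]
x_green = [[300, 220], [150, 90]]
b_green = [[100, 20], [25, 40]]

r = background_correct(x_red, b_red)      # [[400, 100], [250, -20]]
g = background_correct(x_green, b_green)  # [[200, 200], [125, 50]]
ratios = intensity_ratios(r, g)           # [[2.0, 0.5], [2.0, None]]
print(ratios)
```

A ratio above 1 indicates higher abundance in the experimental (red) sample than in the reference (green) sample for that spot; the `None` entry marks the spot whose red background exceeded its signal.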
cDNA Microarray Experimental Procedure
Multiple sources of variability: biological and technical
• The multiple experimental and biological processes in the cDNA microarray procedure each contribute to the overall variability.
• For example, a significant amount of variability may be introduced during the selection of the tissue samples.
• A significant amount of variability also arises during measurement and image processing.
• It is not easy to identify which portions of the microarray procedure contribute most to the overall variability, and the variability at a given step of the procedure is often poorly understood.
Recap: cDNA Array Data
[Figure: tissue samples at nominal levels 1 and 2 → mRNA → labelled cDNA → microarray → data x1, x2]
Affymetrix Chip Experimental Procedure
• Affymetrix is another type of commonly used microarray technology.
• Rather than attach the pre-synthesized cDNA probes to the chip, oligonucleotides are chemically synthesized (grown) directly on the chip.
• An oligonucleotide is a short fragment of DNA (usually in single-stranded form) which is often chemically synthesized. It can be used as a probe or primer. 'Oligo' is Greek for 'few'.
• The oligonucleotides are synthesized to match known gene sequences. The synthesis process conceptually resembles semiconductor fabrication: it uses masks, light exposure, and deposition of nucleotides.
Affymetrix Chip Experimental Procedure
![Page 49: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/49.jpg)
Affymetrix Chip Experimental Procedure
• Each gene or EST (expressed sequence tag) is represented on the array by 11-20 features. Each feature consists of an oligonucleotide that is a perfect match (PM) to a segment of a gene.
• For each PM, there is a corresponding oligo that is identical to the PM except for a single mismatch (MM) at the central base of the oligonucleotide.
Nguyen, et al. 2002
![Page 50: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/50.jpg)
Affymetrix – Identifying gene expression
• Only one tissue sample is applied to each Affymetrix chip.
• The Affymetrix chip currently has 11 PM features for each gene. These 11 PM features serve as unique sequence detectors and the corresponding 11 MM features serve as controls.
• Under relatively ideal conditions, when the gene is expressed in the cell sample, high intensity is expected for the PM feature and low intensity for the MM feature.
• It is assumed that differences observed between the PM and MM feature intensities are due to hybridization kinetics of the different feature sequences and nonspecific background RNA hybridizations.
Nguyen, et al. 2002
![Page 51: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/51.jpg)
Affymetrix – Sample labeling
• Affymetrix uses a one-color detection scheme (one sample on one array).
• Targets are biotin-labeled cRNA, rather than cDNA.
• Double stranded cDNA is synthesized using RT. Then the targets, biotinylated cRNA, are synthesized using in vitro transcription.
• Biotinylated cRNA are cRNA that have biotin molecules attached to them.
• The biotinylated cRNA is fragmented (to reduce segment length) and hybridized to the array.
• The slide is washed and fluorescent dye is applied. The dye couples with the biotin on the cRNA.
Nguyen, et al. 2002
![Page 52: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/52.jpg)
Recap: Affymetrix Chip Experimental Procedure
![Page 53: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/53.jpg)
Affymetrix – Advantages over cDNA Microarrays
• With the one-dye system, unequal incorporation between dyes is not an issue—nor is spectral overlap between dyes.
• The reference sample is no longer needed. This reduces the required biological materials needed for experiments.
• The possibility of genomic DNA being labeled during RT is avoided because the modified biotinylated nucleotides are incorporated during IVT.
• Fragmentation of the target cRNAs ensures that most target lengths are within a reasonable range, thus avoiding target folding.
Nguyen, et al. 2002
![Page 54: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/54.jpg)
Affymetrix – Image processing
• Affymetrix software uses a gridding procedure to locate the features on the array.
• Each feature consists of about 64 pixels.
• The features are scanned and the intensity value for a feature is computed as the 75th percentile of the intensities for the pixels in that feature (excluding the boundary pixels).
• Signal intensities are corrected for background noise intensities.
Nguyen, et al. 2002
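As a rough illustration of the intensity summary described above, the sketch below takes the 75th percentile of a feature's interior pixels. The 8-by-8 layout, the name `feature_intensity`, and the inclusive percentile rule are assumptions made for the example; they are not details of the Affymetrix software.

```python
import statistics


def feature_intensity(pixels):
    """75th percentile of the interior pixels of a rectangular feature.

    `pixels` is a list of equal-length rows. The outermost ring of pixels
    (the boundary) is dropped first, so the feature must be at least 4 x 4.
    """
    interior = [value for row in pixels[1:-1] for value in row[1:-1]]
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(interior, n=4, method="inclusive")[2]


# Hypothetical 8 x 8 feature (64 pixels): dim boundary, brighter interior.
feature = [[20 if r in (0, 7) or c in (0, 7) else 100 + r + c
            for c in range(8)] for r in range(8)]
print(feature_intensity(feature))
```

Different percentile conventions give slightly different values on small features, so the rule used should be stated when results are compared.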
![Page 55: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/55.jpg)
Affymetrix – Quantifying gene expression levels
• Recall that each gene consists of k = 1,…,K features, each feature having a pair of perfect match and mismatch intensity measurements ( PMk , MMk ).
• Average Difference (AvDiff) method, for each gene:
1. Calculate the difference for each feature: dk = PMk - MMk
2. Remove all dk that exceed 3 SD of the trimmed mean (trimmed mean calculated by excluding the largest and smallest dk)
3. Take average of the remaining dk
• Most reported results from Affymetrix arrays are based on analyses that use AvDiff but with various ways to filter the outliers.
Nguyen, et al. 2002
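The AvDiff steps can be sketched in a few lines. This is a minimal reading of the slide, not the vendor implementation: it assumes "exceed 3 SD" means lying more than 3 SD away from the trimmed mean, and that the SD is computed on the same trimmed set. The function name `avdiff` and the example intensities are hypothetical.

```python
import statistics


def avdiff(pm, mm, n_sd=3.0):
    """Average Difference for one gene from paired PM/MM intensities.

    Needs at least 4 features so that the trimmed set has an SD.
    """
    d = [p - m for p, m in zip(pm, mm)]
    # Trimmed set: drop the single largest and single smallest difference.
    trimmed = sorted(d)[1:-1]
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)  # assumption: SD of the trimmed set
    # Keep the differences within n_sd SDs of the trimmed mean.
    kept = [x for x in d if abs(x - center) <= n_sd * spread]
    return statistics.mean(kept)


# Hypothetical gene with 11 features; the last PM value is an outlier.
pm = [210, 190, 205, 195, 200, 198, 202, 207, 193, 201, 900]
mm = [100] * 11
print(avdiff(pm, mm))
```

In this example the outlying difference of 800 is filtered out, while the smallest difference (90) is kept because it lies within 3 SD of the trimmed mean.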
![Page 56: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/56.jpg)
Affymetrix – Quantifying gene expression levels
• As another way to measure gene expression, Efron et al. (2001) investigated
$$\mathrm{avg}\{\, d_k = \log(PM_k) - c \log(MM_k),\ k = 1,\ldots,K \,\}$$
for various scale factors c.
• Yet another approach to measure gene expression:
$$\mathrm{Signal} = \exp\left\{ \frac{1}{K} \sum_{k=1}^{K} w_k \log(PM_k - CT_k) \right\}$$
where CTk is the change threshold and wk is a weight.
• Naef et al. note that the information content of MM features is not clear; they proposed expression indexes using only the PM features.
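The log-scale summary investigated by Efron et al. is simple to compute for one gene. The sketch below is illustrative only: the name `log_scale_expression` and the default `c=0.5` are assumptions, and natural logs are used because the slide does not state a base.

```python
import math


def log_scale_expression(pm, mm, c=0.5):
    """Average over features of log(PM_k) - c * log(MM_k)."""
    terms = [math.log(p) - c * math.log(m) for p, m in zip(pm, mm)]
    return sum(terms) / len(terms)


# Hypothetical intensities for a gene with four features.
pm = [1200.0, 950.0, 1100.0, 1020.0]
mm = [300.0, 280.0, 350.0, 310.0]
for c in (0.0, 0.5, 1.0):
    print(c, log_scale_expression(pm, mm, c))
```

Setting c = 0 ignores the MM features entirely, and c = 1 gives the average log ratio of PM to MM.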
![Page 57: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/57.jpg)
Affymetrix – Quantifying gene expression levels
From Statistical Algorithms Reference Guide, Affymetrix, 2002:
“When the mismatch intensity is lower than the perfect match intensity, then the mismatch is informative and provides an estimate of stray signal. Rules are employed to ensure that negative signal values are not calculated. Negative values do not make physiological sense, and make further data processing, such as log transformations, difficult.”
![Page 58: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/58.jpg)
PROCESSING IMAGE FILES TO GIVE ROBUST ESTIMATES OF INTENSITY
![Page 59: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/59.jpg)
cDNA Microarray – Quantifying gene expression
Cy5 (red) signal intensities: $I^R_{nm} = \left[ x^R_{ij} \right]$
Cy5 (red) background intensities: $B^R_{nm} = \left[ b^R_{ij} \right]$
for i = 1,…,n samples, j = 1,…,m genes
Notation for Cy3 (green) intensities is the same with the exception of a G superscript.
![Page 60: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/60.jpg)
cDNA Microarray – Quantifying gene expression
• Many analyses are based on background corrected intensities using the measurements
$$r_{ij} = x^R_{ij} - b^R_{ij} \quad \text{and} \quad g_{ij} = x^G_{ij} - b^G_{ij}$$
• Intensity ratio is of interest:
$$\frac{r_{ij}}{g_{ij}}$$
• Represents the abundance of gene expression in experimental sample relative to reference sample
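A small sketch of the background correction and ratio for a single spot follows. The function name `log_ratio`, the use of a base-2 log, and the choice to return `None` when a corrected intensity is not positive are assumptions for illustration; the slides do not say how such spots should be handled.

```python
import math


def log_ratio(x_r, b_r, x_g, b_g):
    """log2 of the background-corrected red/green intensity ratio."""
    r = x_r - b_r  # corrected Cy5 (red) intensity
    g = x_g - b_g  # corrected Cy3 (green) intensity
    if r <= 0 or g <= 0:
        return None  # ratio is not defined on the log scale
    return math.log2(r / g)


# Hypothetical spot: red foreground 900, green foreground 300, both backgrounds 100.
print(log_ratio(900, 100, 300, 100))
```

A value of 2 on this scale corresponds to a four-fold higher corrected intensity in the red channel than in the green channel.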
![Page 61: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/61.jpg)
cDNA Microarray
http://www.beatson.gla.ac.uk/infrastructure.htm
Microarray technology. Figure showing part of a 10,000 element cDNA microarray hybridised with a cancer cell line RNA labelled with the fluorescent marker, Cy3 (green) and a control, labelled with Cy5 (red). This identifies genes with differential expression. (Dr N.I. Barr)
![Page 62: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/62.jpg)
Considerations when obtaining data
One array: Assigning coordinates to each spot (and measuring distances between spots)
One spot A: Which pixels to use as signal and background?
One spot B: Given pixel information, how to assign a summary foreground and background value? How do you combine the two to give an intensity estimate?
![Page 63: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/63.jpg)
Considerations when obtaining data
Addressing or gridding: Assign coordinates to each spot. Obtain foreground grid lines and background grid lines.
Segmentation: Classify pixels as foreground or background.
Intensity Extraction: Given pixel information, calculate summary measures for foreground and background.
Background Correction: Correct foreground for background to give estimate of intensity.
![Page 64: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/64.jpg)
Addressing or gridding
[Figure: one grid of the array, with dimensions labelled 21 and 19.]
Format of array is known:
• Distance between rows and columns of grids
• Translation of grid
• Distance between rows and columns of spots within each grid
• Translation of spots
• Overall position of array in the image
• Rotation of array
![Page 65: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/65.jpg)
Addressing or gridding
• Addressing records all of this information.
• Important information for use later is foreground grid lines and background grid lines.
• Foreground grid lines (fgl) show spot locations.
• Background grid lines (bgl) separate spots.
![Page 66: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/66.jpg)
Segmentation
Fixed Circle Segmentation: Fix diameter for every spot and draw circles of this diameter.
Adaptive Circle Segmentation: Allow diameters to change from spot to spot. Takes a long time.
Adaptive Shape Segmentation
Histogram Segmentation
![Page 67: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/67.jpg)
Adaptive Shape Segmentation (SRG)
SRG = seeded region growing
1. Start with collection of background and foreground pixels (seeds).
2. Identify neighboring pixels for each collection and calculate the mean intensity for background neighbors and foreground neighbors.
3. Identify pixel in foreground neighbors that is closest to foreground mean and include it in collection. Do the same for background.
4. Continue.
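The region-growing loop above can be sketched on a small image patch. This is a simplified illustration, not the Spot implementation: it assumes 4-connected neighbours, compares each candidate pixel with the running mean of the region it borders, and assigns one pixel per pass. The name `seeded_region_growing` and the labels "fg" and "bg" are hypothetical.

```python
def seeded_region_growing(image, seeds):
    """Grow labelled regions from seed pixels.

    `image` is a list of rows of intensities; `seeds` maps (row, col) to a
    label. Returns a dict mapping every pixel to a label.
    """
    rows, cols = len(image), len(image[0])
    labels = dict(seeds)
    total, count = {}, {}
    for (r, c), lab in seeds.items():
        total[lab] = total.get(lab, 0.0) + image[r][c]
        count[lab] = count.get(lab, 0) + 1

    def neighbours(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield rr, cc

    while len(labels) < rows * cols:
        best = None  # (distance to region mean, pixel, label)
        for (r, c), lab in labels.items():
            mean = total[lab] / count[lab]
            for p in neighbours(r, c):
                if p in labels:
                    continue
                dist = abs(image[p[0]][p[1]] - mean)
                if best is None or dist < best[0]:
                    best = (dist, p, lab)
        if best is None:
            break  # no unlabelled pixel touches a region
        _, (r, c), lab = best
        labels[(r, c)] = lab  # add the closest candidate to its region
        total[lab] += image[r][c]
        count[lab] += 1
    return labels


# Hypothetical 5 x 5 patch: a bright 3 x 3 spot on a dim background.
patch = [[10] * 5] + [[10, 100, 100, 100, 10] for _ in range(3)] + [[10] * 5]
result = seeded_region_growing(patch, {(2, 2): "fg", (0, 0): "bg"})
print(sum(1 for lab in result.values() if lab == "fg"))
```

Rescanning every labelled pixel on each pass keeps the sketch short but is slow; a practical version would keep the candidates in a priority queue.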
![Page 68: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/68.jpg)
Selection of foreground and background seeds
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
• Foreground seeds are chosen by finding the maximum of the combined intensity surface over a small region centered within the square (single point within the square).
• Background seeds are constructed as crosses based on the fitted background grid.
![Page 69: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/69.jpg)
Histogram Segmentation
• Target mask chosen larger than spot.
• Chen, et al.:
– Take 8 random samples from the patch.
– Take the lowest 8 values from the mask.
– Apply a Wilcoxon rank-sum (WRS) test.
– Reject: signal is defined to be the 8 values from the mask and all pixels in the mask with intensities ≥ the smallest of the 8.
– Do not reject: repeat with some number of the 8 mask values replaced with pixels of higher intensity.
[Figure labels: Target site, Target mask, Target patch]
![Page 70: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/70.jpg)
Histogram Segmentation
Plot histogram of pixels in mask:
• Average of pixels between the 5th and 20th percentiles used as background
• Average of pixels between the 80th and 95th percentiles used as foreground
Advantage: Simple
Disadvantage: Large mask might include other spots
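The percentile-band averages can be sketched as follows. The helper names and the rank-based way of cutting the bands are assumptions for illustration; QuantArray's exact rule may differ.

```python
def percentile_band_mean(values, lo_pct, hi_pct):
    """Mean of the sorted values whose ranks fall between two percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = int(n * lo_pct / 100)
    hi = int(n * hi_pct / 100)
    # Fall back to a single value if the band is empty for a tiny mask.
    band = ordered[lo:hi] or ordered[lo:lo + 1]
    return sum(band) / len(band)


def histogram_segmentation(mask_pixels):
    """Return (foreground, background) summaries for one target mask."""
    background = percentile_band_mean(mask_pixels, 5, 20)
    foreground = percentile_band_mean(mask_pixels, 80, 95)
    return foreground, background


# Hypothetical mask of 100 pixels with intensities 0, 1, ..., 99.
print(histogram_segmentation(list(range(100))))
```

Dropping the top 5% and bottom 5% of pixels keeps isolated extreme pixels out of both summaries.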
![Page 71: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/71.jpg)
Segmentation
Methods and software:
Fixed Circle: ScanAlyze, GenePix, QuantArray
Adaptive Circle: GenePix
Adaptive Shape: Spot, SRG & one other method
Histogram: QuantArray
Histogram method and Chen, et al. method give you summary measures of pixel intensities. Other methods simply divide the pixels into foreground and background.
![Page 72: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/72.jpg)
Different background adjustment methods
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
• Region inside red circle represents the spot mask.
• Local background calculation by different methods:
Green: used in QuantArray;
Blue: used in ScanAlyze;
Pink: used in Spot.
![Page 73: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/73.jpg)
Different background adjustment methods
• Histogram-based techniques give foreground and background measures directly.
• Other methods simply divide information into foreground pixels and background pixels and you need to perform the calculations.
• Some give summary measures of foreground and background intensity.
![Page 74: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/74.jpg)
Estimating background
ScanAlyze: Median of all pixels outside spot mask (circle) that are within square centered at spot center.
QuantArray: Median of all pixels in concentric circles outside of (and some distance from) spot mask.
GenePix: Median of all pixels in “valleys” surrounding spot mask.
Spot: Option implemented in GenePix and other options that utilize information (e.g. morphological opening) from the entire array to give spot specific adjustment.
![Page 75: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/75.jpg)
Morphological opening:
• Image with spot intensities removed is estimated. This provides estimate of background for the entire slide.
• Background is value of this image at spot center.
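Morphological opening is an erosion (local minimum) followed by a dilation (local maximum) over the same window. The sketch below uses a square window and plain nested lists; the function names and the tiny example image are assumptions, and the window must be larger than a spot for the spot to be removed.

```python
def _window_filter(image, half, op):
    """Apply `op` (min or max) over a (2*half+1)-sided square window."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1))]
            out_row.append(op(window))
        out.append(out_row)
    return out


def morphological_opening(image, half):
    """Erosion followed by dilation; removes bright features smaller than the window."""
    eroded = _window_filter(image, half, min)
    return _window_filter(eroded, half, max)


# Hypothetical 7 x 7 image: flat background of 10 with a one-pixel spot of 200.
image = [[10] * 7 for _ in range(7)]
image[3][3] = 200
background = morphological_opening(image, half=1)
print(background[3][3])  # background estimate at the spot center
```

Because the opened image never exceeds the original at any pixel, this kind of estimate tends to sit below local-median backgrounds.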
![Page 76: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/76.jpg)
Comparison of methods considers different background adjustments
• Methods considered are implemented in the packages. Broadly classified into 4 categories:
1. Local adjustment: use median intensity of pixels in region just outside spot mask
2. Morphological opening
3. Constant background
4. No adjustment
• Unless otherwise specified, spot foreground intensities are calculated by taking the mean intensity of the pixels within the spot mask.
![Page 77: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/77.jpg)
Description of image analysis methods
QA.fix.nbg
Software: QuantArray. Segmentation: Spot intensity is the mean of pixel values between the 45th and 85th percentiles within a fixed circle 9 pixels in diameter. Background: None.
QA.hist.nbg
Software: QuantArray. Segmentation: Spot intensity is the mean of pixel values between the 80th and 95th percentiles of an 11-by-11 pixel square. Background: None.
GP
Software: GenePix. Segmentation: Proprietary algorithm that results in adaptively sized circles. Background: Median from “valley of spot”.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
![Page 78: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/78.jpg)
Description of image analysis methods
SA
Software: ScanAlyze. Segmentation: Fixed circles, 10 pixels in diameter. Background: Median value in local square region.
S.morph
Software: Spot. Segmentation: Seeded region growing. Background: Based on morphological opening. The structuring element is a square region with sides of length 2.5 times the approximate spot to spot separation.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
![Page 79: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/79.jpg)
Comparing image analysis methods
Foreground intensities
Background intensities
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
![Page 80: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/80.jpg)
Comparing image analysis methods
Some observations:
• Higher intensities show tighter correlation.
• Background estimates: low correlation implies very little useful information.
• SA local median has smallest variability followed by S.morph and GP. QA highly variable. Very high QA values come from concentric circle method and probably mean other spots are being included.
• S.morph lowers background.
![Page 81: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/81.jpg)
Comparing image analysis methods
[Figure: two panels, (A) and (B), each plotting Foreground – Background (vertical axis) against Background (horizontal axis).]
(A) Morphological background adjustment method of Spot (S.morph)
(B) QA.fix (often gives increasing BG estimates)
Only values from the lower half of the foreground intensity distribution are displayed.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
![Page 82: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/82.jpg)
Data
8 AI knockouts (Cy5)
1 Reference (Cy3)
8 Normals (Cy5)
1 Reference (Cy3)
Ref: pooled cDNA from 8 normals
6384 probes
257 (~4%) known to be related to lipid metabolism
From Callow, et al. 2000, Dudoit, et al. 2002
![Page 83: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/83.jpg)
Comparison of t-denominators of different methods
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
Comparison of the t-denominators (estimating between slide variability) for different image analysis methods in the apo AI experiment.
![Page 84: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/84.jpg)
Comparison of t-values of different methods
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
Gap between p-values for 8 known DE genes largest for SA, S.morph, and S.const
[Figure label: S.nbg and S.valley]
![Page 85: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/85.jpg)
Comparing image analysis methods
• Morphological opening provides lower estimates of background than other methods.
─ M.O. estimates are less variable than other approaches.
─ Accuracy (assessed by finding known DE genes) was not compromised.
• In terms of finding DE genes,
Spot ScanAlyze
GenePix QuantArray
![Page 86: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/86.jpg)
Comparing image analysis methods
• Choice of intensity estimation method has larger impact on log intensity ratios than segmentation method.
• Means or medians over large neighborhoods can be noisy.
• No background adjustment results in decreased ability to find DE genes.
• Recommend morphological opening method.*
* No comments on false positives or false negatives
![Page 87: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/87.jpg)
Considerations following cDNA array intensity estimation
• Dye bias
• Print tip effects
• Spatial effects
• Array effects
Normalization attempts to minimize the effect of these systematic variations, making substantive differences easier to find.
![Page 88: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/88.jpg)
Simple Problem
One array (A1) is brighter than a second array (A2) and you would like to compare the two.
[Figure: plots of $\log_2(R/G)$ for array 1 and array 2.]
![Page 89: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/89.jpg)
Simple solution
• Scale intensities to have the same mean or median. For array 1:
$$\log_2 (R/G)^{*} = \log_2 (R/G) - c$$
• Problems with this?
• Assumes “shift effect” is constant across array.
• Doesn’t account for spatial effects.
Yang, et al., 2002
![Page 90: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/90.jpg)
Within-slide normalization
Methods?
$$\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2\{R / (kG)\}$$
Standard Practice (in most software): c is a constant such that normalized log-ratios have zero mean or median.
Our Preference: c is a function of overall spot intensity and print-tip-group.
What genes to use?
• All genes on the array
• Constantly expressed genes (housekeeping)
• Controls
– Spiked controls (e.g. plant genes)
– Genomic DNA titration series
• Other set of genes
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
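The standard practice of a single constant c can be sketched as below, using the median of the log-ratios over all genes on the array. The function name `median_normalize` and the example intensities are hypothetical. Since subtracting c on the log scale is the same as rescaling G, the matching factor is k = 2^c.

```python
import math
import statistics


def median_normalize(red, green):
    """Shift log2(R/G) by a constant so the normalized median is zero.

    Returns the normalized log-ratios and the constant c that was subtracted.
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = statistics.median(m)
    return [value - c for value in m], c


# Hypothetical array on which the red channel is systematically brighter.
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
normalized, c = median_normalize(red, green)
print(normalized, c, 2 ** c)  # 2 ** c is the scale factor k applied to G
```

This removes one overall shift only; it does not address intensity-dependent or spatial effects.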
![Page 91: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/91.jpg)
M vs. A
$$M = \log_2 (R/G), \qquad A = \log_2 \sqrt{RG}$$
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
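For a single spot, the two plot coordinates can be computed as in this short sketch. It takes M as the log-ratio and A as the average log-intensity, which is the usual convention; the function name `ma_values` is hypothetical.

```python
import math


def ma_values(r, g):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(r / g)        # log-ratio
    a = 0.5 * math.log2(r * g)  # average log-intensity, i.e. log2 of sqrt(R*G)
    return m, a


# Hypothetical spot with red intensity 8 and green intensity 2.
print(ma_values(8.0, 2.0))
```

Plotting M against A amounts to a 45-degree rotation of the log2 R versus log2 G scatterplot, which makes intensity-dependent dye bias easier to see.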
![Page 92: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/92.jpg)
Normalization – Median
• Assumption: Changes roughly symmetric
• First panel: smooth density of log2G and log2R.
• Second panel: M vs. A plot with median set to zero
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
![Page 93: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/93.jpg)
Normalization – lowess
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
• Global lowess
• Assumption: changes roughly symmetric at all intensities.
![Page 94: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/94.jpg)
Build the smooth function S(x) pointwise:
1. Take a point, x0. Find K nearest neighbors of x0 (N(x0)). The number of neighbors K is determined by user—specifies some percentage of the total number of points (they use 40%).
2. Calculate $\Delta(x_0) = \max_{x \in N(x_0)} |x_0 - x|$
3. Assign weights to N(x0) points.
4. Calculate weighted least squares fit of y on N(x0). Take $\hat{y}_0 = S(x_0)$
5. Repeat . . .
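The pointwise construction can be sketched with the standard library alone. Several choices here are assumptions, since the slide leaves them open: tricube weights, a local straight line as the weighted least squares fit, and no robustness iterations. The names `lowess_point` and `lowess` are hypothetical, and `frac=0.4` mirrors the 40% mentioned above.

```python
def lowess_point(x, y, x0, frac=0.4):
    """Smoothed value at x0 from a weighted local line."""
    n = len(x)
    k = max(2, int(round(frac * n)))
    # Step 1: indices of the K nearest neighbours of x0.
    nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
    # Step 2: largest distance from x0 within the neighbourhood.
    delta = max(abs(x[i] - x0) for i in nearest)
    if delta == 0:
        return sum(y[i] for i in nearest) / k
    # Step 3: tricube weights (assumed), 1 at x0 and 0 at distance delta.
    w = [(1 - (abs(x[i] - x0) / delta) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    if sw == 0:
        return sum(y[i] for i in nearest) / k
    # Step 4: weighted least squares line through the neighbourhood.
    mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def lowess(x, y, frac=0.4):
    """Step 5: repeat the local fit at every observed x."""
    return [lowess_point(x, y, x0, frac) for x0 in x]


# Hypothetical use on an M-A plot: subtract the fitted curve from M.
a_values = [float(i) for i in range(20)]
m_values = [0.1 * a - 1.0 for a in a_values]  # pure intensity-dependent trend
curve = lowess(a_values, m_values)
residual = [m - s for m, s in zip(m_values, curve)]
print(max(abs(v) for v in residual))
```

With a purely linear trend, as in this example, the local lines reproduce the trend, so the normalized values are essentially zero.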
![Page 95: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/95.jpg)
Instead of
$$\log_2 (R/G)^{*} = \log_2 (R/G) - c$$
Use
$$\log_2 (R/G)^{*} = \log_2 (R/G) - c(A)$$
Where c(·) is the smooth curve through the M-A plot (lowess fit to the M-A plot)
Recall:
$$M = \log_2 (R/G), \qquad A = \log_2 \sqrt{RG}$$
Yang, et al.
![Page 96: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/96.jpg)
Normalization – print-tip-group
Assumption: For every print group, changes roughly symmetric at all intensities.
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
![Page 97: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/97.jpg)
M vs. A – after print-tip-group normalization
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
![Page 98: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/98.jpg)
Within print-tip-group normalization is reasonable when:
1. Only a relatively small proportion of the genes will vary significantly in expression between the two mRNA samples
or
2. There is symmetry in the expression levels of the up/down regulated genes.
3. There is no correlation between groups of DE genes and print tips
![Page 99: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/99.jpg)
• Consider location normalized intensities for print tip group i.
$$\log_2 (R/G)^{*} = \log_2 (R/G) - c_i(A)$$
• Suppose
$$X \sim N(0,\ a_i^2 \sigma^2)$$
$\sigma^2$: variance of true log-ratios
$a_i$: effect of the $i$th print tip group
• Can get estimates of ai’s and adjust.
![Page 100: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/100.jpg)
Taking scale into account
Assumptions:
• All print-tip-groups have the same spread.
• True ratio is $\mu_{ij}$ where i represents different print-tip-groups, j represents different spots.
• Observed is $M_{ij}$, where $M_{ij} = a_i \mu_{ij}$ and
$$\sum_{i=1}^{I} \log a_i^2 = 0$$
• Robust estimate of $a_i$ is
$$\hat{a}_i = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}$$
where $\mathrm{MAD}_i = \mathrm{median}_j \{\, |y_{ij} - \mathrm{median}(y_{ij})| \,\}$
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
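The robust scale estimate can be sketched as each group's MAD divided by the geometric mean of all the MADs, so that the logs of the factors sum to zero. Dividing each group's log-ratios by its factor is an assumed reading of "adjust"; the function names and example data are hypothetical, and every group needs a nonzero MAD.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def scale_factors(groups):
    """One factor per print-tip group: MAD_i over the geometric mean of the MADs."""
    mads = [mad(g) for g in groups]
    geometric_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [m / geometric_mean for m in mads]


def scale_normalize(groups):
    """Divide each group's log-ratios by its estimated scale factor."""
    factors = scale_factors(groups)
    return [[v / a for v in g] for g, a in zip(groups, factors)]


# Hypothetical location-normalized log-ratios for two print-tip groups.
groups = [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
print(scale_factors(groups))
print(scale_normalize(groups))
```

After the adjustment both example groups have the same MAD, which is the intended effect of scale normalization.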
![Page 101: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/101.jpg)
Within print-tip-group box plots for print-tip-group normalized M
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
![Page 102: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/102.jpg)
Effect of location + scale normalization
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
![Page 103: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/103.jpg)
Problem
• If differences in scale were due largely to DE genes, adjusting for scale might mask your ability to find those genes.
• Again, if few genes are expected to be DE, this might not be an issue.
![Page 104: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/104.jpg)
Alternative method
$$\log_2 (R/G)^{*} = \log_2 (R/G) - c_i(\cdot)$$
• where ci(·) is determined by both genes in the ith print-tip-group and other genes.
• “Composite” normalization uses MSP (titration series) genes.
• Could also use other housekeeping genes.
![Page 105: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/105.jpg)
Comparing different normalization methods
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
![Page 106: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/106.jpg)
Summary
• Print tip normalization works well under two assumptions:
1. MSP genes have minimal sample specific bias and can cover wide intensity range. Composite normalization necessary with divergent samples.
2. Adjusting for scale might compromise ability to find DE genes. Could have opposite effect (false positives).
• Kerr, et al. and Wolfinger, et al. perform only global normalization.
* maanova has extra normalization options
![Page 107: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/107.jpg)
Recall Affy Measures of expression
• GeneChip® older software uses Avg.diff:
$$\text{Avg.diff} = \frac{1}{|A|} \sum_{i \in A} (PM_i - MM_i)$$
with A a set of suitable pairs chosen by software.
• $\log(PM_i / MM_i)$ was also used.
http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt
![Page 108: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/108.jpg)
Affy Measures of expression
http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt
GeneChip® newest version (MAS 5.0) uses something else, namely
$$\log(\text{Signal}) = \text{TukeyBiweight}\{\log(PM_i - CT_i)\}$$
with CT a version of MM that is never bigger than PM. Here TukeyBiweight can be regarded as a kind of robust/resistant mean.
![Page 109: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/109.jpg)
Affy Measures of expression
$$\log(\text{Signal}) = \mathrm{TB}\{\log(PM_i - CT_i)\}$$
$$\text{Signal} = \exp\big(\mathrm{TB}\{\log(PM_i - CT_i)\}\big)$$
Rules to determine CT (change threshold) for each probe pair:
1. MM < PM: CT = MM
2. MM ≥ PM:
A. If MM < PM for most probe pairs, an adjusted MM value is used, based on the bi-weight mean of the ratio PM/MM.
B. If MM ≥ PM for most probe pairs, MM is replaced with a value that is “slightly smaller” than PM.
![Page 110: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/110.jpg)
Affy Measures of Expression
Determine weights:
• Calculate median of log(PM-CT) values across the probe set.
• Probe pair weights are determined by distance to median. Closer pairs get higher weights.
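The weighting idea can be sketched as a one-step Tukey biweight: distances from the median are scaled by the MAD, and probe pairs closer to the median get larger weights. The tuning constant `c = 5` and the small `eps = 0.0001` are the values commonly quoted for MAS 5.0 and should be treated as assumptions here; the function name and sample values are illustrative.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust, weighted mean of the values.

    c and eps are assumed tuning constants, not taken from the slides.
    """
    m = median(values)
    s = median(abs(v - m) for v in values)  # MAD, used as the scale
    num = 0.0
    den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)          # scaled distance to the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # far points get weight 0
        num += w * v
        den += w
    return num / den

# Hypothetical log(PM - CT) values for one probe set, with one outlier.
log_diffs = [1.0, 1.1, 0.9, 1.05, 8.0]
robust_mean = tukey_biweight(log_diffs)
```

In this toy probe set the outlying value 8.0 receives zero weight, so the summary stays near the bulk of the data.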
![Page 111: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/111.jpg)
Affymetrix Normalization and Scaling (pre MAS 5.0)
Global Normalization: Baseline array is chosen out of a set of arrays. Average Intensity* for this array is calculated. Intensities on any other (non-baseline) array A1 in the set are multiplied by normalization factor (NF) to make Average Intensity* of A1 equal to Average Intensity* of baseline array.
Global Scaling: Target intensity is chosen and each array in a set of interest is scaled by some factor (SF1, SF2, ..., SFN) to give Average Intensity* equal to target intensity.
Average Intensity*: Average of Average Difference values for every probe set except highest and lowest 2%.
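A rough sketch of global scaling under the definitions above: the Average Intensity* is taken as a trimmed mean that drops the lowest and highest 2% of Average Difference values, and the scaling factor is the target divided by that trimmed mean. Function names and the toy array are hypothetical.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

def scaling_factor(avg_diffs, target):
    """SF such that the scaled array has trimmed mean equal to target."""
    return target / trimmed_mean(avg_diffs)

def global_scale(avg_diffs, target):
    """Multiply every Average Difference value on the array by its SF."""
    sf = scaling_factor(avg_diffs, target)
    return [sf * v for v in avg_diffs]

# Toy array (hypothetical): Average Difference values 1..100.
array = [float(v) for v in range(1, 101)]
scaled = global_scale(array, target=101.0)
```

Global normalization follows the same pattern, with the baseline array's Average Intensity* playing the role of the target.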
![Page 112: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/112.jpg)
Affymetrix – Quantifying gene expression levels
Li and Wong brought attention to the fact that AvDiff, as a measure of expression, has not been studied extensively. They proposed:
Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection
Cheng Li and Wing Hung Wong* PNAS, January 2001, 98:1, p. 31-36.
* Ph.D. student of Grace Wahba, University of Wisconsin-Madison, graduated in 1980
* Recipient of COPSS Prize
— Li and Wong, text, 2003
![Page 113: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/113.jpg)
Model-based analysis of oligonucleotide arrays
Consider I array samples and one gene:
Goal: Estimate the abundance level of the gene in the I samples.
Data: There are 2×I×20 measurements used to obtain estimates (I×20 PMs and I×20 MMs).
θi : Denotes “expression index” for gene in the ith sample.
Assume: Measured intensity is proportional to θi and proportionality constant depends on probe (indexed by j).
What is PMij = βjθi ?
[Diagram: arrays i = 1, ..., I, each with 20 features for the gene.]
Li and Wong, 2001
![Page 114: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/114.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
For MM, denote the proportionality constant by αj
For PM, denote the proportionality constant by βj
νj: baseline response for jth probe pair due to nonspecific hybridization
αj: rate of increase of MM response for jth probe
ϕj : additional rate of increase in PM response
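$$MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$$
$$PM_{ij} = \nu_j + \theta_i \beta_j + \epsilon_{ij}$$
$\beta_j = \alpha_j + \phi_j$: PM intensity increases at a higher rate than MM intensity ($\beta_j > \alpha_j$).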
![Page 115: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/115.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig. 1. Black curves are the PM and MM data of gene A in the first six arrays. Light curves are the fitted values to model 1. Probe pairs are labeled 1 to 20 on the horizontal axis.
![Page 116: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/116.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
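Recall the model:
$$MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$$
$$PM_{ij} = \nu_j + \theta_i \beta_j + \epsilon_{ij}$$
$$\beta_j = \alpha_j + \phi_j$$
Currently, there is a “strong preference” to base all computations on y = PM – MM for each probe pair. Subtracting the deterministic portions of the equations above gives:
$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$$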
![Page 117: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/117.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
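Consider
$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$$
with
$$\epsilon_{ij} \sim N(0, \sigma^2)$$
Assume this identifiability constraint:
$$\sum_j \phi_j^2 = J$$
Fix $\phi$ and fit for $\theta$ using least squares, giving $\tilde{\theta}$.
Fix $\theta$ at $\tilde{\theta}$ and fit for $\phi$ using least squares, giving $\tilde{\phi}$.
Iterate.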
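The alternating fit can be sketched as follows, assuming the multiplicative model y_ij = θ_i ϕ_j + ε_ij with the constraint that the squared ϕ_j sum to J. This is a simplified illustration (no outlier handling, fixed iteration count, all names hypothetical), not the dChip implementation.

```python
import math

def fit_li_wong(y, n_iter=50):
    """Alternating least squares for y[i][j] ~ theta[i] * phi[j].

    y[i][j] is the PM - MM difference for array i and probe j.
    Assumes the data are not all zero, so the divisions are defined.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes  # starting values; satisfy sum(phi^2) = J
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Fix phi: least squares estimate of each theta_i.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / ss_phi
                 for i in range(n_arrays)]
        # Fix theta: least squares estimate of each phi_j.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / ss_theta
               for j in range(n_probes)]
        # Rescale so that sum(phi^2) = J; the products theta_i * phi_j are unchanged.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
        theta = [t / scale for t in theta]
    return theta, phi
```

With noise-free data of the form θ_i ϕ_j, the loop recovers the generating values after the first pass; with real data a convergence check would replace the fixed iteration count.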
![Page 118: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/118.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 2. Black curves are the PM-MM difference data of gene A in the first six arrays. Light curves are the fitted values to model 2.
![Page 119: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/119.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 3. Plots of residuals (y axis) versus fitted value (x axis) for additive model (A) and multiplicative model (B).
(A) $y_{ij} = \theta_i + \phi_j + \epsilon_{ij}$
(B) $y_{ij} = \theta_i \phi_j + \epsilon_{ij}$
![Page 120: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/120.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
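Consider model for one array:
$$y_j = \theta \phi_j + \epsilon_j$$
Suppose ϕ's are obtained from many arrays. Treat them as known.
Given ϕ's, the LS estimate for θ is
$$\hat{\theta} = \frac{\sum_j y_j \phi_j}{\sum_j \phi_j^2} = \frac{\sum_j y_j \phi_j}{J}$$
$$E[\hat{\theta}] = \frac{1}{J} \sum_j \phi_j E[y_j] = \frac{1}{J} \sum_j \phi_j^2 \theta = \theta$$
$$\mathrm{Var}[\hat{\theta}] = \frac{1}{J^2} \sum_j \phi_j^2 \mathrm{Var}[y_j] = \frac{\sigma^2}{J}$$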
![Page 121: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/121.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
• Regarding θ's as fixed, one can proceed similarly to get estimates and standard errors for the ϕ's.
• Note: A conditional analysis is done here. This assumes certain effects are known.
• In practice, the effects are estimated. The uncertainty in this estimation is not considered when computing standard errors.
• What is ?
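$$\mathrm{SE}(\hat{\theta}) = \sqrt{\frac{\hat{\sigma}^2}{J}}, \qquad \hat{\sigma}^2 = \frac{1}{J-1} \sum_j (y_j - \hat{y}_j)^2$$
with fitted values $\hat{y}_j = \hat{\theta} \phi_j$.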
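As a small worked sketch of the conditional analysis for one array: with the ϕ's treated as known, the least squares estimate of θ is the sum of y_j ϕ_j divided by the sum of ϕ_j squared, and its standard error is the square root of the residual variance divided by that same sum. The residual variance below uses J - 1 degrees of freedom, which is an assumption about the intended formula; names and numbers are hypothetical.

```python
import math

def expression_index(y, phi):
    """Return (theta_hat, se) for one array, treating phi as known.

    y[j]: PM - MM difference for probe j; phi[j]: probe sensitivity.
    """
    n_probes = len(phi)
    ss_phi = sum(p * p for p in phi)  # equals J under the constraint
    theta_hat = sum(yj * pj for yj, pj in zip(y, phi)) / ss_phi
    residuals = [yj - theta_hat * pj for yj, pj in zip(y, phi)]
    sigma2_hat = sum(r * r for r in residuals) / (n_probes - 1)
    se = math.sqrt(sigma2_hat / ss_phi)
    return theta_hat, se

# Toy probe set (hypothetical): four probes with equal sensitivity.
theta_hat, se = expression_index([9.0, 11.0, 10.0, 10.0], [1.0, 1.0, 1.0, 1.0])
```

Because the ϕ's are plugged in as if known, this standard error ignores the uncertainty from estimating them.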
![Page 122: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/122.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Recall: θi Denotes “expression index” for gene in the ith sample.
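Question: Given $\hat{\theta}$'s and SE[$\hat{\theta}$]'s, how would you use them?
Recall, for one array:
$$y_j = \theta \phi_j + \epsilon_j$$
After fitting the model you would have:
$$\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_I$$
$$\mathrm{SE}[\hat{\theta}_1], \mathrm{SE}[\hat{\theta}_2], \ldots, \mathrm{SE}[\hat{\theta}_I]$$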
![Page 123: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/123.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 4. (A) Six arrays of probe set 1,248. (B) Plot of standard error (SE, y axis) vs. θ. The probe pattern (black curve) of array 4 is inconsistent with other arrays, leading to unsatisfactory fitted curve (light) and large standard errors of θ4.
Black curves are PM-MM data. Light curves are fitted model.
![Page 124: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/124.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Recall: ϕj denotes the additional rate of increase (in excess of the MM rate) in PM intensity for probe j.
Question: Given $\hat{\phi}$'s and SE[$\hat{\phi}$]'s, how would you use them?
Recall, for one array:
$$y_j = \theta \phi_j + \epsilon_j$$
After fitting the model you would have:
$$\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_{20}$$
![Page 125: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/125.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 6. (A) Probe 17 of probe set 1,222 is not concordant with other probes (black arrows) and is numerically identified by the outstanding standard error of ϕ17. (B) Plot of standard error (SE, y axis) vs. ϕ.
Black curves are PM-MM data. Light curves are fitted model.
![Page 126: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/126.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 7. (A) Probe set 3,562 has a single high-leverage probe 12, and the fitted light curves almost coincide with the black data curve. (B) ϕ12 is large compared with the other ϕ’ s close-to-zero value. Note that Affymetrix’s superscoring method works here by consistently excluding this probe.
Black curves are PM-MM data. Light curves are fitted model.
![Page 127: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/127.jpg)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Li and Wong note that
“the MM responses do contain information on the expression index, and that this information can only be recovered by analyzing the PM and MM responses separately.”
![Page 128: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/128.jpg)
Processing Probe Level Data
• A number of expression summary measures are obtained using PM and MM probe intensities.
• Recent results suggest that MM may be detecting signal along with PM.
• If this is the case, using MM could introduce noise and give biased estimates of the nominal expression level.
![Page 129: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/129.jpg)
PM and MM values for 20 probes from 12 spike-in arrays from varying concentration experiment plotted vs. concentration
MM May Be Tracking Signal
![Page 130: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/130.jpg)
• Some researchers have suggested using other MM sequences to alleviate this tracking.
• MM could be created by changing more than one base in PM sequence and by placing MM bases in different positions in the MM sequence (Nimblegen chips).
MM May Be Tracking Signal - What to do ?
![Page 131: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/131.jpg)
• Other researchers suggest only using PM (Robust Multiarray Average).
• This approach would allow space currently used for MM to be used for other PM, thus allowing for twice as many sequences of interest to be printed onto an array.
MM May Be Tracking Signal - What to do ?
![Page 132: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/132.jpg)
• Bolstad, et al., Bioinformatics, 2003
• Irizarry, et al., Biostatistics, 2003
• Irizarry, et al., text, 2003
• Irizarry, et al., NAR, 2003
Robust Multi-array Average (RMA)
![Page 133: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/133.jpg)
http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt
OVERVIEW
Uses only PM (ignores MM)
• Adjust for background on the raw intensity scale
• Take log2 of background adjusted PM
• Carry out quantile normalization of log2(PM-BG), with chips in suitable sets
• Conduct a robust multi-array analysis (RMA) of the quantities
RMA: Measures of Expression
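The quantile normalization step can be illustrated with a short sketch: each chip's values are sorted, the k-th smallest values are averaged across chips to form a reference distribution, and every value is replaced by the reference value at its rank. This is a simplified version (ties are broken by position rather than averaged) with hypothetical names and toy data.

```python
def quantile_normalize(arrays):
    """Give every chip the same empirical distribution.

    arrays: list of equal-length lists, one list of intensities per chip.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda idx: a[idx])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Toy example (hypothetical): two chips, three probes.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
```

After this step every chip contains exactly the same set of values, only assigned to different probes.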
![Page 134: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/134.jpg)
Robust Multi-array Average (RMA)
RMA
Background correct, normalize, and log2 the PM intensities. Call this transformation T.
ei = log2 expression on ith array
aj = log2 probe effect for probe j
$$T(PM_{ij}) = e_i + a_j + \epsilon_{ij}$$
$$\epsilon_{ij} \sim N(0, \sigma^2)$$
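An additive model of this kind, with an array effect plus a probe effect on the log2 scale, is fitted robustly in the published RMA method by median polish. The sketch below is a bare-bones median polish for one probe set, with a fixed number of sweeps and hypothetical names; it is meant as an illustration rather than a copy of the Bioconductor code.

```python
from statistics import median

def median_polish(x, n_iter=10):
    """Fit x[i][j] ~ overall + row_i + col_j by repeatedly removing medians.

    x[i][j]: transformed PM value for array i and probe j.
    Returns (expression per array, probe effects, residuals).
    """
    n_rows, n_cols = len(x), len(x[0])
    resid = [row[:] for row in x]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians into the row (array) effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians into the column (probe) effects.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    expression = [overall + r for r in row_eff]
    return expression, col_eff, resid
```

The returned expression values play the role of the e_i and the column effects the role of the a_j; using medians makes the fit resistant to a few outlying probes.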
![Page 135: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/135.jpg)
Robust Multi-array Average (RMA)
Recall dChip:
$$PM_{ij} = \nu_j + \theta_i \beta_j + \epsilon_{ij}$$
$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$$
RMA:
$$T(PM_{ij}) = e_i + a_j + \epsilon_{ij}$$
NAR 2003:
“[Our model] is quite different from the additive model in PM-MM that was found unsatisfactory in Li and Wong, most likely because of the very strong mean variance dependence that would be present in such an additive model.”
![Page 136: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/136.jpg)
Why we take log2
http://biosun01.biostat.jhsph.edu/~ririzarr/Talks/nci-2002.ppt.gz
![Page 137: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/137.jpg)
Dilution Study (www.genelogic.com)
[Design diagram: two tissues (LIVER, CNS); amounts 1.25, 2.5, 5, 7.5, 10, 20 μg; 5 reps per amount; 30 arrays per tissue; 12,626 genes.]
![Page 138: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/138.jpg)
Comparing RMA and MAS 5.0
• Precision of expression estimates (estimated by SD of replicate arrays)
• Consistency of fold change estimates
• Specificity and sensitivity (different methods used to assess DE genes)
![Page 139: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/139.jpg)
• Normalization is done within replicate groups. The assumption that most genes do not change across non-replicate groups does not hold here.
(note that two different normalization methods were used: quantile for RMA and affy.scale.value for MAS)
• Expression measures for RMA and MAS Signal 5.0 were estimated using rma and expresso functions of Bioconductor package Affy
Comparing RMA and MAS 5.0 – Normalization
![Page 140: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/140.jpg)
RMA
![Page 141: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/141.jpg)
MAS 5.0
![Page 142: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/142.jpg)
• Squared correlation coefficient across replicates was calculated over all 120 pairs of replicates ($\binom{5}{2} = 10$ per group of replicates).

| | RMA | MAS 5.0 Signal |
|---|---|---|
| R² | 0.9947 | 0.9917 |

Strong probe affinity implies R² ≈ 1.
• The difference was significant (p-value 1.152560e-07).
Methods and Results
![Page 143: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/143.jpg)
• SD across replicate arrays were computed for all genes.
• LOESS curves were fitted to scatter plot of SD versus mean expressed values.
Methods and Results
![Page 144: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/144.jpg)
Loess curve of SD across replicates for all genes RMA measures
[Figure: SD across replicates (y axis) vs. expression (x axis); curves for MAS 5.0 and RMA.]
![Page 145: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/145.jpg)
• For one gene, fit a line to expression estimate vs. concentration on the log-log scale. Then calculate the “Average lines” (average the slope estimates across genes).
• Since every fold increase in concentration should have the same fold increase in expression measure, a line fitted on log-log scale should have slope 1.
Consistency of fold change
Average slope estimate, MAS 5.0: 0.65
Average slope estimate, RMA: 0.67
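The per-gene slope can be sketched as an ordinary least squares fit of log2 expression on log2 concentration. The function name and toy values are hypothetical, and the sketch assumes expression is supplied on the raw scale (if a measure is already on the log2 scale, the second log should be skipped).

```python
import math

def loglog_slope(concentrations, expressions):
    """OLS slope of log2(expression) regressed on log2(concentration)."""
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(e) for e in expressions]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Toy gene (hypothetical): expression exactly proportional to concentration.
conc = [1.25, 2.5, 5.0, 10.0, 20.0]
slope_ideal = loglog_slope(conc, [8.0 * c for c in conc])
# Toy gene whose signal is compressed: expression grows like sqrt(concentration).
slope_compressed = loglog_slope(conc, [math.sqrt(c) for c in conc])
```

A slope below 1 indicates that fold changes in the expression measure understate fold changes in concentration.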
![Page 146: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/146.jpg)
• Consistency of fold change was examined by comparing fold change estimates between arrays with different concentrations of target mRNA.
• Slopes over all genes for two different conditions of average expression versus concentration were calculated and on average were:
| | RMA | MAS Signal 5.0 |
|---|---|---|
| Liver tissue | 0.53 | 0.53 |
| CNS samples | 0.56 | 0.59 |
Consistency of fold change
![Page 147: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/147.jpg)
• Fold changes between CNS and liver tissue were estimated for all genes using 10 arrays in the lowest and 10 arrays in the highest concentration group.
• Number of genes showing inconsistency of fold change estimate by at least 2-fold:
RMA: 23
MAS 5.0: 81
Consistency of fold change
![Page 148: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/148.jpg)
Irizarry, et al., 2003
RMA: fold change estimate for 20 μg vs. fold change estimate for 1.25 μg
![Page 149: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/149.jpg)
MAS 5.0: fold change estimate for 20 μg vs. fold change estimate for 1.25 μg
Irizarry, et al., 2003
![Page 150: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/150.jpg)
• In general it appears that RMA has better precision and similar accuracy as MAS Signal 5.0.
• RMA had slightly better consistency of fold change estimate.
Conclusions
![Page 151: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/151.jpg)
• 11 control cRNAs spiked in at different concentrations on each array. Other genes should be the same across arrays.
• Choose 10 pairs of arrays from spike-in experiment.
• Compute FC for each gene under RMA, dChip, MAS 5.0.
• For some cut-off C, compute proportion of non-spiked genes where FC > C (false positives) and proportion of spiked genes where FC > C (true positives).
Specificity and sensitivity – Spike-in data
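One point on such an ROC-style curve can be sketched as below: for a given cut-off, count the non-spiked genes exceeding it (false positives) and the spiked genes exceeding it (true positives), each as a proportion of its own group. Gene names, fold changes, and the function name are all hypothetical.

```python
def roc_point(fold_changes, spiked, cutoff):
    """Return (false positive proportion, true positive proportion).

    fold_changes: dict mapping gene name -> fold change estimate.
    spiked: set of gene names that were spiked in.
    """
    n_spiked = sum(1 for g in fold_changes if g in spiked)
    n_null = len(fold_changes) - n_spiked
    tp = sum(1 for g, fc in fold_changes.items() if g in spiked and fc > cutoff)
    fp = sum(1 for g, fc in fold_changes.items() if g not in spiked and fc > cutoff)
    return fp / n_null, tp / n_spiked

# Toy data (hypothetical): two spiked genes and four non-spiked genes.
fc = {"spike1": 4.0, "spike2": 1.5, "g1": 1.1, "g2": 2.5, "g3": 0.9, "g4": 1.0}
false_pos, true_pos = roc_point(fc, {"spike1", "spike2"}, cutoff=2.0)
```

Sweeping the cut-off over a grid and plotting the resulting pairs traces out the curve used to compare RMA, dChip, and MAS 5.0.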
![Page 152: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/152.jpg)
Irizarry, et al., 2003
Fold change for Affymetrix Spike-in experiment
![Page 153: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/153.jpg)
Irizarry, et al., 2003
Test Statistic for Affymetrix Spike-in experiment
![Page 154: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/154.jpg)
http://www.bioconductor.org/workshops/JAX02/jax-B.pdf
![Page 155: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/155.jpg)
• Overall RMA does better than Li & Wong (dChip), which in turn does better than MAS 5.0 using FC.
• The simple t = est log FC / SE(est log FC) seems best for use with MAS and RMA.
• MAS looks bad here because we use single chip summaries in our analysis. They need a multi-chip version of their Signal Log Ratio. When done, it will look like the final step in RMA.
• With RMA and Li & Wong, nominal SEs are not as good as observed ones and p-values are better than (log) fold change.
Conclusions from replicate chip ROC curves
![Page 156: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/156.jpg)
Figure 5. Box plots showing the distribution of observed fold changes for non-spiked in genes. The different colors represent the different quantiles. The relationship of color and quantile is demonstrated in the first box from the left.
Log Fold Change of Non-Differentially-Expressed Genes
Irizarry, et al., 2003
![Page 157: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/157.jpg)
Conclusions from single chip comparison ROC curves
http://www.bioconductor.org/workshops/JAX02/jax-B.pdf
• On the basis of the data just presented, and much more:
• With FC, RMA is best, LW (Li & Wong) next. MAS does not do well here.
• With p-values, RMA is as good as, and usually better than, MAS, which is next. MAS does best on Affymetrix spike-in data sets. LW (dChip) does not do so well here.
• All judgments are comparative. Everyone does well in absolute terms, but some do better.
![Page 158: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/158.jpg)
• In general it appears that RMA has better precision and similar accuracy as MAS Signal 5.0.
• RMA had slightly better consistency of fold change estimate.
More Conclusions
![Page 159: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/159.jpg)
Comment on MM
Irizarry, et al., 2003
NAR, 2003:
“It is possible that information about non-specific binding is contained in the MM values, but empirical results demonstrate that mathematical subtraction does not translate to biological subtraction. We have found that, until a better solution is proposed, simply ignoring these values is preferable.”
![Page 160: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/160.jpg)
METHODS TO IDENTIFY DE GENES
![Page 161: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/161.jpg)
Mult-t
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments
by Dudoit, Yang, Callow, and Speed
Statistica Sinica 12 (2002), 111-139.
**Additional Details in Parmigiani et al., 2003.
![Page 162: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/162.jpg)
Mult-t : Outline
• Data: AI knockout & SRBI transgenic mice. AI and SRBI are two genes involved in HDL metabolism.
• Image: Segmentation and background correction (Yang et al.).
• Normalization: Spatial and intensity dependent effects.
• Gene summary: Construction of t-statistic for each gene. Evaluation of the statistic at a gene uses only data at that gene.
• Hypothesis test at each gene (accounts for multiple tests).
![Page 163: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/163.jpg)
Mult-t : Normalization
Use lowess() to fit curves through the points grouped by print tip.
$$(\log_2 R/G)' = \log_2 R/G - c_j(A)$$
$c_j(A)$ is the lowess() fit to M vs. A for print tip j.
![Page 164: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/164.jpg)
Mult-t : Gene Specific Summary
Compute Welch t-statistic for every gene
• Tj and tj: Random variable and realization of random variable for every gene j.
• Hj0: j th null is true.
• Hj1: j th null is false.
$$t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{\sqrt{\dfrac{s_{1j}^2}{n_1} + \dfrac{s_{2j}^2}{n_2}}}$$
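As a minimal sketch of the gene-specific summary, the Welch statistic for one gene is the difference in group means divided by the square root of the summed per-group variances over their sample sizes. The function name, the sign convention (group 2 minus group 1), and the example log-ratios are assumptions made for illustration, not values from the slides.

```python
import math
from statistics import mean, variance


def welch_t(group1, group2):
    """Welch t-statistic for one gene: (mean2 - mean1) / sqrt(s1^2/n1 + s2^2/n2)."""
    n1, n2 = len(group1), len(group2)
    s1_sq = variance(group1)  # sample variance, n - 1 denominator
    s2_sq = variance(group2)
    return (mean(group2) - mean(group1)) / math.sqrt(s1_sq / n1 + s2_sq / n2)


# Hypothetical normalized log-ratios for a single gene on 4 + 4 arrays.
control = [0.1, -0.2, 0.0, 0.1]
treatment = [1.1, 0.9, 1.3, 0.7]
t_j = welch_t(control, treatment)
```

Applying the function row by row to the genes × samples matrix gives one statistic per gene, using only the data at that gene.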
![Page 165: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/165.jpg)
Mult-t : Hypothesis tests
Given evaluated test statistics, which are unusually large in magnitude ?
Informal assessment:
QQ plots
MA plots
Other (numerator vs. denominator of t-stat)
More precise assessment:
For Hj, compute pj = P( |Tj| > |tj| | Hj0 ), then determine how small pj must be to reject, given that many (m) tests are done.
![Page 166: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/166.jpg)
Mult-t : Hypothesis tests
[Diagram: data matrix with genes 1, 2, ..., m as rows and samples 1, 2, ..., n as columns; entry Xji.]
j = 1, 2, …, m genes (6384)
i = 1, 2, …, n samples (16); n1 + n2 = n (n1 = n2 = 8)
Xji = log2 (Rji/Gji) is the relative (transformed, normalized, and background corrected) expression level for the jth gene on the ith array.
![Page 167: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/167.jpg)
Mult-t : Hypothesis tests
H1: μ11 = μ12
H2: μ21 = μ22
:
Hm: μm1 = μm2
[Diagram: each row j of the genes × samples data matrix yields a test statistic tj, which is used to test Hj (j = 1, 2, ..., m).]
![Page 168: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/168.jpg)
Determine distribution of test statistics under null
For n reasonably large,
T-stat ~ $t_\nu$
![Page 169: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/169.jpg)
Mult-t : Determine distribution of test statistics under null
Since n is generally not large in microarray experiments, build up the distribution of the test statistics under the null via permutation.
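[Diagram: the columns (samples 1, ..., n1, n1+1, ..., n) of the genes × samples data matrix are permuted B times; recomputing the statistics for each permutation gives an m × B matrix with entries tj,b (j = 1, ..., m; b = 1, ..., B).]
$$p_j^* = \frac{1}{B} \sum_{b=1}^{B} I\left( |t_{j,b}| > |t_j| \right)$$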
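The permutation p-value for a single gene can be sketched as follows. This is an illustrative Monte Carlo version that samples B random relabelings instead of enumerating them, and it counts ties with `>=` so that the observed labeling counts toward its own p-value (the slide writes a strict inequality). The function names and the example data are assumptions.

```python
import math
import random
from statistics import mean, variance


def welch_t(g1, g2):
    """Welch t-statistic: (mean2 - mean1) / sqrt(s1^2/n1 + s2^2/n2)."""
    return (mean(g2) - mean(g1)) / math.sqrt(
        variance(g1) / len(g1) + variance(g2) / len(g2)
    )


def permutation_pvalue(values, n1, B=1000, seed=0):
    """Two-sided permutation p-value for one gene.

    values: expression levels, the first n1 entries belonging to group 1.
    """
    rng = random.Random(seed)
    t_obs = abs(welch_t(values[:n1], values[n1:]))
    count = 0
    for _ in range(B):
        shuffled = values[:]
        rng.shuffle(shuffled)  # random reassignment of the group labels
        t_b = abs(welch_t(shuffled[:n1], shuffled[n1:]))
        if t_b >= t_obs:
            count += 1
    return count / B


# Hypothetical gene with a clear group difference (4 + 4 arrays).
expr = [0.1, -0.2, 0.0, 0.1, 1.1, 0.9, 1.3, 0.7]
p_value = permutation_pvalue(expr, n1=4, B=2000, seed=1)

# Hypothetical gene with no group difference.
null_expr = [0.1, 1.1, -0.2, 0.9, 0.0, 1.3, 0.1, 0.7]
p_null = permutation_pvalue(null_expr, n1=4, B=2000, seed=1)
```

Running the same loop over every gene, with the same relabelings for all genes, fills the m × B matrix of permutation statistics.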
![Page 170: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/170.jpg)
Mult-t : Notes on Permutations
Computationally, getting the distribution of test statistic via permutations is reasonable. Getting the distribution of the p-values might not be.
If you have, say, 6 samples total (3 in each group), what’s the smallest p-value you could obtain via permutations ?
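One way to reason about the question above is to count labelings. The sketch below assumes that all distinct group assignments are enumerated, that the statistic is two-sided (|t|), and that the observed labeling counts toward its own p-value; the function name is hypothetical.

```python
from math import comb


def min_two_sided_pvalue(n1, n2):
    """Smallest attainable two-sided permutation p-value under full enumeration."""
    total = comb(n1 + n2, n1)  # number of distinct ways to assign group labels
    # With equal group sizes, swapping the two group labels flips the sign of t
    # but leaves |t| unchanged, so at least two labelings reach the observed |t|.
    ties = 2 if n1 == n2 else 1
    return ties / total


print(comb(6, 3))                   # 20 distinct labelings for 3 + 3 samples
print(min_two_sided_pvalue(3, 3))   # 0.1
print(min_two_sided_pvalue(8, 8))   # 2 / 12870
```

Under these assumptions, a 3 vs. 3 design cannot produce a two-sided permutation p-value below 0.1, whatever the data look like.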
![Page 171: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/171.jpg)
Mult-t : Adjusting for multiple tests
• Family Wise Error Rate (FWER): probability of at least one type I error for all tests considered.
• Goal: Control FWER
Strong Control: Control for any combination of true and false nulls.
Weak Control: Control for the complete null (all nulls true).
![Page 172: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/172.jpg)
Mult-t : Adjusting for multiple tests
• Procedures to control FWER:
Bonferroni:
Reject Hj if $p_j < \alpha/m$
$p_j^* = \min(m\,p_j, 1)$
Sidak:
$p_j^* = 1 - (1 - p_j)^m$
Westfall & Young:
pj*’s obtained via reordering of permutation matrix.
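The two closed-form adjustments can be sketched in a few lines. The function names and the raw p-values are illustrative assumptions; a gene is then rejected when its adjusted p-value falls below the chosen FWER level.

```python
def bonferroni(pvalues):
    """Bonferroni adjusted p-values: min(m * p, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def sidak(pvalues):
    """Sidak adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


# Hypothetical raw p-values for m = 4 tests.
raw = [0.0001, 0.004, 0.03, 0.3]
bonf_adj = bonferroni(raw)
sidak_adj = sidak(raw)
```

For the same raw p-values the Sidak adjustment is never larger than the Bonferroni adjustment, which is why it is described as slightly less conservative.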
![Page 173: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/173.jpg)
Mult-t : Westfall and Young’s Procedure
Order observed t-statistics: $|t_{r_m}| \le |t_{r_{m-1}}| \le \dots \le |t_{r_2}| \le |t_{r_1}|$
[Diagram: the m × B matrix of permutation statistics $t_{j,b}$ is reordered, column by column, into an m × B matrix of successive maxima $u_{j,b}$.]
$$u_{m,b} = |t_{r_m,b}|$$
$$u_{m-1,b} = \max\left(u_{m,b}, |t_{r_{m-1},b}|\right)$$
$$\vdots$$
$$u_{1,b} = \max\left(u_{2,b}, |t_{r_1,b}|\right)$$
![Page 174: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/174.jpg)
Mult-t : Westfall and Young’s Step Down Max T Procedure
Order observed t-statistics: $|t_{r_m}| \le |t_{r_{m-1}}| \le \dots \le |t_{r_2}| \le |t_{r_1}|$
[Diagram: the m × B matrix of permutation statistics $t_{j,b}$ is reordered into the m × B matrix of successive maxima $u_{j,b}$.]
$$p_{r_j}^* = \frac{1}{B} \sum_{b=1}^{B} I\left( |u_{j,b}| > |t_{r_j}| \right)$$
...and enforce monotonicity
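The step-down max T computation can be sketched as below, assuming the observed statistics and the m × B permutation statistics are already available. This is an illustrative version, not the authors' code: it counts ties with `>=` rather than the strict inequality on the slide, and the function name and the toy numbers are assumptions.

```python
def maxt_adjusted_pvalues(t_obs, t_perm):
    """Step-down max T adjusted p-values.

    t_obs:  list of m observed statistics.
    t_perm: list of B lists, each holding the m statistics from one permutation.
    """
    m, B = len(t_obs), len(t_perm)
    # order[0] is r_1 (largest |t|), order[m - 1] is r_m (smallest |t|).
    order = sorted(range(m), key=lambda j: abs(t_obs[j]), reverse=True)
    counts = [0] * m
    for perm in t_perm:
        running_max = 0.0
        # Build the successive maxima u_{m,b}, u_{m-1,b}, ..., u_{1,b}.
        for k in range(m - 1, -1, -1):
            running_max = max(running_max, abs(perm[order[k]]))
            if running_max >= abs(t_obs[order[k]]):
                counts[k] += 1
    adjusted = [c / B for c in counts]
    # Enforce monotonicity: adjusted p-values cannot decrease down the ordering.
    for k in range(1, m):
        adjusted[k] = max(adjusted[k], adjusted[k - 1])
    result = [0.0] * m
    for k, j in enumerate(order):
        result[j] = adjusted[k]  # map back to the original gene order
    return result


# Toy example: m = 3 genes, B = 4 permutations (hypothetical numbers).
t_obs = [5.0, 0.5, 2.0]
t_perm = [
    [1.0, 0.2, 0.3],
    [0.4, 2.5, 0.1],
    [0.3, 0.6, 1.0],
    [6.0, 0.1, 0.2],
]
adj = maxt_adjusted_pvalues(t_obs, t_perm)
print(adj)  # [0.25, 0.5, 0.25]
```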
![Page 175: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/175.jpg)
Mult-t : Westfall and Young’s Step Down Max T Procedure
• Less conservative than the Bonferroni, Sidak, and Holm procedures
• Provides Strong Control of FWER
• Max T = Min P when the t-statistics are identically distributed. Generally, this is not the case; and, again, the minP algorithm is more computationally intensive.
![Page 176: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/176.jpg)
Mult-t : Data
Experiment 1
• 8 AI knockouts (Cy5), 1 reference (Cy3)
• 8 normals (Cy5), 1 reference (Cy3)
Experiment 2
• 8 SRBI transgenics (Cy5), 1 reference (Cy3)
• 8 normals (Cy5), 1 reference (Cy3)
6382 probes
257 (~4%) related to lipid metabolism
Reference: Pooled cDNA from 8 normals
Q: What does this mean for permutation tests ?
![Page 177: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/177.jpg)
Mult-t : Histogram and QQ plot of t-statistics
![Page 178: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/178.jpg)
Mult-t : Max T adjusted and unadjusted p-values
![Page 179: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/179.jpg)
Comments on Mult-t
• Welch’s t-statistic is used. Welch proposed it as a solution to the Behrens-Fisher problem. Implicit assumptions guide the choice of the test statistic even though “no assumptions are made regarding distribution of the test statistics”.
• Permutations are advantageous for a number of reasons, but do not provide useful results when sample sizes are small.
• Permutation test not valid for this experimental design.
• Method is compared with single slide methods !
• Page 132, Newton et al. ... false positives... Consider definition of a false positive here !
![Page 180: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/180.jpg)
Mult-t : Comparison of methods
SRBI data. Newton et al. (orange); Chen et al. (purple)
![Page 181: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/181.jpg)
Analysis of Variance for Microarrays (ANOVA)
Analysis of Variance for Gene Expression Microarray Data
by Kerr, Martin, and Churchill
Journal of Computational Biology 7: 819-837, 2000.
Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments
by Kerr and Churchill
PNAS 98 (16): 8961 - 8965, 2001.
**Additional Details in Parmigiani et al., 2003.
![Page 182: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/182.jpg)
ANOVA : Outline
• Data: Human liver and human muscle tissue hybridized to two cDNA arrays. Final data set had 1286 spots.
• Normalization via terms in ANOVA model (“global analysis”)
• Gene summary: Construction of statistic for each gene. Evaluation of the statistic uses data from that gene (“local analysis”).
• Hypothesis test at each gene (uses bootstrap; does not account for multiple tests).
![Page 183: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/183.jpg)
ANOVA: Model Development
[Diagram: experimental layout; liver and muscle cDNA samples assigned to arrays 1 and 2.]
A; i indexes array ( i=1,2 )
D; j indexes dye ( j=1,2 )
V; k indexes variety ( k = 1,2 )
G; g indexes gene (g = 1, 2, ..., N = 1286)
![Page 184: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/184.jpg)
ANOVA : Model
$$\log(y_{ijkg}) = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \epsilon_{ijkg}$$
μ - overall average signal (*)
A - array (*)
D - dye (*)
V - variety (i.e., condition or tissue)
G - gene
AG - array by gene interaction (spot effect)
VG - variety by gene interaction (DE if $(VG)_{1g} \ne (VG)_{2g}$)
* Normalization
![Page 185: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/185.jpg)
ANOVA : Gene Specific Summary
Obtain parameter estimates via least squares.
$\widehat{VG}_{1g} - \widehat{VG}_{2g}$ is of most interest when the goal is to identify DE genes.

| Source | df | SS | MS |
|---|---|---|---|
| Array | 1 | 92.34 | 92.34 |
| Dye | 1 | 0.74 | 0.74 |
| Variety | 1 | 2.97 | 2.97 |
| Gene | 1285 | 1885.89 | 1.47 |
| AG | 1285 | 160.01 | 0.12 |
| VG | 1285 | 1357.28 | 1.06 |
| Residual | 1285 | 82.75 | 0.0644 |
| Corrected Total | 5143 | 3581.99 | |

(Table 3, page 23)
![Page 186: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/186.jpg)
ANOVA : Hypothesis tests
Given evaluated test statistics, which are unusually large in magnitude ?
Informal assessment:
Plots of $\widehat{VG}_{1g} - \widehat{VG}_{2g}$
More precise assessment:
Bootstrap to obtain confidence intervals for $VG_{1g} - VG_{2g}$.
![Page 187: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/187.jpg)
ANOVA : Bootstrap to Identify DE genes
Calculate residuals: $\log(y_{ijkg}) - \widehat{\log(y_{ijkg})}$; distribution of residuals => $f$
Sample from $C f$ to get $b^*$
The scale factor $C$ ensures that the empirical distribution has variance equal to that of the true residuals (Wu, 1986, Annals of Statistics).
Simulate data: $\log(y_{ijkg})^* = \widehat{\log(y_{ijkg})} + b^*$
Fit the model to the simulated data and calculate $\widehat{VG}^*_{1g} - \widehat{VG}^*_{2g}$
![Page 188: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/188.jpg)
ANOVA : Comments on ANOVA Approach
No adjustments for multiple tests !
The authors state “this may or may not be necessary based on the intended purpose of the analysis” (page 8).
ANOVA modelling framework provides a method of normalization* by accounting for array, dye, gene, ... effects.
Residual distribution on log scale is non-normal, but constant error variance assumption is not grossly violated.
![Page 189: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/189.jpg)
Significance Analysis of Microarrays (SAM)
Significance Analysis of Microarrays Applied to the Ionizing Radiation Response
by Tusher, Tibshirani, and Chu
PNAS 98 (9): 5116-5121, 2001.
**Additional Details in Parmigiani et al., 2003.
![Page 190: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/190.jpg)
SAM : Outline
• Data: 2 wild type human lymphoblastoid cell lines (1,2) harvested in unirradiated or irradiated (U,I) state 4 hours after treatment. RNA samples were labelled and divided into two identical aliquots (A,B) prior to hybridization onto Affy chips. (U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B).
• Normalization via reference set obtained from average of intensity values across subsets of arrays.
• Gene summary: Construction of statistic for each gene. Evaluation of the statistic uses data from the entire array.
• Hypothesis test at each gene (accounts for multiple tests).
![Page 191: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/191.jpg)
SAM : Normalization
• Generate reference set by averaging each gene expression level across the 8 hybridizations.
• Make a cube root scatter plot of the intensity values from each data set against the reference (the cube root handles negative values, and Tusher et al. report that it resolved the vast majority of lowly expressed genes).
• A linear least squares fit to the cube root scatter plot is used to calibrate each hybridization.
![Page 192: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/192.jpg)
SAM : Post Normalization
![Page 193: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/193.jpg)
SAM : Gene Specific Summary
The relative difference measure $d_j$ for gene j:
$$d_j = \frac{\bar{x}_{1j} - \bar{x}_{2j}}{s_j + s_0}$$
To ensure that the variance of $d_j$ is independent of gene expression, $s_0$ (a small positive constant) is added to the denominator.
PNAS manuscript: The coefficient of variation of $d_j$ was computed as a function of $s_j$ in moving windows across the data and $s_0$ was chosen to minimize the CV.
Parmigiani et al. text: “adaptively chosen”. Taken as median of all s (i).
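A minimal sketch of the relative difference for one gene follows. The slide does not define $s_j$; the pooled standard error used here follows the description in Tusher et al. (2001) and should be read as an assumption, as should the sign convention (group 1 minus group 2), the function name, and the example values.

```python
import math
from statistics import mean


def sam_d(group1, group2, s0):
    """Relative difference d = (mean1 - mean2) / (s + s0) for one gene."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pooled sum of squared deviations around each group mean.
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


# Hypothetical expression values for one gene in two states.
state1 = [2.0, 2.2, 1.8, 2.0]
state2 = [1.0, 1.1, 0.9, 1.0]
d_plain = sam_d(state1, state2, s0=0.0)   # ordinary t-like statistic
d_fudged = sam_d(state1, state2, s0=0.1)  # damped by the constant s0
```

Comparing `d_plain` with `d_fudged` shows the effect of $s_0$: genes with a very small $s_j$ no longer produce extreme statistics from small mean differences.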
![Page 194: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/194.jpg)
SAM Procedure to Identify DE Genes
[Diagram: the m × 8 data matrix (genes 1, ..., m by samples 1, ..., 8) is permuted B = 36 times; recomputing the statistics for each permutation gives an m × 36 matrix with entries djb.]
To minimize potentially confounding effects between the two cell lines, they analyzed data by using 36 balanced permutations. A permutation is considered balanced for cell lines 1 and 2 if each group of 4 experiments contained two experiments from cell line 1 and 2 from cell line 2.
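The count of 36 can be checked by enumeration. The sketch below uses the sample names from the outline (state, cell line, aliquot) and keeps a split into two groups of 4 only when each group contains two samples from each cell line; the helper name is an assumption.

```python
from itertools import combinations

samples = ["U1A", "U1B", "U2A", "U2B", "I1A", "I1B", "I2A", "I2B"]


def cell_line(name):
    """Second character of the sample name is the cell line: '1' or '2'."""
    return name[1]


balanced = []
for group in combinations(samples, 4):
    n_line1 = sum(cell_line(s) == "1" for s in group)
    if n_line1 == 2:  # the complementary group then also has two from each line
        rest = tuple(s for s in samples if s not in group)
        balanced.append((group, rest))

print(len(balanced))  # 36
```

The count is C(4,2) × C(4,2) = 36: two of the four cell line 1 samples and two of the four cell line 2 samples are chosen for the first group. The observed labeling (all U vs. all I) is one of the 36.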
![Page 195: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/195.jpg)
SAM Procedure to Identify DE Genes
[Diagram: each column of the m × 36 matrix of permutation statistics djb is sorted, giving the order statistics d(1)b, d(2)b, ..., d(m)b for every permutation b = 1, ..., 36.]
$$d_{E,j} = \frac{1}{36} \sum_{b=1}^{36} d_{(j)b}$$
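The expected order statistics are column-sorted averages, which can be sketched as follows. The function name and the toy numbers are assumptions, and ascending order is used here; any consistent ordering works as long as the observed statistics are sorted the same way.

```python
def expected_order_statistics(d_perm):
    """Average the j-th order statistic over permutations.

    d_perm: list of B lists, each holding the m statistics from one permutation.
    """
    B = len(d_perm)
    sorted_cols = [sorted(col) for col in d_perm]  # sort within each permutation
    m = len(sorted_cols[0])
    return [sum(col[j] for col in sorted_cols) / B for j in range(m)]


# Toy example with m = 3 genes and B = 2 permutations (hypothetical numbers).
d_perm = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
d_expected = expected_order_statistics(d_perm)
print(d_expected)  # [-0.75, 0.25, 1.75]
```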
![Page 196: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/196.jpg)
SAM Procedure to Identify DE Genes
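Plot observed, ordered d(j) against dE,j
[Plot: observed ordered statistics d(j) (vertical axis) against expected values dE,j (horizontal axis), with a band of width 2Δ around the line d(j) = dE,j and upper and lower cutoffs u and l; genes beyond the cutoffs are labelled DE genes.]
* u need not equal | l |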
![Page 197: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/197.jpg)
SAM: Estimate the False Discovery Rate (FDR)
Example from Tusher et al. 2001. 46 genes identified as DE using Δ = 1.2. For permutation 1, figure out how many genes you would have rejected using this Δ, treating dj1 ( j = 1,2,...,m) as the data.
Repeat for every permutation and calculate the average number of false positives. Average = 8.4 => FDR = 8.4 / 46 = 0.183 (18.3%).
[Plot: ordered statistics d(j)1 from permutation 1 against dE,j, with cutoffs u and l. 5 FD’s for this set.]
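The FDR estimate above is a ratio of two counts, which can be sketched as below. The cutoffs `lower` and `upper` stand in for the l and u obtained from Δ; the function names and the toy statistics are assumptions, and only the final arithmetic check uses numbers from the slide.

```python
def count_called(d_values, lower, upper):
    """Number of statistics falling beyond the lower or upper cutoff."""
    return sum(1 for d in d_values if d > upper or d < lower)


def estimate_fdr(d_obs, d_perm, lower, upper):
    """Average number of calls in permuted data divided by calls in observed data."""
    n_called = count_called(d_obs, lower, upper)
    false_counts = [count_called(col, lower, upper) for col in d_perm]
    avg_false = sum(false_counts) / len(false_counts)
    return avg_false / n_called if n_called else 0.0


# Toy example: 6 genes, 3 permutations (hypothetical numbers).
d_obs = [4.0, 3.5, -3.2, 0.5, -0.1, 1.0]
d_perm = [
    [3.1, 0.2, -0.5, 1.0, -1.2, 0.3],   # 1 call
    [0.1, -0.4, 0.8, 2.0, -2.9, 1.5],   # 0 calls
    [-3.5, 3.3, 0.0, 0.7, -0.6, 1.1],   # 2 calls
]
fdr = estimate_fdr(d_obs, d_perm, lower=-3.0, upper=3.0)  # (1 + 0 + 2) / 3 / 3

# Arithmetic from the slide: 8.4 average false positives among 46 called genes.
print(round(8.4 / 46, 3))  # 0.183
```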
![Page 198: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/198.jpg)
SAM: Defining s0 and Δ
s0 is chosen to make the CV of dj approximately constant as a function of sj. This dampens large values of dj that arise from genes with very small sj. Generally, a constant CV (or approximately constant CV) is assumed in models of microarray data.
To determine Δ, fix the type I error rate α. Calculate $\widehat{FDR}_1, \widehat{FDR}_2, \dots, \widehat{FDR}_n$ for n values of Δ. Take the smallest Δ* such that $\widehat{FDR}_{\Delta^*} < \alpha$.
There are other suggestions for calculating Δ (page ~281 of the Parmigiani text, 2003) that involve controlling the pFDR. Control of pFDR is becoming more common.
![Page 199: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/199.jpg)
SAM: Comments by Tusher et al. 2001
• Dudoit et al. 2002 method (using step down max T) is too conservative. It found zero genes for this data set !
• 8 arrays are not enough for p-values based on permutations such as those done in Dudoit et al. 2002.
• SAM does not have strong or weak control of FDR.
• SAM estimates FDR. The estimate can be > 1 .
![Page 200: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/200.jpg)
SAM: Comments on Tusher et al. 2001
• They stress the well known problems that arise from using fold change (nice reference to cite).
• The optimal ways in which to determine s0 and Δ are open problems. They have been addressed. See Parmigiani text, 2003.
• Application of SAM methodology to more than two conditions has not been evaluated. Utility will rely on construction of a good statistic (that can be hard).
• Intuitive approach. Implemented in Excel and R.
• SAM determines Δ and calculates FDR using the same data. This could introduce a bias. See page ~282 of Parmigiani text, 2003.
![Page 201: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/201.jpg)
False Discovery Rate
| | Do not reject H0 | Reject H0 | Total |
|---|---|---|---|
| H0 true | U | V | m0 |
| H0 false | T | S | m-m0 |
| Total | m-R | R | m |

FDR: E(Q), where Q = V/R if R > 0 and Q = 0 if R = 0
E(Q) = E (Q | R > 0) Pr (R > 0)
![Page 202: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/202.jpg)
Benjamini - Hochberg Procedure to Control the FDR
• Let P1,…,Pm denote the p-values from m tests.
• Order the p-values: $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$.
• Let $k^* = \max\{ k : P_{(k)} \le \alpha k/m \}$.
• Reject all the null hypotheses for which $P_i \le P_{(k^*)}$.
• This ensures FDR $\le \alpha (m_0/m) \le \alpha$.
• Result does not depend on m0 (the number of true nulls) or the distribution of p-values under H1.
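The procedure above can be sketched in a few lines. The function name, the default level, and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_star = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:k_star]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from m = 8 tests: only the two smallest pass.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh_result = benjamini_hochberg(pvals, alpha=0.05)

# Step-up behaviour: 0.02 misses its own threshold (0.05 * 1/3) but is still
# rejected because a larger ordered p-value passes its threshold.
step_up = benjamini_hochberg([0.02, 0.03, 0.04], alpha=0.05)
```

The second example illustrates why k* is defined as a maximum: every hypothesis with a p-value at or below the k*-th ordered p-value is rejected, even if its own comparison failed.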
![Page 203: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/203.jpg)
Benjamini - Hochberg Procedure to Control the FDR
[Plot: ordered p-values p(k) (vertical axis) against k/m (horizontal axis, ticks at 1/m, 20/m, 40/m, ..., 1), with a line of slope α; k*/m marks k* = max {k : p(k) ≤ αk/m}, 1 ≤ k ≤ m.]
![Page 204: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/204.jpg)
Empirical Bayes for Microarrays (EBarrays)
On Differential Variability of Expression Ratios: Improving Statistical Inference About Gene Expression Changes from Microarray Data
by M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui
Journal of Computational Biology 8: 37-52, 2001.
On Parametric Empirical Bayes Methods for Comparing Multiple Groups Using Replicated Gene Expression Profiles
by C.M. Kendziorski, M.A. Newton, H. Lan and M.N. Gould
Statistics in Medicine, to appear, 2003.
**Additional Details in Parmigiani et al., 2003.
![Page 205: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/205.jpg)
EBarrays: Outline
• Data: E. coli under 3 treatments, 1 control; 4 cDNA arrays, ~4200 spots. Rat mammary glands from parentals and congenics; 24 Affymetrix chips, ~26,000 intensities.
• Model Development: Hierarchical Mixture Model accounts for known sources of variability.
• Normalization: EBarrays assumes data has been normalized for effects within and between arrays.
• Gene summary: Posterior probability of DE for each gene. Evaluation uses data across the entire array.
• Hypothesis test at each gene (“naturally” accounts for multiple tests).
![Page 206: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/206.jpg)
EBarrays: Data
E.coli K-12 cell lines: 4 samples labelled in red (control, IPTG-a, IPTG-b, HS) and 4 in green (all control).
![Page 207: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/207.jpg)
EBarrays: Data
10 Affy chips from non-treated; 14 from treated (DMBA)
![Page 208: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/208.jpg)
EBarrays: Model Development
Measurement error:
$$x_i \sim G(a, \theta_{x,i}), \qquad y_i \sim G(a, \theta_{y,i})$$
Actual expression:
$$\theta_{x,i}, \theta_{y,i} \sim IG(a_0, \nu)$$
![Page 209: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/209.jpg)
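EBarrays: Model Development
Measurement error:
$$x_i \sim G(a, \theta_{x,i}), \qquad y_i \sim G(a, \theta_{y,i})$$
Actual expression:
$$\theta_{x,i}, \theta_{y,i} \sim IG(a_0, \nu)$$
$$z_i = \begin{cases} 1 & \text{if } \theta_{x,i} \ne \theta_{y,i} \\ 0 & \text{if } \theta_{x,i} = \theta_{y,i} \end{cases} \qquad Z \sim B(p)$$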
![Page 210: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/210.jpg)
EBarrays: Model Fit
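$$p_A(x_i, y_i \mid a, a_0, \nu) = p(x_i \mid a, a_0, \nu)\, p(y_i \mid a, a_0, \nu)$$
$$p_0(x_i, y_i \mid a, a_0, \nu) = \int_0^\infty p(x_i \mid \theta)\, p(y_i \mid \theta)\, p(\theta)\, d\theta$$
$$l_c(p, a, a_0, \nu) = \sum_k \left\{ z_k \ln p_A(x_k, y_k) + (1 - z_k) \ln p_0(x_k, y_k) + z_k \ln(p) + (1 - z_k) \ln(1 - p) \right\}$$
E-step:
$$\hat{z}_k = P(z_k = 1 \mid x_k, y_k, p, a, a_0, \nu) = \frac{p\, p_A(x_k, y_k)}{p\, p_A(x_k, y_k) + (1 - p)\, p_0(x_k, y_k)}$$
M-step: Maximize the resulting form in $(p, a, a_0, \nu)$.
Here
$$p(x_i \mid a, a_0, \nu) = \int_0^\infty p(x_i \mid \theta_{x,i})\, p(\theta_{x,i})\, d\theta_{x,i}$$
and
$$p(y_i \mid a, a_0, \nu) = \int_0^\infty p(y_i \mid \theta_{y,i})\, p(\theta_{y,i})\, d\theta_{y,i}$$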
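The E-step is a per-gene application of Bayes' rule, and the update of the mixing proportion p in the M-step has a closed form (the mean of the E-step values). The sketch below covers only these two pieces; the marginal density values passed in are hypothetical inputs, the function names are assumptions, and the update of the remaining model parameters is not shown.

```python
def e_step(p, p_a, p_0):
    """Posterior probability of differential expression for each gene.

    p:   current mixing proportion P(z = 1).
    p_a: marginal density of each gene's data under the alternative.
    p_0: marginal density of each gene's data under the null.
    """
    return [p * a / (p * a + (1 - p) * b) for a, b in zip(p_a, p_0)]


def m_step_p(z_hat):
    """Closed-form update for p: maximizes sum(z ln p + (1 - z) ln(1 - p))."""
    return sum(z_hat) / len(z_hat)


# Hypothetical density values for two genes.
z_hat = e_step(0.1, p_a=[2.0, 0.5], p_0=[0.5, 2.0])
p_new = m_step_p(z_hat)
```

Iterating the two steps, together with a numerical maximization over the remaining parameters, gives the fitted model.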
![Page 211: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/211.jpg)
EBarrays: Model Diagnostics (Marginal Densities)
![Page 212: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/212.jpg)
EBarrays: Gene Specific Summary
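$$\text{odds} = \frac{P(z_i = 1 \mid D)}{P(z_i = 0 \mid D)}$$
$$P(z_i = 1 \mid D) = \int_0^1 P(z_i = 1 \mid x_i, y_i, p)\, P(p \mid D)\, dp$$
$$\text{odds} \approx \frac{p_A(x_i, y_i)}{p_0(x_i, y_i)} \cdot \frac{\hat{p}}{1 - \hat{p}}$$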
![Page 213: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/213.jpg)
EBarrays: Contour Plots of Odds
![Page 214: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/214.jpg)
EBarrays: Model Diagnostics (Gamma QQ plots on 4 group comparison - DMBA treated)
![Page 215: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/215.jpg)
EBarrays: Model Diagnostics (CV plots on 4 group comparison - DMBA treated)
![Page 216: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/216.jpg)
EBarrays: Results on 4 group comparison (DMBA treated)
![Page 217: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/217.jpg)
EBarrays: Results on 4 group comparison (DMBA treated)
Gene ID COP CI CII WF P0 P1 P2 P3
J00801 3066 4777 995 9083 0.05 0.95 0 0
0.04 0.96 0 0
L08100 4368 1278 14162 0 1 0 0
0 1 0 0
J00772 392 122 679 0.04 0.96 0 0
0.97 0.02 0.00 0.01
![Page 218: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/218.jpg)
EBarrays: Threshold
• The rule “classify into the pattern of expression with the highest posterior probability” is the rule which minimizes the posterior expected number of false positives and negatives (under 0-1 loss).
• For two conditions, this is the same as “classify into the pattern of expression with posterior probability > 0.5 ”.
• EBarrays reports posterior probabilities; user can decide on threshold.
![Page 219: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/219.jpg)
EBarrays: Threshold (Meng Chen)
• 500, 1000 and 2000 genes
• P(DE) = 0.05, 0.1, …, 0.5
• 2 conditions with 20 samples each.
• EBarrays, non-informative prior, 5 iterations, 0-1 loss.
• Each point is the average of pFDR for 10 runs.
• The plane is the Linear Model fit.
• pFDR increases when P(DE) becomes larger, but is still relatively low!
![Page 220: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/220.jpg)
EBarrays: Some Ideas on pFDR control (Meng Chen)
• In a Bayesian framework, one chooses a loss function to specify the relative cost of a false positive to a false negative; then, a rule is derived to minimize the Bayes risk.
• As stated previously, EBarrays reports posterior probabilities; the user decides on the threshold (under 0-1 loss, the rule is to take the pattern with the highest posterior probability).
• The posterior expected false discovery rate can be controlled by adjusting the threshold.
• For example, one can decide beforehand at what level to control the pFDR and then use the rule that controls it at that level.
• Which one makes more sense?
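One way to make "adjusting the threshold" concrete is sketched below. It assumes the posterior expected FDR of a rejection list is the average of 1 − Pr(DE | data) over the genes on the list; that definition, the function names, and the posterior values are assumptions for illustration, not the procedure used in the simulations reported on these slides.

```python
def expected_fdr(post_de, threshold):
    """Posterior expected FDR of the list {g : Pr(DE_g | data) > threshold}."""
    rejected = [p for p in post_de if p > threshold]
    if not rejected:
        return 0.0  # an empty list makes no false discoveries
    return sum(1.0 - p for p in rejected) / len(rejected)

def threshold_for_target(post_de, target):
    """Smallest candidate threshold whose list keeps expected FDR <= target."""
    for t in [0.0] + sorted(set(post_de)):
        if expected_fdr(post_de, t) <= target:
            return t

# Hypothetical posterior probabilities of DE for six genes.
post_de = [0.99, 0.95, 0.90, 0.60, 0.30, 0.05]
chosen = threshold_for_target(post_de, target=0.10)
print(chosen, expected_fdr(post_de, chosen))
```

Raising the threshold drops the least certain genes first, so the expected FDR of the remaining list can only go down; the scan therefore stops at the longest list that meets the target.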
![Page 221: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/221.jpg)
EBarrays: Which one makes more sense ? (Meng Chen)
The threshold is the deciding point in EBarrays. Reject null if Pr(DE|data) > threshold.
• 1000 genes, 2 conditions with 20 samples each. P(DE) = 0.2
• Increasing the threshold appears to decrease FDR.
• A large threshold corresponds to a bigger penalty on false positives, which makes sense.
![Page 222: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/222.jpg)
EBarrays: Which one makes more sense ? A Simulation Study (Meng Chen)
• Simulations were carried out to compare the risk resulting from EBarrays and BH.
• 1000 genes, 2 conditions of 10 samples each.
• ~ N(,1). = 2,3,4,5.
• The risk is a function of the true proportion of DE genes.
• Didn’t replicate the runs.
![Page 223: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/223.jpg)
EBarrays: Which one makes more sense ? Results (Meng Chen)
![Page 224: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/224.jpg)
• Hierarchical model was developed to identify significant differential expression.
• Model accounts for measurement error process and for natural fluctuations in absolute expression levels.
• Multiple conditions are handled in the same way as two conditions (no extra work required!).
• Threshold can be adjusted to target a specific pFDR.
• R-library available at www.biostat.wisc.edu/~kendzior/ (soon in Bioconductor).
• In addition to identifying DE genes, EBarrays provides improved (shrinkage) estimates of expression.
EBarrays: Comments on Empirical Bayes Approach
![Page 225: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/225.jpg)
Identifying DE genes: General Approach
• Use data to evaluate a test statistic (determine a gene specific summary) for every gene.
  * Could use only data at that gene (Dudoit et al.)
  * Could use data from all genes (Newton et al., Storey et al.)
• Evaluate method (model) used to generate test statistics.
  * Were assumptions reasonable?
  * Does model fit well?
  * Does it provide additional information?
• Perform hypothesis test at each gene.
  * Determine threshold.
  * Perhaps adjust for multiple tests.
![Page 226: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/226.jpg)
EBarrays: Shrinkage Estimates of Fold Change
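The posterior distribution of true differential expression $\rho_i$ at a given spot:

$$p(\rho_i \mid x_i, y_i, a, a_0, \nu) \propto \rho_i^{-(a + a_0 + 1)} \left( \frac{1}{\rho_i} + \frac{y_i + \nu}{x_i + \nu} \right)^{-2(a + a_0)}$$

$$\hat{\rho}_i = \frac{x_i + \nu}{y_i + \nu}$$

Use marginal maximum likelihood to determine $(a, a_0, \nu)$.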
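A small numerical sketch of what shrinkage does to a fold change. It assumes the shrunken estimate has the form (x + ν)/(y + ν), as in the gamma-gamma model of Newton et al.; the value of ν and the intensities below are hypothetical, whereas in practice ν would come from the marginal maximum likelihood fit.

```python
def naive_fold_change(x, y):
    """Raw ratio of the two intensities at a spot."""
    return x / y

def shrunken_fold_change(x, y, nu):
    """Ratio after adding the same constant nu to both intensities (assumed form)."""
    return (x + nu) / (y + nu)

nu = 50.0  # hypothetical value, normally estimated from all spots
low_intensity = (20.0, 5.0)        # naive fold change 4, weak signal
high_intensity = (2000.0, 500.0)   # naive fold change 4, strong signal

for x, y in (low_intensity, high_intensity):
    print(x, y, naive_fold_change(x, y), round(shrunken_fold_change(x, y, nu), 3))
```

Both spots have the same naive ratio, but the low-intensity spot is pulled much closer to 1, reflecting the larger relative noise at low intensities.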
![Page 227: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/227.jpg)
EBarrays: Shrinkage Plots
![Page 228: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/228.jpg)
EBarrays: Shrinkage Estimates Provide Error Reduction
![Page 229: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/229.jpg)
EBarrays: Shrinkage Estimates of Expression Re-rank Genes
![Page 230: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/230.jpg)
CLASSIFICATION AND CLUSTERING
![Page 231: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/231.jpg)
Problems to be Addressed via Classification or Clustering Methods
Unsupervised Learning: Identification of new groups of profiles or genes
Hierarchical clustering analysis, SVD, PCA, K-means, Model Based Approaches, SOM*
* (Golub et al. implementation uses both unsupervised and supervised methods).
Supervised Learning: Classification into known classes (usually done on profiles)
Discriminant Analysis methods
Variable Selection: Identification of predictors (usually genes) that characterize known profile classes
![Page 232: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/232.jpg)
Why cluster microarray data ??
11.03 1.22 0.92 2.61 -0.29 1.31 1.15 0.54 1.98 10.70
0.00 10.61 2.40 2.16 0.60 -0.22 1.64 0.89 10.64 0.21
0.12 0.46 10.30 0.14 1.56 2.29 1.30 11.18 2.06 0.14
-0.92 0.97 -0.11 10.78 2.57 2.26 6.64 1.39 1.22 2.13
0.59 0.29 0.30 2.14 9.83 10.42 -0.50 1.66 0.29 0.46
2.14 2.19 2.19 0.01 9.93 9.84 -0.57 -0.52 1.36 0.48
-0.77 0.65 1.51 10.39 0.90 0.92 10.26 0.69 2.13 -0.10
0.35 0.07 7.66 0.18 0.70 0.51 2.24 9.99 2.26 -0.15
1.67 11.30 1.48 1.89 1.29 0.72 0.39 0.94 8.41 1.20
11.26 0.43 2.12 1.10 1.40 1.48 2.04 0.96 0.93 8.55
![Page 233: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/233.jpg)
Simple Answer: To recognize patterns that aren’t easy to see.
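To illustrate the point, the sketch below takes the 10 x 10 matrix from the slide and, for each row, finds the other row that is closest in Euclidean distance. This nearest-neighbour lookup is a stand-in chosen for brevity, not a method named on the slide.

```python
import math

X = [
    [11.03, 1.22, 0.92, 2.61, -0.29, 1.31, 1.15, 0.54, 1.98, 10.70],
    [0.00, 10.61, 2.40, 2.16, 0.60, -0.22, 1.64, 0.89, 10.64, 0.21],
    [0.12, 0.46, 10.30, 0.14, 1.56, 2.29, 1.30, 11.18, 2.06, 0.14],
    [-0.92, 0.97, -0.11, 10.78, 2.57, 2.26, 6.64, 1.39, 1.22, 2.13],
    [0.59, 0.29, 0.30, 2.14, 9.83, 10.42, -0.50, 1.66, 0.29, 0.46],
    [2.14, 2.19, 2.19, 0.01, 9.93, 9.84, -0.57, -0.52, 1.36, 0.48],
    [-0.77, 0.65, 1.51, 10.39, 0.90, 0.92, 10.26, 0.69, 2.13, -0.10],
    [0.35, 0.07, 7.66, 0.18, 0.70, 0.51, 2.24, 9.99, 2.26, -0.15],
    [1.67, 11.30, 1.48, 1.89, 1.29, 0.72, 0.39, 0.94, 8.41, 1.20],
    [11.26, 0.43, 2.12, 1.10, 1.40, 1.48, 2.04, 0.96, 0.93, 8.55],
]

def nearest_row(i):
    """Index of the row closest to row i in Euclidean distance."""
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: math.dist(X[i], X[j]))

partners = [nearest_row(i) for i in range(len(X))]
print(partners)
```

Each row turns out to have a mirror-image partner (first with last, second with second-to-last, and so on), a structure that is hard to spot by eye in the raw numbers.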
![Page 234: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/234.jpg)
Why cluster microarray data ??
Current methods for classifying human tumors rely on a variety of morphological, clinical, and molecular variables.
There are still uncertainties in diagnosis.
Existing tumor classes are most likely heterogeneous.
Microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression profiles on a genomic scale.
This may lead to more reliable classification of tumors !!
Nice motivation provided by Dudoit et al., 2002, JASA (cancer is used to illustrate)
![Page 235: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/235.jpg)
Clustering Algorithms (Unsupervised)
“Model Free” algorithms (aka combinatorial algorithms) directly assign each observation to a group or cluster without consideration of an underlying probability model.
* Popular
* Intuitive
* (Fairly) Easy to Implement
“Model Based” algorithms: many assume data are i.i.d. from some population with pdf f where f is a mixture of component density functions. Each component describes one of the clusters. The model can be fit using ML or Bayesian methods.
![Page 236: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/236.jpg)
Model Free Algorithms
The most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data.
Each observation is labelled i ∈ {1,2,...,N}
Each cluster is labelled k ∈ {1,2,...,K} (K < N)
Each observation is assigned to one (and only one) cluster.
C: i -> k ( C (i) = k )
Consider the distance d(xi, xi’) between every pair of observations.
Find C* that achieves some goal (i.e., minimize some summary function of the d’s).
![Page 237: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/237.jpg)
K - Means
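Goal: Minimize W(C)

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \lVert x_i - x_{i'} \rVert^2$$

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \lVert x_i - x_{i'} \rVert^2$$

Algorithm:
1. For a given C, calculate the means of each cluster {m_1, m_2, ..., m_K}
2. Assign each data value to the cluster with the closest cluster mean:

$$C(i) = \arg\min_{1 \le k \le K} \lVert x_i - m_k \rVert^2$$

3. Repeat 1 & 2 until convergence.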
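The iteration can be sketched in a few lines of plain Python. This is a minimal illustration, assuming squared Euclidean distance and starting from user-supplied initial centers (equivalent to starting from an initial assignment C); the data points and function names are made up for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def within_scatter(data, assign):
    """W(C): half the sum of squared distances between same-cluster pairs."""
    return 0.5 * sum(sq_dist(a, b)
                     for a, ca in zip(data, assign)
                     for b, cb in zip(data, assign) if ca == cb)

def k_means(data, init_centers, max_iter=100):
    centers = [list(c) for c in init_centers]  # copy so the input is untouched
    assign = None
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest mean.
        new_assign = [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
                      for x in data]
        if new_assign == assign:  # assignments stopped changing
            break
        assign = new_assign
        # Recompute the mean of every non-empty cluster.
        for k in range(len(centers)):
            members = [x for x, c in zip(data, assign) if c == k]
            if members:
                centers[k] = mean(members)
    return assign, centers

data = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
assign, centers = k_means(data, init_centers=[data[0], data[3]])
print(assign, centers, within_scatter(data, assign))
```

Each pass can only lower W(C), which is why the loop terminates, but the final answer depends on the starting centers.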
![Page 238: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/238.jpg)
Comments on K - Means
1) Uses quantitative data values
2) No ordering of objects within a cluster.
3) Number of clusters, K, must be chosen in advance
4) As K changes, cluster membership can change in arbitrary ways. (i.e., clusters need not be nested).
5) The algorithm on the previous page guarantees convergence, but convergence may be to a local min.
![Page 239: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/239.jpg)
Comments on K - Means
(4). To choose K, oftentimes different values of K are considered.
[Plot: Distance against K]
(5). To identify if the min is local, one should start from many different configurations (never guaranteed).
![Page 240: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/240.jpg)
Final Note on K - Means
K-means is similar to K-medoids
In K-medoids, instead of finding mean values of clusters, “centers” of clusters are found.
“Centers” are points that minimize the total distance to other points in that cluster.
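The difference between a mean and a medoid is easy to see on a small example. The sketch below assumes Euclidean distance and uses hypothetical points, one of which is an outlier.

```python
import math

def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

def centroid(points):
    """Coordinate-wise mean, as used by K-means."""
    return [sum(col) / len(points) for col in zip(*points)]

cluster = [[1.0, 1.0], [2.0, 1.0], [1.5, 1.2], [9.0, 9.0]]  # last point is an outlier
print(medoid(cluster), centroid(cluster))
```

The medoid is always one of the observations, while the mean is dragged toward the outlier and need not coincide with any observation.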
![Page 241: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/241.jpg)
Hierarchical Clustering
Specify measure of distance (dissimilarity) between pairs of observations (ie, construct a distance matrix).
Produce hierarchical representations in which clusters at each level are created by merging clusters at the next lower level.
Lowest Level: Each cluster is a single observation.
Highest Level: One cluster contains all data.
![Page 242: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/242.jpg)
Hierarchical Clustering (continued)
Bottom up: Start at lowest level (single observations) and merge selected pair (most similar) into cluster. Using definition of distance between observations and clusters, continue...
Top down: Recursively split data.
Each level of the hierarchy represents particular grouping of the data into disjoint clusters of observations
Generally, significance measures on clusters are not considered. Exception: Fraley & Raftery, UW-TR, 1998.
![Page 243: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/243.jpg)
Bottom Up Clustering: 3 common approaches
Single Linkage: Distance taken to be minimum distance among all pairwise distances.

$$d_{SL}(G, H) = \min_{i \in G,\; i' \in H} d_{ii'}$$

Complete Linkage: Distance taken to be maximum distance among all pairwise distances.

$$d_{CL}(G, H) = \max_{i \in G,\; i' \in H} d_{ii'}$$

Average Linkage: Distance measure is averaged across all pairwise distances.

$$d_{AL}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$$
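The three linkage rules, plus a bare-bones bottom-up merge loop, can be sketched as follows. The one-dimensional points and the absolute-difference distance are hypothetical choices that keep the arithmetic easy to check by hand.

```python
def single_linkage(G, H, d):
    return min(d(a, b) for a in G for b in H)

def complete_linkage(G, H, d):
    return max(d(a, b) for a in G for b in H)

def average_linkage(G, H, d):
    return sum(d(a, b) for a in G for b in H) / (len(G) * len(H))

def agglomerate(points, linkage, d):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], d))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

def abs_diff(a, b):
    return abs(a - b)

G, H = [0.0, 1.0], [3.0, 7.0]
print(single_linkage(G, H, abs_diff),
      complete_linkage(G, H, abs_diff),
      average_linkage(G, H, abs_diff))
merges = agglomerate([0.0, 1.0, 3.0, 7.0], single_linkage, abs_diff)
print(merges)
```

Swapping the `linkage` argument is the only change needed to move between the three approaches, which makes it easy to compare the resulting merge orders on the same data.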
![Page 244: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/244.jpg)
Bottom Up Clustering: Comments
If data exhibits strong clustering features (quantified by measure d) and each of the clusters is well separated from the others, then the 3 methods will produce similar results.
With Single Linkage, there is a tendency to combine observations linked by a series of close intermediate observations (“chaining”). The clusters might not be compact.
With Complete Linkage, compact clusters are obtained; but the distance between clusters might be small.
![Page 245: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/245.jpg)
Hierarchical Clustering: Eisen et al., PNAS, 1998
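For genes i and j,

$$d(x_i, x_j) = \frac{1}{N} \sum_{k=1}^{N} \left( \frac{x_{i,k} - x_{i,OS}}{\Phi_{x_i}} \right) \left( \frac{x_{j,k} - x_{j,OS}}{\Phi_{x_j}} \right)$$

$$\Phi_G = \sqrt{ \sum_{i=1}^{N} \frac{(G_i - G_{OS})^2}{N} }$$

G_OS = 0; Note that when G_OS is the mean of observations on G, d is the correlation coefficient.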
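A short sketch of this similarity measure, assuming the standardized-product form from Eisen et al. with a per-gene offset. The function name and the example profiles are hypothetical.

```python
import math

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Average product of offset-adjusted, scaled values for two gene profiles."""
    n = len(x)
    phi_x = math.sqrt(sum((v - x_offset) ** 2 for v in x) / n)
    phi_y = math.sqrt(sum((v - y_offset) ** 2 for v in y) / n)
    return sum(((a - x_offset) / phi_x) * ((b - y_offset) / phi_y)
               for a, b in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 4.0, 5.0, 6.0]  # same shape as x, shifted up by 2

uncentered = eisen_similarity(x, y)  # offsets of 0
centered = eisen_similarity(x, y, sum(x) / len(x), sum(y) / len(y))  # offsets = means
print(uncentered, centered)
```

With the offsets set to the gene means the measure is the usual correlation coefficient, so the shifted profile scores 1; with offsets of 0 the shift lowers the similarity slightly.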
![Page 246: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/246.jpg)
Hierarchical Clustering: Eisen et al.
![Page 247: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/247.jpg)
Comments on Hierarchical Clustering
Intuitive
Biological information can be (but often is not) incorporated into measures of distance
Nice as a descriptive or diagnostic tool
Cluster order is arbitrary
Where one “cuts” the tree is arbitrary
No confidence measures are applied to clusters.
CAUTION!
![Page 248: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/248.jpg)
Model Based Approaches for Unsupervised Clustering
Model Assumptions are made
Confidence or Probability of genes in particular groups can be assessed.
*** Many methods to identify DE genes can be thought of as model based clustering approaches where clustering is done using gene specific summaries.
![Page 249: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/249.jpg)
EBarrays: Results on 4 group comparison (DMBA treated)
![Page 250: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/250.jpg)
EBarrays: Data
10 Affy chips from non-treated; 14 from treated (DMBA)
![Page 251: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/251.jpg)
EBarrays: Identification of a few interesting genes
Gene ID COP CI CII WF P0 P1 P2 P3
J00801 3066 4777 995 9083 0.05 0.95 0 0
0.04 0.96 0 0
L08100 4368 1278 14162 0 1 0 0
0 1 0 0
J00772 392 122 679 0.04 0.96 0 0
0.97 0.02 0.00 0.01
![Page 252: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/252.jpg)
POE: Probability of Expression
A Statistical Framework for expression based molecular classification in cancer
by G. Parmigiani, E.S. Garrett, R. Anbazhagan, E. Gabrielson
Journal of the Royal Statistical Society 64: 717-736 (with discussion), 2002.
**Additional Details in Parmigiani et al., 2003.
![Page 253: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/253.jpg)
POE: Probability of Expression
![Page 254: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/254.jpg)
POE:
Model gene expression using latent categories (on, off , baseline)
Use model to
1. Remove noise prior to clustering
2. Define molecular subclasses
3. Determine probability that particular gene is in a class.
![Page 255: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/255.jpg)
POE:
Model gene expression using latent categories (on, off , baseline)
Use model to
1. Remove noise prior to clustering
2. Define molecular subclasses
3. Determine probability that particular gene is in a class
over, under, neutral
![Page 256: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/256.jpg)
POE:
For each gene and tumor, calculate the probability that the gene is expressed at baseline, over-expressed, or under-expressed in that tumor.
Identify clusters of genes based on this probability. Identify representative (“seed”) genes within each cluster.
Identify patterns (“profiles”) of expression across seed genes. For each tumor, calculate the posterior probability of each expression pattern (“profile”).
THIS CLASSIFIES TUMORS !
[Diagram: expression matrix with m genes as rows (g_1, g_2, ..., g_m) and N tumors as columns (t_1, t_2, ..., t_N)]
![Page 257: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/257.jpg)
POE: Basic Idea Behind Model
A given gene g can be over-expressed, under-expressed, or neutral.
Suppose there are K tumor classes
If gene g* is related to tumor class, then the distribution of expression values of g* will be different in at least one of the K classes.
Currently assumes classes are not known (unsupervised), but they are working on extending this.
![Page 258: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/258.jpg)
POE
• Notation:

$$e_{gt} = \begin{cases} -1 & \text{gene } g \text{ has abnormally low expression in tumor } t \\ 0 & \text{gene } g \text{ has normal expression in tumor } t \\ 1 & \text{gene } g \text{ has abnormally high expression in tumor } t \end{cases}$$

• Modeling observed gene expression, $a_{gt}$:

$$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \quad e \in \{-1, 0, 1\}$$

• For gene g, the proportions of differentially expressed tumors in the population of unclassified tumors are

$$\pi_g^{+} = P(e_{gt} = 1), \qquad \pi_g^{-} = P(e_{gt} = -1)$$
Garrett JSM 2002
![Page 259: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/259.jpg)
POE: Quantities of Interest
$$p_{gt}^{+} = P(e_{gt} = 1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{1,g}, f_{0,g}) = \frac{\pi_g^{+} f_{1,g}(a_{gt})}{\pi_g^{+} f_{1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$$

Interpretation: The probability that gene g in tumor t is over expressed given observed expression and the model parameters

$$p_{gt}^{-} = P(e_{gt} = -1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{-1,g}, f_{0,g}) = \frac{\pi_g^{-} f_{-1,g}(a_{gt})}{\pi_g^{-} f_{-1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$$

Interpretation: The probability that gene g in tumor t is under expressed given observed expression and the model parameters
Garrett JSM 2002
![Page 260: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/260.jpg)
POE: Distributional Assumptions
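$$f_{-1,g}(\cdot) = U(-\kappa_g^{-} + \alpha_t + \mu_g,\; \alpha_t + \mu_g)$$

$$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\; \sigma_g)$$

$$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\; \alpha_t + \mu_g + \kappa_g^{+})$$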
Empirical Bayes Approach: Could put priors on unknowns and integrate to give predictive distributions. Then maximize the marginal likelihood to identify unknowns.
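To show how the quantities of interest follow from these distributional assumptions, here is a minimal sketch that computes the three class probabilities for one observed expression value. It assumes a uniform density below the center for the low class, a normal density at the center for the normal class, and a uniform density above the center for the high class. All parameter values are hypothetical, the second normal parameter is treated as a standard deviation, and the parameters are simply plugged in rather than estimated.

```python
import math

def uniform_pdf(a, lo, hi):
    return 1.0 / (hi - lo) if lo <= a <= hi else 0.0

def normal_pdf(a, mean, sd):
    return math.exp(-0.5 * ((a - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def poe_probabilities(a, center, sd, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Return (p_minus, p_zero, p_plus) for one observed expression value a.

    center plays the role of the sample effect plus the gene effect.
    """
    w_minus = pi_minus * uniform_pdf(a, center - kappa_minus, center)
    w_zero = (1.0 - pi_minus - pi_plus) * normal_pdf(a, center, sd)
    w_plus = pi_plus * uniform_pdf(a, center, center + kappa_plus)
    total = w_minus + w_zero + w_plus
    return w_minus / total, w_zero / total, w_plus / total

params = dict(center=0.0, sd=1.0, kappa_minus=10.0, kappa_plus=10.0,
              pi_minus=0.1, pi_plus=0.1)
far_above = poe_probabilities(4.0, **params)
near_center = poe_probabilities(0.5, **params)
print(far_above, near_center)
```

Because the two uniform densities sit on opposite sides of the center, at most one of them is nonzero away from the center itself, so normalizing over all three classes gives the same answer as a denominator that only involves the relevant tail and the normal class.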
![Page 261: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/261.jpg)
POE: Distributional Assumptions
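$\alpha_t$: Sample expression for normal expression levels.
$\mu_g$: gene effect in gene g for normal expression.
$(\kappa_g^{+} / \sigma_g) > r$ where r is approximately 5.
[Figure: densities $f_{-1,g}$, $f_{0,g}$, and $f_{1,g}$, with $\alpha_t + \mu_g$ labelled]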
![Page 262: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/262.jpg)
POE: Distributional Assumptions
After fitting, parameters can be used to “de-noise” data or to cluster genes.
$p_{gt}^{+}$, $p_{gt}^{-}$, and $p_{gt}^{0}$ are the most important quantities.
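$$\mu_g \mid \theta_\mu, \tau_\mu \sim N(\theta_\mu, \tau_\mu)$$

$$\sigma_g^{-2} \mid \gamma, \lambda \sim G(\gamma, \lambda)$$

$$\kappa_g^{-} \mid \theta_\kappa^{-} \sim E(\theta_\kappa^{-})$$

$$\kappa_g^{+} \mid \theta_\kappa^{+} \sim E(\theta_\kappa^{+})$$

$$\text{logit}(\pi_g^{-}) \mid \theta_\pi^{-}, \tau_\pi^{-} \sim N(\theta_\pi^{-}, \tau_\pi^{-})$$

$$\text{logit}(\pi_g^{+}) \mid \theta_\pi^{+}, \tau_\pi^{+} \sim N(\theta_\pi^{+}, \tau_\pi^{+})$$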
![Page 263: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/263.jpg)
POE: “De-noised” measures of expression
a g t ~ N( gt, g)
Eappa gtgtgtgtgtgtgt (|,)()()
Normal class: g t = t + g with g unknown
Elevated class: g t - t - g ~ U ( 0, k g+)
Low expression class: g t - t - g ~ U ( k g-, 0)
Posterior means of gt can be used as estimates of expression values.
![Page 264: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/264.jpg)
POE: Cluster genes
1. Choose DE pattern of interest. Indicate what proportion of genes are over-expressed and under-expressed.
2. For each gene, using $p_{gt}^{+}$ and $p_{gt}^{-}$, calculate the probability that the samples have a pattern of the type specified. Sort by this probability.

$$P(e_{g1}, \ldots, e_{gT} \mid \cdot) = \prod_{t} (p_{gt}^{+})^{I(e_{gt} = 1)} \, (p_{gt}^{-})^{I(e_{gt} = -1)} \, (1 - p_{gt}^{+} - p_{gt}^{-})^{I(e_{gt} = 0)}$$

3. Calculate a J x J matrix of “gene agreement”:

$$r_{gm} = \frac{1}{I} \sum_{i=1}^{I} \left( p_{gi}^{+} p_{mi}^{+} + p_{gi}^{-} p_{mi}^{-} + p_{gi}^{0} p_{mi}^{0} \right)$$
![Page 265: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/265.jpg)
POE: Cluster genes
4. Identify genes with “high coherence”
5. Out of this group, pick gene with the highest probability calculated from step 2 (seed gene).
6. Group genes if they are similar to the seed gene.
7. Remove this group and repeat.
Should repeat process using different initial patterns.
Results in some number of seed genes (for example, say 4).
![Page 266: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/266.jpg)
POE: Comments on seed genes (pg. 379 of Parmigiani et al. text)
Any number of seed genes can be used to create a collection of profiles. s genes give 3^s possible profiles.
“ Profiles based on 4 or more genes are seldom required with the sample sizes and signal-to-noise ratio achievable.”
![Page 267: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/267.jpg)
POE: Creating molecular profiles
For each tumor t, calculate the posterior probabilities of each expression profile using $p_{sg1,t}^{+}$, $p_{sg1,t}^{-}$, $p_{sg1,t}^{0}$, ...
This classifies tumors !
Each of the 4 seed genes could be under-expressed (-1), over-expressed (1), or neither (0). This gives 3^4 = 81 patterns of expression (or profiles).

| | P1 | P2 | P3 | ...... | P81 |
|---|---|---|---|---|---|
| sg1 | 0 | 0 | -1 | | 1 |
| sg2 | 0 | 1 | 0 | | 1 |
| sg3 | 0 | 0 | -1 | | 1 |
| sg4 | 0 | 0 | 1 | | 1 |
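The enumeration of profiles can be sketched directly. The example below scores every profile for one tumor by multiplying the per-gene class probabilities, which assumes the seed genes are independent given the fitted model; that assumption, along with the probability values, is for illustration only and is not stated on the slides.

```python
from itertools import product

STATES = (-1, 0, 1)  # under-expressed, neither, over-expressed

def profile_probabilities(seed_probs):
    """Probability of every profile for one tumor.

    seed_probs holds one dict per seed gene mapping each state to its
    probability in this tumor.
    """
    result = {}
    for profile in product(STATES, repeat=len(seed_probs)):
        prob = 1.0
        for state, probs in zip(profile, seed_probs):
            prob *= probs[state]
        result[profile] = prob
    return result

# Hypothetical class probabilities for seed genes sg1..sg4 in a single tumor.
tumor = [
    {-1: 0.05, 0: 0.15, 1: 0.80},
    {-1: 0.10, 0: 0.85, 1: 0.05},
    {-1: 0.90, 0: 0.08, 1: 0.02},
    {-1: 0.02, 0: 0.18, 1: 0.80},
]
profiles = profile_probabilities(tumor)
best = max(profiles, key=profiles.get)
print(len(profiles), best, profiles[best])
```

Assigning each tumor to its most probable profile is one way to turn these probabilities into a classification of tumors.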
![Page 268: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/268.jpg)
POE: Quotes from Garrett & Parmigiani in Parmigiani et al. text 2003
“ The benefit of [the Bayesian hierarchical modeling] approach is that it borrows strength across genes using the entire genomic distribution instead of fitting a separate independent model for each gene.”
“Hierarchical Bayesian models have been shown to have appealing properties in estimation of large vectors of related quantities.”
![Page 269: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/269.jpg)
Comments on POE
Unsupervised, model based clustering approach. The model accounts for measurement errors.
NOT intended for gene clustering.
Uses scale-independent measures of expression which allows combination of data across platforms
Defines a molecular profile based on a small number of genes. This could be useful clinically.
![Page 270: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/270.jpg)
TIME SERIES ANALYSIS OF MICROARRAY DATA
![Page 271: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/271.jpg)
Analysis of Microarray Time Series
Manuscript in Progress
by
M. Yuan and C.M. Kendziorski
![Page 272: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/272.jpg)
Methods for Microarray Time Series Data
Every method that we know considers TS data in one condition. General goal is to cluster genes with similar expression patterns over time.
We consider TS data in multiple conditions. Group genes based on differential expression patterns over time.
![Page 273: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/273.jpg)
Microarray Time Series: Example
• Two treatments
• 6 time points: 0, 2, 6, 24, 48, 120 Hours
• Number of Genes: 12625
![Page 274: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/274.jpg)
Microarray Time Series: Apply EBarrays at each time point
• Differentially expressed (DE) genes identified by EBarrays at each time point:
Time point:  0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
DE genes:    0      9       0       28       170      333
• From 24 hrs to 48 hrs, and from 48 hrs to 120 hrs:
Time point   Pr(DE|DE)   Pr(DE)
48 Hrs       6/28        170/12625
120 Hrs      36/170      333/12625
![Page 275: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/275.jpg)
Microarray Time Series: Correlation
• What if there is no correlation?
Pr(DE|DE)=Pr(DE)
• Why should we care about correlation?
Posterior odds of DE: Pr(DE) f(x|DE) / [Pr(EE) f(x|EE)]
If Pr(DE) is large, it is easier to claim DE
If Pr(DE) is small, it is harder to claim DE
![Page 277: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/277.jpg)
Microarray Time Series: How much information did we lose ?
• Given that a gene is DE at 24 Hours
We could claim DE at 48 Hours if f(x|DE) > 3.67 f(x|EE)
If we do not consider correlation, we claim DE at 48 Hours only if f(x|DE) > 73.26 f(x|EE)
• Given that a gene is DE at 48 Hours
We could claim DE at 120 Hours if f(x|DE) > 3.72 f(x|EE)
If we do not consider correlation, we claim DE at 120 Hours only if f(x|DE) > 36.91 f(x|EE)
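As a rough check, the thresholds can be recomputed from the EBarrays counts on the earlier slide. The sketch below assumes the decision rule "claim DE when the posterior odds exceed 1", which gives a likelihood-ratio threshold of (1 − p)/p for a prior or conditional probability p of DE. The function name `de_threshold` and the dictionary labels are illustrative only. Under this rule the marginal 48-hour threshold is (12625 − 170)/170 ≈ 73.26.

```python
from fractions import Fraction


def de_threshold(prob_de):
    """Likelihood ratio f(x|DE)/f(x|EE) above which the posterior odds exceed 1."""
    return (1 - prob_de) / prob_de


TOTAL_GENES = 12625  # number of genes in the example

# Probabilities of DE taken from the counts on the EBarrays slide.
cases = {
    "48 h | DE at 24 h": Fraction(6, 28),
    "48 h, marginal": Fraction(170, TOTAL_GENES),
    "120 h | DE at 48 h": Fraction(36, 170),
    "120 h, marginal": Fraction(333, TOTAL_GENES),
}

thresholds = {label: float(de_threshold(p)) for label, p in cases.items()}

for label, value in thresholds.items():
    print(f"{label}: claim DE if f(x|DE) > {value:.2f} f(x|EE)")
```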
![Page 278: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/278.jpg)
Microarray Time Series: HMM Model Structure
• Pattern process S[t]: pattern of expression at time t.
Treatment vs. control: DE or EE
Comparing treatments: e.g. 5 patterns for 3 treatments
Markov process
• Expression vector x[t1],…,x[tK]: K expression measurements observed at time t.
Distributed according to S[t]
Conditionally independent given S[t]
![Page 279: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/279.jpg)
Microarray Time Series: HMM Model Structure
[Figure: for each gene 1, 2, …, n, the hidden pattern sequence S[1], S[2], S[3], … (the HMM) generates the expression vectors x[11],…,x[1K]; x[21],…,x[2K]; x[31],…,x[3K]; …]
![Page 280: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/280.jpg)
Microarray Time Series: Options to Specify HMM
• Marginal expression distribution: use EBarrays
• Transition matrix Pr(S[t]|S[t-1]) free of time: homogeneous HMM
• Force Pr(DE|S[t-1]=DE) = Pr(DE|S[t-1]=EE): independent analysis
• Add the constraint that Pr(DE) is constant over time: homogeneous independent analysis
![Page 281: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/281.jpg)
Microarray Time Series: Estimation Using EM
• Infer the unknown pattern process from the observed expression data: Maximum a Posteriori (MAP), max Pr(S[t]=i|X)
• Unknowns:
Parameters associated with f(x|S)
Parameters associated with the pattern process
• EM algorithm
E step: parameters → pattern process (Baum-Welch)
M step: pattern process → parameters (MLE)
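The quantity Pr(S[t]=i|X) used above can be computed with the forward-backward recursions that sit inside Baum-Welch. The following is a minimal sketch for a single gene, not the authors' implementation: the function name `posterior_marginals`, the state coding (0 = EE, 1 = DE), and all numbers are assumptions, and the likelihoods f(x[t]|S[t]) are taken as given (in the slides they would come from the EBarrays marginal distributions).

```python
def posterior_marginals(init, trans, lik):
    """Forward-backward smoothing: Pr(S[t]=i | x[1..T]) for one gene.

    init[i]     : Pr(S[1]=i)
    trans[i][j] : Pr(S[t]=j | S[t-1]=i), assumed the same for all t
    lik[t][i]   : f(x[t] | S[t]=i), the likelihood of the data at time t
    """
    n_times, n_states = len(lik), len(init)

    # Forward pass, rescaled at each step to avoid numerical underflow.
    alpha = []
    for t in range(n_times):
        if t == 0:
            raw = [init[j] * lik[0][j] for j in range(n_states)]
        else:
            raw = [
                lik[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        total = sum(raw)
        alpha.append([r / total for r in raw])

    # Backward pass, also rescaled at each step.
    beta = [[1.0] * n_states for _ in range(n_times)]
    for t in range(n_times - 2, -1, -1):
        raw = [
            sum(trans[i][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(n_states))
            for i in range(n_states)
        ]
        total = sum(raw)
        beta[t] = [r / total for r in raw]

    # Combine and normalise within each time point.
    posterior = []
    for t in range(n_times):
        weights = [alpha[t][i] * beta[t][i] for i in range(n_states)]
        total = sum(weights)
        posterior.append([w / total for w in weights])
    return posterior


# Hypothetical inputs: state 0 = EE, state 1 = DE.
example = posterior_marginals(
    init=[0.9, 0.1],
    trans=[[0.9, 0.1], [0.3, 0.7]],
    lik=[[1.0, 1.0], [1.0, 20.0], [1.0, 20.0]],
)
for t, (p_ee, p_de) in enumerate(example, start=1):
    print(f"time {t}: Pr(DE | X) = {p_de:.3f}")
```

A full Baum-Welch E step also needs the pairwise posteriors Pr(S[t-1]=i, S[t]=j|X) to update the transition matrix; they use the same forward and backward quantities and are omitted here for brevity.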
![Page 282: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/282.jpg)
Microarray Time Series: Cluster Genes Into Patterns
• Maximum a posteriori
max Pr(S[t], t=1,…,T|X)
• Viterbi Algorithm
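For illustration, a compact Viterbi recursion for one gene is sketched below. It is a generic textbook version rather than the authors' code; the function name `viterbi`, the state coding (0 = EE, 1 = DE), and the example numbers are assumptions, and all probabilities and likelihoods are assumed strictly positive so that the logarithms are defined.

```python
import math


def viterbi(init, trans, lik):
    """Most probable pattern sequence S[1..T] given the data, for one gene.

    init[i]     : Pr(S[1]=i)
    trans[i][j] : Pr(S[t]=j | S[t-1]=i)
    lik[t][i]   : f(x[t] | S[t]=i)
    """
    n_times, n_states = len(lik), len(init)
    # Log-probability of the best path ending in each state at the current time.
    score = [math.log(init[i]) + math.log(lik[0][i]) for i in range(n_states)]
    back_pointers = []

    for t in range(1, n_times):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for arriving in state j at time t.
            best_i = max(range(n_states), key=lambda i: score[i] + math.log(trans[i][j]))
            pointers.append(best_i)
            new_score.append(
                score[best_i] + math.log(trans[best_i][j]) + math.log(lik[t][j])
            )
        score = new_score
        back_pointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical inputs: state 0 = EE, state 1 = DE.
best_path = viterbi(
    init=[0.9, 0.1],
    trans=[[0.9, 0.1], [0.3, 0.7]],
    lik=[[1.0, 1.0], [1.0, 20.0], [1.0, 20.0]],
)
print(best_path)
```

Unlike the time-by-time MAP rule, this maximises over the whole sequence jointly, so each gene is assigned to exactly one pattern path.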
![Page 283: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/283.jpg)
Microarray Time Series: Simulation Study
• 6 time points
• 2 treatments
• 1500 genes
• Proportion of DE at the first time point is 0.1
• Pr(DE|EE)=0.1
• Pr(DE|DE)=0.1, 0.5, 0.7
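The hidden DE/EE patterns for such a study could be generated as in the sketch below. Only the pattern process is simulated, because the slides do not state the expression distributions used; the function name `simulate_patterns`, its argument names, and the seed are illustrative assumptions.

```python
import random


def simulate_patterns(n_genes=1500, n_times=6, p_first=0.1,
                      p_de_given_ee=0.1, p_de_given_de=0.5, seed=0):
    """Simulate a two-state (EE/DE) Markov pattern process for each gene.

    Returns a list of n_genes paths; each path is a list of n_times booleans,
    where True means the gene is DE at that time point.
    """
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_genes):
        state = rng.random() < p_first  # DE status at the first time point
        path = [state]
        for _ in range(n_times - 1):
            p_de = p_de_given_de if state else p_de_given_ee
            state = rng.random() < p_de
            path.append(state)
        patterns.append(path)
    return patterns


# One simulated data set per value of Pr(DE|DE) considered in the study.
for p_stay in (0.1, 0.5, 0.7):
    paths = simulate_patterns(p_de_given_de=p_stay)
    de_counts = [sum(path[t] for path in paths) for t in range(6)]
    print(f"Pr(DE|DE)={p_stay}: DE genes per time point = {de_counts}")
```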
![Page 284: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/284.jpg)
Microarray Time Series: Simulation Results
P(DE|DE)  Method  Time 1   Time 2    Time 3    Time 4    Time 5    Time 6
0.1       EB      55/60    57/59     66/67     82/101    93/107    98/115
0.1       HMM     57/62    62/67     68/69     81/98     85/94     96/108
0.5       EB      95/106   92/102    123/142   126/136   139/151   137/145
0.5       HMM     97/109   116/129   139/155   145/155   152/173   146/160
0.7       EB      63/75    123/132   165/191   203/230   158/173   172/184
0.7       HMM     72/88    144/154   191/213   236/258   214/237   201/224
![Page 285: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/285.jpg)
Microarray Time Series: EBarrays vs. HMM
• Differentially expressed genes identified by the EBarrays-based HMM:
Time point:           0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
DE genes:             0      9       0       138      333      475
• How many more genes were identified than by EBarrays alone:
Time point:           0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
Additional DE genes:  0      0       0       110      163      142
![Page 286: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/286.jpg)
Microarray Time Series: Could it be so good ?
• Simulate data for the last two time points with parameters estimated from the HMM
• Performance comparison
Method   48 Hours   120 Hours
EB       259/269    374/390
HMM      302/321    469/498
![Page 287: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/287.jpg)
Microarray Time Series: Comments
• Correlation over time does exist in most studies.
• Taking correlation over time into account can significantly improve the efficiency of methods for identifying DE genes.
• HMMs provide a flexible way to model the correlation over time.
• The EBarrays-based HMM is a useful option for analyzing microarray time series data.
• Technical Report coming soon...
![Page 288: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/288.jpg)
EXPERIMENTAL DESIGN
![Page 289: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/289.jpg)
Experimental Design Questions and Overview
What to print/spot on the array ?
How many pieces of one gene ?
Replicates of a gene ?
Housekeeping or other control spots ?
How to arrange spots/genes on the array ?
Spatial Bias
Print Tip Bias
cDNA ?
Affy (done)
cDNA (~done)
Affy (varies ~11)
![Page 290: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/290.jpg)
Experimental Design Questions and Overview (continued)
What to hybridize onto the array ?
What is reference/control ?
How is labelling done ?
Should samples be pooled ?
cDNA ?
Affy (can be ?)
cDNA (Dye swaps ?)
Affy (done)
cDNA (?)
Affy (?)
![Page 291: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/291.jpg)
Actual Design
The actual design considers questions previously stated, but also includes issues of replication.
How many arrays should one use ?
How should samples be allocated to arrays ?
Answers to these questions will depend on three sources of variation: Biological, Technical, and Measurement Error.
In addition, the goal of the experiment should affect its design !
![Page 292: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/292.jpg)
Sources of Variation
Biological Variation: Subject to subject variation. Intrinsic to the organisms studied. This CAN NEVER be reduced, but its effect can perhaps be reduced (e.g. by pooling biological samples).
Technical Variation: Introduced during extraction, labelling, and hybridization. Quantified (estimated) by hybridizing multiple mRNA samples from the same individual to many arrays. Also called array to array variation (caution: so are other sources of variation).
Measurement Error: Introduced when reading signals. Measured within a single array. Multiple spots on one array can reduce the effect of measurement error.
![Page 293: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/293.jpg)
Loop Design
Graphical representation of the loop design indicates
Which differences can be estimated
Precision of the estimates
For example,
A and B can only be compared if there is a path from A to B
Here, A and B can be compared directly or through C. The direct comparison, log(A/B), is less variable than the indirect comparison
log(A/B) = log(A/C) + log(C/B)
[Figure: loop connecting A, B, and C]
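To see why the direct comparison is less variable, suppose (as a simplifying assumption not stated on the slide) that each array yields a log-ratio with independent error of common variance σ². Then:

```latex
\begin{align*}
\operatorname{Var}\bigl[\log(A/B)\bigr]_{\text{direct}} &= \sigma^2,\\
\operatorname{Var}\bigl[\log(A/C) + \log(C/B)\bigr]_{\text{indirect}} &= \sigma^2 + \sigma^2 = 2\sigma^2 .
\end{align*}
```

More generally, under the same assumption an estimate built from a path of k arrays has variance kσ², so longer paths in the design graph give less precise comparisons.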
![Page 294: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/294.jpg)
Graphical Representation of Loop Designs
[Figure: simplest loop design connecting A, B, and C; additional diagrams comparing A and B directly]
![Page 295: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/295.jpg)
Comparison Among Sources of mRNA
Consider sources A, B, and C to be of interest
Dudoit and Speed method (mult t) 3 times (AB, AC, BC). This is not optimal, but will give adjusted p-values.
ANOVA (Kerr et al., Wolfinger et al.) Better to use all the data at once, but there is no accounting for multiple tests with these approaches. Rank ordering provided might be useful.
EBarrays (Newton et al., Kendziorski et al.) Can handle multiple conditions and accounts for multiple tests. Must specify patterns. Computational issues.
![Page 296: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/296.jpg)
Evaluation of Designs
Each approach gives a gene specific summary score. The scores depend on biological, technical, and measurement error variability.
Different designs result in different allocations of the variance components.
Evaluation of designs is often done by considering the variability associated with the resulting gene expression estimates.
![Page 297: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/297.jpg)
Evaluation of Designs
log (A/B) = log (A/R) - log (B/R)
Yang & Speed, NRG, 2002
![Page 298: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/298.jpg)
Notes on Designs
Feasibility of Design III decreases as the number of conditions increases. With 6 samples, there are 15 pairwise comparisons.
Kerr and Churchill proposed loop designs. No longer strongly recommended (by many including Churchill). Some comparisons are less precise than others. Problems with robustness.
![Page 299: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/299.jpg)
A Closer Look at the Loop Design
Compare A with D
log (A/D) = log (A/B) + log (B/C) + log (C/D)
log (A/D) = log (A/F) + log (F/E) + log (E/D)
What if arrays C and E are bad ?
[Figure: loop design connecting A, B, C, D, E, and F]
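The robustness concern can be illustrated by treating the design as a graph whose edges are arrays and checking whether a path between two samples survives. The sketch below is illustrative only: it assumes a six-sample loop A-B-C-D-E-F-A and interprets "bad arrays" as one failed array on each of the two paths from A to D (hypothetically the B-C and F-E arrays); the function name `connected` is also an assumption.

```python
from collections import deque


def connected(edges, start, goal):
    """Breadth-first search: is there a path of usable arrays from start to goal?"""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Each edge is one array in the six-sample loop design.
loop = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A")]

# Hypothetical failures: one array on each of the two A-to-D paths.
failed = {("B", "C"), ("E", "F")}
usable = [edge for edge in loop if edge not in failed]

print("A vs D estimable with all arrays:", connected(loop, "A", "D"))
print("A vs D estimable after failures: ", connected(usable, "A", "D"))
```

With a single failed array the comparison is still possible through the other side of the loop; once both paths are broken, A and D can no longer be compared at all.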
![Page 300: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/300.jpg)
Table 2 from Yang & Speed
Yang & Speed, NRG, 2002
![Page 301: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/301.jpg)
Time Course Experiments
If main focus of the experiment is relative change between T2, T3, T4 and initial time point T1, then a reference design is good.
T1 T2 T3 T4
Here, all comparisons are made with equal efficiency.
If a stable reference is available (T1), this will allow comparisons to be made over a relatively long period of time.
![Page 302: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/302.jpg)
Author note on Reference Design
Gary Churchill has traditionally argued against a standard reference design since almost half the measurements are made on the reference sample (which might be of little or no interest) and the variance can be increased relative to other designs.
However, in his NG Reviews article (December 2002), he notes the advantages:
Paths connecting 2 samples are 2 steps long.
Good way to handle comparisons across time.
![Page 303: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/303.jpg)
Replication (Review and Note)
Technical Replicates: Yang and Speed also include measurement error here. They define such replicates as ones where target mRNA is from the same extraction (different than GC definition reviewed earlier).
Biological Replicates: mRNA samples from different individuals, different cell lines.
![Page 304: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/304.jpg)
Where to spend resources ?
Gary Churchill (NG Reviews, December 2002):
Correlation from duplicate spots on one array (~95 %)
Same target to multiple arrays (~60-80 %)
Samples from individual inbred mice (~30 %)
Yang & Speed (NG Reviews, August 2002):
Technical replicates generally involve a smaller degree of variation in measurements than biological replicates.
![Page 305: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/305.jpg)
Where to spend resources ?
Gary Churchill (NG Reviews, December 2002):
When measurement is expensive, it is preferable to add experimental units rather than technical replicates.
When the variability of measurements exceeds the variability between experimental units, technical replication can increase precision.
When variability between experimental samples is large and units are not too costly, it may be worthwhile to pool samples.
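These trade-offs can be explored numerically with a simple nested variance model. The sketch below assumes, beyond what the slides state, that the mean over n experimental units with m technical replicates each has variance σ²_bio/n + σ²_tech/(n·m); the function name `variance_of_mean` and the variance values are hypothetical.

```python
def variance_of_mean(n_units, m_reps, var_bio, var_tech):
    """Variance of the overall mean under a nested (unit / technical replicate) model."""
    return var_bio / n_units + var_tech / (n_units * m_reps)


# Hypothetical variance components and a fixed budget of 8 arrays.
VAR_BIO, VAR_TECH = 1.0, 1.0
allocations = [(8, 1), (4, 2), (2, 4)]  # (experimental units, technical replicates)

results = {
    (n, m): variance_of_mean(n, m, VAR_BIO, VAR_TECH) for n, m in allocations
}
for (n, m), value in results.items():
    print(f"{n} units x {m} technical replicates: variance = {value:.3f}")
```

For a fixed number of arrays, the allocation with the most experimental units gives the smallest variance under this model, which is consistent with preferring additional experimental units when arrays are the limiting cost.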
![Page 306: Statistical Methods for Microarrays Christina Kendziorski Landon Sego Department of Biostatistics and Medical Informatics University of Wisconsin-Madison](https://reader038.vdocuments.site/reader038/viewer/2022110210/56649e695503460f94b65542/html5/thumbnails/306.jpg)
Summary
In short,
Identify goals of the experiment (which comparisons are most important ?)
Identify options
Calculate variability associated with all options
Research options to see how they work in practice !!
Choose design based on variability, feasibility, and cost.