Case Study I: Two-Sample Analysis Ru-Fang Yeh October 23, 2004 Genentech Hall Auditorium, Mission Bay, UCSF

Page 1: Case Study I: Two-Sample Analysis

Case Study I: Two-Sample Analysis

Ru-Fang Yeh

October 23, 2004 Genentech Hall Auditorium, Mission Bay, UCSF

Page 2: Case Study I: Two-Sample Analysis

Microarrays: Case Studies and Advanced Analysis

[Figure: analysis workflow diagram. Stages shown: Biological question; Experimental design; Microarray experiment; Image analysis; Preprocessing (Normalization); Quality Measurement (Pass / Failed); Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation; Annotation.]

Expression data matrix (genes by sample/condition):

Gene     1       2       3       4      …
1        0.46    0.30    0.80    1.51   …
2       -0.10    0.49    0.24    0.06   …
3        0.15    0.74    0.04    0.10   …
:        …

Page 3: Case Study I: Two-Sample Analysis

Image analysis output:
• Short-oligonucleotide chips: CEL, CDF files
• Two-color spotted arrays: gpr, gal files
• Array CGH: UCSF spot file

Quality assessment and pre-processing:

Short-oligonucleotide chip data:
• quality assessment,
• background correction,
• probe-level normalization,
• probe set summary.

Two-color spotted array data:
• quality assessment; diagnostic plots,
• background correction,
• array normalization.

Array CGH data:
• quality assessment; diagnostic plots,
• background correction,
• clones summary,
• array normalization.

Result: probes by sample matrix of log-ratios or log-intensities.

Analysis of expression data:
• identify D.E. genes, estimation and testing,
• clustering,
• discrimination, etc.

Page 4: Case Study I: Two-Sample Analysis

Biological Question: Molecular Phenotypic Difference in Rat Alveolar Type I and Type II Cells

From “Freshly-isolated Rat Alveolar Type I Cells, Type II Cells, and Cultured Type II Cells Have Distinct Molecular Phenotypes.” (To appear, A J Phys) By Robert Gonzalez, Yee Hwa Yang, Chandi Griffin, Lennell Allen, Zachary Tigue, and Leland Dobbs.

Page 5: Case Study I: Two-Sample Analysis

Pulmonary Alveolar Epithelium

Type I Cells

Type II Cells

Page 6: Case Study I: Two-Sample Analysis

Alveolar Epithelial Type I and Type II Cells

                                Type I Cells    Type II Cells
% Lung cells                    ~8%             ~15%
% Lung internal surface area    ~98%            ~2%
Volume / cell                   ~2000 µm3       ~400 µm3
Surface area / cell             ~5300 µm2       ~100 µm2

(Stone, AJRCMB 1992)

Known/Possible Functions:

Type I Cells:
- water and ion transport
- host defense (oxidants & microorganisms)
- tumor suppression
- matrix preservation

Type II Cells:
- surfactant metabolism
- ion transport
- produce immune effector molecules
- progenitor cells for TI cells after oxidant injury (and in lung development)

• Morphologic characteristics conserved across the entire range of mammals.

Page 7: Case Study I: Two-Sample Analysis

Alveolar Epithelial Cell Lineage Following Lung Injury

[Diagram labels: Type II cell; Proliferation; Type I cell; Transdifferentiation. Evans, 1975; Adamson, 1975.]

Transdifferentiation: The process by which one “stable” (differentiated) cellular phenotype changes into a different “stable” cellular phenotype.

Page 8: Case Study I: Two-Sample Analysis

Study Objectives

Long term goals: Increase understanding of
• alveolar epithelial cell lineages.
• the mechanisms that regulate alveolar epithelial development and differentiation.

Use microarrays to establish molecular profiles of TI and TII cells:
• Identification of differences in expression of single genes to
  - provide additional marker genes
  - develop new hypotheses about cellular functions of each cell type
• To determine changing patterns of expression of groups of genes
  - to understand processes of (trans)-differentiation in vivo and in vitro
  - to identify candidate factors (transcription cascades) important in regulating differentiation

Page 9: Case Study I: Two-Sample Analysis

Gene Expression Experiment

TII Cells Cultured TII Cells

TI Cells

Page 10: Case Study I: Two-Sample Analysis

Freshly Isolated TI and TII Cells

TII CELLS TI CELLS

TI cell fragment

TII cell

Page 11: Case Study I: Two-Sample Analysis

Type II Cells in vitro

• Matrix (EHS, contracted collagen gels)
• Soluble factors (ex: KGF)
• Apical surface exposed to air
• Mechanical contraction

• Matrix (TCP, fibronectin)
• Apical surface covered by liquid
• Mechanical distention

Page 12: Case Study I: Two-Sample Analysis

Experimental design

• Probe: Affymetrix Rat U34 chip A, with 8799 probe sets.

• Target: 4 biological replicates of each cell type:
  - TID0: freshly isolated TI cells
  - TIID0: freshly isolated TII cells
  - TIID7: cultured TII cells (for 7 days) [traditionally used as a model for TI day 0 cells]

Cell purity criterion: < 2% cross-contamination

Page 13: Case Study I: Two-Sample Analysis

Preparing mRNA samples:

Dissection of tissue

RNA Isolation

Amplification

Probe labelling

Hybridization

Page 14: Case Study I: Two-Sample Analysis

Preparing mRNA samples:

Dissection of tissue

RNA Isolation

Amplification

Probe labelling

Hybridization

Biological Replicate

Page 15: Case Study I: Two-Sample Analysis

Preparing mRNA samples:

Dissection of tissue

RNA Isolation

Amplification

Probe labelling

Hybridization

Technical replicate

Page 16: Case Study I: Two-Sample Analysis

Analysis Aims

Main Questions: Establish molecular profiles of TI and TII cells:

1. Identify differences in expression of single genes to:
   - provide additional marker genes
   - develop new hypotheses about cellular functions of each cell type.

2. Understand the process of (trans)-differentiation in vivo and in vitro.

3. Identify candidate factors (transcription cascades) important in regulating differentiation.

Approaches:

1. Identify differentially expressed (DE) genes between TID0 and TIID0.

2. Compare TID0 and TIID7: are they similar?

3. Find common regulatory elements (transcription factor binding sites) in groups of candidate co-regulated genes.

Page 17: Case Study I: Two-Sample Analysis

[Figure: analysis workflow diagram, repeated from Page 2: Biological question; Experimental design; Microarray experiment; Image analysis; Preprocessing (Normalization); Quality Measurement (Pass / Failed); Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation; genes by sample/condition expression matrix; Annotation.]

Page 18: Case Study I: Two-Sample Analysis

Preprocessing

• Quality Assessment.
• Background subtraction.
• Normalization.
• Summarization of probe sets value.

Page 19: Case Study I: Two-Sample Analysis

High Density Oligonucleotide Arrays (Affymetrix)

[Figure: GeneChip Probe Array (1.28 cm) and image of hybridized probe array.]

• ~500,000 probe cells on each chip
• Millions of copies of a specific oligonucleotide probe per probe cell (24 µm)
• Hybridized probe cell: single stranded, labeled RNA target bound to the oligonucleotide probe

Page 20: Case Study I: Two-Sample Analysis

How Affymetrix Arrays Are Made

Figure from Lipshutz et al. Nat. Gen. 1999.

Page 21: Case Study I: Two-Sample Analysis

mRNA reference sequence (5’ to 3’):

…TCGTCTGTATCACAGACACAAAGTTGACTG…

PM: CAGACATAGTGTCTGTGTTTCAACT
MM: CAGACATAGTGTGTGTGTTTCAACT

For one gene (probe set): 16 probes/gene for Rat U34

[Figure: fluorescent probe intensity for the PM and MM probes of one probe set.]
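The PM/MM pair above can be compared directly. The short sketch below uses only the two probe sequences printed on the slide; the helper name `mismatch_positions` is made up for illustration. It locates where the mismatch (MM) probe differs from the perfect-match (PM) probe.

```python
# Probe sequences copied from the slide above.
pm = "CAGACATAGTGTCTGTGTTTCAACT"
mm = "CAGACATAGTGTGTGTGTTTCAACT"

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_positions(a, b):
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


positions = mismatch_positions(pm, mm)
print(len(pm), positions)  # 25 [13]
```

For this pair the probes are 25 bases long and differ only at the middle (13th) base, where the MM probe carries the complement of the PM base.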

Page 22: Case Study I: Two-Sample Analysis

[Diagram: Affymetrix data files and software.]

Files:
• DAT file (hybridization + scanning)
• CEL file (image analysis)
• CDF file
• CHP file: intensity value, Absent / Present call
• Report file: quality
• Text file: probe ID + log2(intensity)
• Excel file

Software: RMA, GCRMA, MAS, GCOS, dChip

Preprocessing:
0. Quality Assessment.
1. Background subtraction (B).
2. Normalization (N).
3. Summarization of probe sets values (S).

Page 23: Case Study I: Two-Sample Analysis

Quantile Normalization: Bolstad et al (2003)

• Quantile normalization is a method to make the distribution of probe intensities the same for every chip.

• The normalization distribution is chosen by averaging each quantile across chips.
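The two bullets above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Bioconductor implementation: the function name and the toy intensities are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(chips):
    """Quantile-normalize a list of chips (each a list of probe intensities)."""
    n = len(chips[0])
    if any(len(chip) != n for chip in chips):
        raise ValueError("all chips must have the same number of probes")
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: average of each quantile (rank) across chips.
    reference = [sum(chip[r] for chip in sorted_chips) / len(chips) for r in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices, low to high
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized


# Hypothetical intensities: 3 chips, 4 probes each.
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
for chip in normalized:
    print(chip)
```

After normalization every chip contains exactly the same set of values (the averaged quantiles); only the assignment of those values to probes differs from chip to chip.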

Page 24: Case Study I: Two-Sample Analysis

Probe Set Summarization: Robust Multi-array Average -- Irizarry et al (2003)

• The RMA model assumes that each probe cell (PM) is made up of

  Log2 Normalized (Observed Intensity - Background) = Chip effect + Probe-specific effect + error

• The expression level is estimated using a robust procedure (such as median polish or IRLS) to fit the above linear model.

• RMA values: log2 expression for chip i (the chip effect).
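As a rough illustration of the median-polish option named above, the sketch below fits an additive "chip effect + probe effect" model to one probe set. It assumes the input is already background-corrected, normalized and log2-transformed; the function name and the toy data are hypothetical, and real RMA implementations differ in detail.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-8):
    """Fit value = overall + probe effect + chip effect + residual.

    Rows are probes, columns are chips. Returns the chip-level summaries
    (overall + chip effect), the probe effects and the residuals.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the row (probe) medians out of the residuals.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep the column (chip) medians out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return [overall + c for c in col_eff], row_eff, resid


# Hypothetical probe set: 3 probes x 3 chips, built as probe effect + chip level.
probe_effects = [-1.0, 0.0, 1.0]
chip_levels = [7.0, 8.0, 10.0]
data = [[p + c for c in chip_levels] for p in probe_effects]
expression, fitted_probe_effects, residuals = median_polish(data)
print(expression)
```

On this purely additive toy data the procedure recovers the chip levels 7, 8 and 10 with zero residuals. With real data, outlying probes end up with large residuals instead of distorting the chip summary.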

Page 25: Case Study I: Two-Sample Analysis

Summarization Method Comparison: AffyComp http://affycomp.biostat.jhsph.edu/

[Figure panels: median SD across replicates; average false positives if we use fold-change > 2 as a cut-off.]

Page 26: Case Study I: Two-Sample Analysis

Software

• Affymetrix: MAS v5.1 or GCOS v1.0

• RMA (Robust Multi-array Average) / GCRMA / PLM:
  - Bioconductor http://www.bioconductor.org
  - affylmGUI http://bioinf.wehi.edu.au/affylmGUI/
  - RMAExpress http://stat.www.berkeley.edu/~bolstad/RMAExpress/RMAExpress.html
  - Axon: Acuity (RMA only, commercial)
  - GeneTraffic (RMA only, commercial)

• Li & Wong’s MBEI (Multiplicative Model-Based Expression Index):
  - dChip http://www.dchip.org/

Page 27: Case Study I: Two-Sample Analysis

Qualitative Quality Assessment Using PLM

[Figure panels: Weights; Residuals.]

More QC Examples: http://stat-www.berkeley.edu/users/bolstad/PLMImageGallery/index.html

Page 28: Case Study I: Two-Sample Analysis

QC with affyPLM

Page 29: Case Study I: Two-Sample Analysis

QC with boxplots

Page 30: Case Study I: Two-Sample Analysis

RMAExpress

Page 31: Case Study I: Two-Sample Analysis

affylmGUI

Page 32: Case Study I: Two-Sample Analysis

[Figure: analysis workflow diagram, repeated from Page 2: Biological question; Experimental design; Microarray experiment; Image analysis; Preprocessing (Normalization); Quality Measurement (Pass / Failed); Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation; genes by sample/condition expression matrix; Annotation.]

Page 33: Case Study I: Two-Sample Analysis

Analysis

1. Identify differentially expressed (DE) genes between TID0 and TIID0.

2. Compare TID0 and TIID7.

3. Beyond expression.

Page 34: Case Study I: Two-Sample Analysis

DE by Average Fold-Change (M): Freshly Isolated TI vs TII Cells

[Figure: plot of M against A for ~8800 probe sets, with 2x and 4x fold-change thresholds marked in both the TI and TII directions.]

• 50 + 131 probe sets change by > 4x (|M| > 2)
• 163 + 401 probe sets change by > 2x

Simple fold-change rules give no assessment of statistical significance.

Need to construct test statistics incorporating variability estimates (from replicates).
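A fold-change rule of this kind is easy to state in code. The sketch below assumes log2 expression values, takes M as mean(TI) minus mean(TII) and A as the average of the two group means (the sign convention is an assumption, not taken from the slide), and uses made-up values for three hypothetical probe sets.

```python
from statistics import mean


def m_and_a(ti_values, tii_values):
    """Return (M, A) for one probe set from log2 expression values."""
    mean_ti, mean_tii = mean(ti_values), mean(tii_values)
    m = mean_ti - mean_tii        # average log2 fold-change, TI relative to TII
    a = (mean_ti + mean_tii) / 2  # average log2 expression
    return m, a


# Hypothetical log2 expression values, 4 replicates per cell type.
data = {
    "probe_1": ([10.0, 10.2, 9.8, 10.0], [7.5, 7.7, 7.3, 7.5]),
    "probe_2": ([7.0, 7.1, 6.9, 7.0], [6.9, 7.0, 7.1, 7.0]),
    "probe_3": ([8.0, 8.5, 7.5, 8.0], [9.2, 9.4, 9.0, 9.2]),
}

calls = {}
for probe, (ti, tii) in data.items():
    m, a = m_and_a(ti, tii)
    # On the log2 scale, |M| > 1 is a > 2x change and |M| > 2 is a > 4x change.
    calls[probe] = {"M": m, "A": a, "gt_2x": abs(m) > 1, "gt_4x": abs(m) > 2}

for probe, call in calls.items():
    print(probe, call)
```

Note that the rule looks only at the group means: the spread of the replicates never enters, which is the limitation the slide points out.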

Page 35: Case Study I: Two-Sample Analysis

Two-sample t-statistic & p-value

• The two-sample t-statistic

$$T = \frac{\bar{X}_1 - \bar{X}_2}{s\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

is used to test equality of the group means $\mu_1$, $\mu_2$.

• The p-value p* is the probability that, under the null hypothesis ($H_0: \mu_1 = \mu_2$), the test statistic is at least as extreme as the observed value t*.

[Figure: null distribution of T with tail areas p*/2 beyond -t* and t*.]
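The statistic can be computed by hand for one gene. The sketch below assumes that s is the pooled standard deviation of the two groups (the equal-variance form); the function name and the data are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x1, x2):
    """Pooled-variance two-sample t-statistic."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2))


# Hypothetical log2 expression values for one gene, 4 replicates per group.
group1 = [1.0, 2.0, 3.0, 4.0]
group2 = [3.0, 4.0, 5.0, 6.0]
t_stat = two_sample_t(group1, group2)
print(round(t_stat, 4))  # -2.1909
```

The p-value would then come from a t distribution with n1 + n2 - 2 degrees of freedom, or from a permutation distribution.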

Page 36: Case Study I: Two-Sample Analysis

More Two-Sample Statistics

Perform statistical tests on normalized, log-transformed data:

• Standard t-test: assumes normally distributed data in each class (!), equal variances within classes
• Welch t-test: as above, but allows unequal variances
• Wilcoxon test: non-parametric, rank-based
• Permutation test: estimate the distribution of the test statistic under the null hypothesis by permutations of the sample labels
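The permutation idea in the last bullet can be sketched as follows. With 4 replicates per group there are only 70 ways to relabel the samples, so the sketch enumerates all of them instead of sampling. The difference in means is used as the test statistic here for simplicity (an assumption; a t-statistic could be substituted), and the data are made up.

```python
from itertools import combinations
from statistics import mean


def permutation_pvalue(x1, x2):
    """Two-sided permutation p-value for the difference in group means."""
    pooled = list(x1) + list(x2)
    n1 = len(x1)
    observed = abs(mean(x1) - mean(x2))
    hits = total = 0
    # Every choice of n1 samples for group 1 is one relabelling.
    for chosen in combinations(range(len(pooled)), n1):
        chosen_set = set(chosen)
        g1 = [pooled[i] for i in chosen_set]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Hypothetical, perfectly separated groups.
p_value = permutation_pvalue([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
print(p_value)
```

Even for perfectly separated groups the result is 2/70 (about 0.029), the smallest two-sided p-value a 4 vs 4 design can produce. This matters once p-values have to be adjusted for thousands of genes.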

Page 37: Case Study I: Two-Sample Analysis

When there are few replicates…

• (Fold-change) Averages can be driven by outliers.
• T-statistics can be driven by tiny variances.

[Figure labels: M (average fold-change); M / se(M) (t-statistic).]

Solution: “robust” version of t-statistic
- Replace mean by median
- Replace standard deviation by median absolute deviation

Page 38: Case Study I: Two-Sample Analysis

Alternative Test Statistics

1. Penalized-t

Trying to find a compromise between solely using t and solely using the mean. There are several similar solutions of the following form:

$$t^* = \frac{M}{(s + a)/\sqrt{n}}$$

where s = standard deviation.

Question: how to estimate a?
- 90th percentile of standard deviations (s values). Efron et al (2000).
- the value that minimizes the coefficient of variation (cv) of the absolute t-values (SAM). Tusher et al (2001)
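The effect of the penalty a can be seen on a toy example. The sketch below assumes the one-sample form on the slide (M is the mean of n replicate log-ratios for a gene) and picks a as the 90th percentile of the per-gene standard deviations, computed with `statistics.quantiles`. Gene names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, quantiles, stdev


def penalized_t(values, a):
    """t* = M / ((s + a) / sqrt(n)); a = 0 gives the ordinary t-statistic."""
    n = len(values)
    return mean(values) / ((stdev(values) + a) / sqrt(n))


# Hypothetical replicate log-ratios for three genes.
genes = {
    "tiny_variance": [0.30, 0.31, 0.30, 0.29],
    "large_effect": [2.0, 2.5, 1.5, 2.2],
    "noise": [0.5, -0.4, 0.1, -0.3],
}

s_values = [stdev(v) for v in genes.values()]
a = quantiles(s_values, n=10)[-1]  # 90th percentile of the s values

ordinary = {g: penalized_t(v, 0.0) for g, v in genes.items()}
penalized = {g: penalized_t(v, a) for g, v in genes.items()}
for g in genes:
    print(g, round(ordinary[g], 2), round(penalized[g], 2))
```

The gene with a small mean but a tiny variance has by far the largest ordinary t, while the penalized version ranks the gene with the large effect first.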

Page 39: Case Study I: Two-Sample Analysis

Other Statistics (cont.)

2. Moderated t-statistics (G Smyth 2004, Limma):

$$t^* = \frac{M}{\tilde{s}/\sqrt{n}}$$

where $\tilde{s}$ is the shrinkage estimate of standard deviation:

$$\tilde{s}^2 = \frac{s^2 d + s_0^2 d_0}{d + d_0}$$

Here s is the sd for gene i, $s_0$ is the pooled sd from all genes, and d and $d_0$ are the corresponding degrees of freedom.

Estimation is done using an extension to the empirical Bayes method in Lönnstedt & Speed (2002).
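The shrinkage formula can be tried out directly. In limma the prior values s0 and d0 are estimated from all genes by empirical Bayes; the sketch below skips that step and simply takes them as given inputs, so the numbers are purely hypothetical.

```python
from math import sqrt


def moderated_t(m, s, n, d, s0, d0):
    """Moderated t: the gene-wise variance s^2 is shrunk toward the prior s0^2."""
    s_tilde_sq = (d * s**2 + d0 * s0**2) / (d + d0)
    return m / (sqrt(s_tilde_sq) / sqrt(n))


# Hypothetical gene with a small mean log-ratio and a tiny standard deviation.
m, s, n, d = 0.3, 0.01, 4, 3
ordinary_t = moderated_t(m, s, n, d, s0=0.0, d0=0)  # no shrinkage: ordinary t
shrunken_t = moderated_t(m, s, n, d, s0=0.4, d0=4)  # assumed prior values
print(round(ordinary_t, 2), round(shrunken_t, 2))
```

With d0 = 0 the formula reduces to the ordinary t-statistic (60 here). With the assumed prior the tiny gene-wise variance is pulled toward s0 and the statistic drops to about 2.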

Page 40: Case Study I: Two-Sample Analysis

Other Statistics (cont.)

3. B-statistic: log posterior odds ratios

$$B = \log \frac{\Pr(\text{gene } i \text{ IS DE})}{\Pr(\text{gene } i \text{ IS NOT DE})}$$

Equivalent to moderated-t in terms of ranking genes.

4. Single-channel methods modeling absolute gene-expression levels:
- Newton et al 2001: log-intensities ~ Gamma
- Wolfinger et al 2001: linear mixed model on log-intensities

5. Composite methods: Differential Expressed genes via Distance Synthesis (Yang et al 2004) chooses genes that are extreme on all measures by defining a “distance” statistic based on measures of choice.

Page 41: Case Study I: Two-Sample Analysis

DE by Fold Changes, (limma) Moderated-t, B (lods)

Page 42: Case Study I: Two-Sample Analysis

Assessing Significance I: Diagnostic Plots

Page 43: Case Study I: Two-Sample Analysis

Assessing Significance II: Testing

Univariate hypothesis testing: For a single gene, test the null hypothesis

H0 : the gene is NOT differentially expressed.

A p-value can be generated via theory or permutation tests.

Is this p-value correct?

• Yes, if only looking at ONE gene…
• Expect 10,000 × 0.01 = 100 genes with p-value < 0.01 among 10,000 non-DE genes!
  -- clearly we can’t just use standard p-value thresholds (.05, .01)!
• Need to adjust p-values for meaningful interpretation!
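The "100 false positives" arithmetic can be checked with a small simulation. It assumes that p-values of non-DE genes are independent and uniform on (0, 1), which is the idealized null case; the seed is arbitrary.

```python
import random

random.seed(0)  # arbitrary seed, only to make the run repeatable

m = 10_000    # number of non-DE genes
alpha = 0.01  # unadjusted p-value threshold

# Under the null hypothesis, p-values are uniform on (0, 1).
p_values = [random.random() for _ in range(m)]
false_positives = sum(p < alpha for p in p_values)

print("expected:", m * alpha, "observed:", false_positives)
```

The observed count varies from run to run but stays close to the expected 100, although no gene in the simulation is differentially expressed.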

Page 44: Case Study I: Two-Sample Analysis

(Unadjusted) p-values of moderated-t

Page 45: Case Study I: Two-Sample Analysis

Multiple Hypothesis Testing

[Table: hypotheses classified as H0 (null) or Ha (alternative).]

Page 46: Case Study I: Two-Sample Analysis

Type I Error Rates (False Positives)

• Family-Wise Error Rate (FWER)

  FWER = Pr(V > 0) = Pr(at least one false positive)

• False Discovery Rate (FDR) -- The FDR (Benjamini & Hochberg 1995) is the expected proportion of type I errors among the rejected hypotheses.

$$\mathrm{FDR} = E(Q), \qquad Q = \begin{cases} V/R, & \text{if } R > 0 \\ 0, & \text{if } R = 0 \end{cases}$$

where V is the number of false positives and R is the number of rejected hypotheses.

Page 47: Case Study I: Two-Sample Analysis

Multiple Testing: Controlling a Type I Error Rate

AIM:

For a given type I error rate α, use a procedure to select a set of “significant” genes that guarantees a type I error rate ≤ α.

Page 48: Case Study I: Two-Sample Analysis

Adjusted p-values: Controlling the FWER

• The Bonferroni correction: $\tilde{p}_g = \min(m\,p_g,\ 1)$; most conservative adjustment.

• Šidák: $\tilde{p}_g = 1 - (1 - p_g)^m$; assumes independence among genes.

• minP (Westfall & Young):

$$\tilde{p}_g = \Pr\left(\min_{k=1,\ldots,m} P_k \le p_g \mid H_0\right)$$

  estimated through permutation; allows dependency between genes.

• maxT: replace $p_g$ by test statistics $T_g$, min by max. Less computationally intensive than minP.

• Step-down and step-up variants.

Choosing all genes with adjusted p-value $\tilde{p}_g \le \alpha$ controls the FWER at level α.
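The two closed-form adjustments above are short enough to write out. The sketch below is a plain-Python illustration with made-up p-values. The permutation-based minP and maxT procedures are not shown.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: min(m * p, 1)."""
    m = len(p_values)
    return [min(m * p, 1.0) for p in p_values]


def sidak(p_values):
    """Sidak-adjusted p-values: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical unadjusted p-values for m = 4 genes.
raw = [0.0001, 0.004, 0.03, 0.5]
bonf = bonferroni(raw)
sid = sidak(raw)
for p, b, s in zip(raw, bonf, sid):
    print(p, b, round(s, 6))
```

The Šidák values are never larger than the Bonferroni values, and for small p-values the two are nearly identical.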

Page 49: Case Study I: Two-Sample Analysis

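Controlling the FDR (Benjamini/Hochberg)

• Order unadjusted p-values: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.

• To control FDR = E(V/R) at level α, reject the hypotheses $H_{(1)}, \ldots, H_{(j^*)}$, where $j^* = \max\{ j : p_{(j)} \le \frac{j}{m}\alpha \}$.

• Adjusted p-values: $\tilde{p}_{(j)} = \min_{k=j,\ldots,m} \min\left(\frac{m}{k}\,p_{(k)},\ 1\right)$.

• Interpretation: expect 5% false positives among genes with FDR-adjusted p-values < 0.05.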
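The Benjamini-Hochberg adjustment can be sketched in a few lines of plain Python. The function name and the example p-values are illustrative; this mirrors the standard step-up definition, not any particular package's code.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of m * p_(k) / k over all ranks k at or above the current rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
print([round(p, 4) for p in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```

Selecting genes with an adjusted p-value at or below α is equivalent to applying the step-up rule at level α.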

Page 50: Case Study I: Two-Sample Analysis

Adjusted p-values

p=0.01

Page 51: Case Study I: Two-Sample Analysis

Adjusted p-values

p=0.01

Page 52: Case Study I: Two-Sample Analysis

Identify DE Genes: TI vs TII Cells

1. Select a statistic which will rank the genes in order of strength of the evidence for differential expression, from strongest to weakest.

2. Choose a critical value for the ranking of statistics above which any value is considered to be significant.

• Number of estimated DE genes

Comparison       B-statistics   Bonferroni-Adj.    Median Fold-change   Median Fold-change
                 > 0            p-value < 0.01     > 4x                 > 2x
TIID0 vs TID0    1500           538                138 + 68 = 206       415 + 193 = 608
TIID7 vs TID0    1411           295                80 + 83 = 163        528 + 210 = 738

Page 53: Case Study I: Two-Sample Analysis

FWER or FDR?

• Choose FWER if high confidence in ALL selected genes is desired (for example, selecting candidate genes for RT-PCR validation). Loss of power due to strong control of type-I error.

• Use more flexible FDR procedures if certain proportions of false positives are tolerable (e.g. gene discovery, selecting candidate co-regulated gene sets for GO/pathway analysis).

Page 54: Case Study I: Two-Sample Analysis

Analysis

1. Identify differentially expressed (DE) genes between TID0 and TIID0.

2. Compare TID0 and TIID7.

3. Beyond differential expression…

Page 55: Case Study I: Two-Sample Analysis

[Figure: analysis workflow diagram, repeated from Page 2: Biological question; Experimental design; Microarray experiment; Image analysis; Preprocessing (Normalization); Quality Measurement (Pass / Failed); Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation; genes by sample/condition expression matrix; Annotation.]

Page 56: Case Study I: Two-Sample Analysis

What is next ?

• Further experiments: RT-PCR, immunohistochemistry, ELISA…

• Annotation

• Functional categories of DE genes between TID0 and TIID0.
  - Gene Ontology [http://www.geneontology.org]
  - GenMapp [http://www.genmapp.org/]
  - GOStats [http://gostat.wehi.edu.au/]
  - Bioconductor: GOStats and goTools

• Finding upstream regulatory elements with our current experiments alone?
  - Experimental and methodological challenges.

Page 57: Case Study I: Two-Sample Analysis

RT-PCR Verification

Page 58: Case Study I: Two-Sample Analysis

Annotation

Affy ID: L26913_at
GenBank Accession/Refseq: NM_053828, NP_446280
Locuslink: 116553
UniGene: Rn.9921
RGD: Il13
Name: Interleukin 13
Gene Symbol: IL13
Swiss-Prot: P42203
GO:
- GO:0005144 [interleukin-13 receptor binding]
- GO:0005576 [extracellular]
- GO:0006955 [immune response]
Map Position: Chromosome: 10q22, 39.1 Mb
Biochemical pathways (KEGG)
Literature (PubMed): 121624387916615
Nucleotide Sequence: GACAAGCCAGCAGCCTAGGCCAGCCCACAGTTCTACAGCTCCCTGGTTCTCTCACTGGCTCTGGGCTTCATGGCGCTCTGGGTGACTGCAGTCCTGGCTCTTGCTTGCCTTGGTGGTCTCGCCGCCCCAGGGCCGGTGCCAAGATCTGTGTCTCTCCCTCTGACCCTTAAGGAGCTTATTGAGGAGCTGAGCAACATCACACAAGACCAGACTCCCCTGTGCAACGGCAGCATGGTATGG

Page 59: Case Study I: Two-Sample Analysis

GO Functional Category Enrichment

Page 60: Case Study I: Two-Sample Analysis

Can we find common transcriptional regulatory elements/motifs in the co-expressed genes?

[Diagram labels:]
• List of Affy IDs (DE genes) (co-expression co-regulation)
• EZRetrieve, SOURCE…
• Genome Resource: Genbank, Ensembl, UCSC Genome Browser
• Upstream sequences of co-expressed genes + additional data
• TRANSFAC, Expression, Other genomes
• Computational Methods
• Candidate transcription factor binding sites/motifs
• Biological Verification: Chromatin immunoprecipitation…
• Hypotheses of Gene modules, network…

Page 61: Case Study I: Two-Sample Analysis

Transcriptional Regulation

Page 62: Case Study I: Two-Sample Analysis

Challenges for higher eukaryotes

• Getting the right sequences is hard
  - (now minor): Transcription start sites (TSS) could be very far from translation start sites (ATG). Typically undetermined and low prediction accuracy unless full-length cDNAs or 5’ ESTs are available.
  - Regulatory motifs could be anywhere: promoter (TSS proximity), very far 5’ upstream (a few to hundreds of kb), introns, even 3’ downstream.

• Low signal to noise ratio
  - motifs are weak and short: 6-12 bp with 8-9 bits of information (~4 conserved bases)
  - Large target regions yield high false positives

Page 63: Case Study I: Two-Sample Analysis

Computational Motif Finding in Co-expressed Genes
1. Supervised approach: mapping known motif sites

[Diagram labels:]
• 68 TI / 138 TII Affy IDs: marker genes (DE > 4x)
• EZRetrieve, SOURCE…
• 23 / 70 Refseqs: retrieve peptides + repeat masked (-2000,-1) bp from annotated TSS
• Genome Resource: Ensembl/EnsMart www.ensembl.org
• TRANSFAC www.gene-regulation.com
• TFblast
• 0/7 DE TFs; 6 w/ binding site info
• Score matches
• #matches > #expected? Yes*
• 200 non-DE genes
• Candidate regulator (TF) with binding sites in DE genes
• Biological Verification: Chromatin immunoprecipitation…

Page 64: Case Study I: Two-Sample Analysis

Mapping Known DE TF Motifs to DE Genes

TF             Type-I           Type-II
ATF            17 (13) / 14     58 (37) / 44
V-Jun          0 (0) / 0        6 (3) / (<1)
IRF-1          4 (4) / 0        7 (6) / 4
EGR-1/EGR-2    0 (0) / 1        2 (2) / 4

Entries: #matches (#genes) / #expected matches
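The "#matches versus #expected" comparison can be illustrated with a deliberately simplified model. The sketch below counts exact occurrences of a consensus word on one strand and computes the expected count under uniform, independent bases. The study itself scored TRANSFAC binding-site information against a background of non-DE genes, so this shows the idea rather than the method used. The 8-mer, the toy sequence and the function names are illustrative assumptions.

```python
def count_matches(sequence, motif):
    """Count (possibly overlapping) exact occurrences of motif in sequence."""
    k = len(motif)
    return sum(sequence[i:i + k] == motif for i in range(len(sequence) - k + 1))


def expected_matches(seq_length, motif_length, n_sequences=1):
    """Expected exact matches on one strand, assuming uniform independent bases."""
    positions = max(seq_length - motif_length + 1, 0)
    return n_sequences * positions * 0.25 ** motif_length


motif = "TGACGTCA"                     # illustrative 8-mer
toy_sequence = "AATGACGTCATGACGTCA"    # made-up sequence containing it twice
observed = count_matches(toy_sequence, motif)

# 70 upstream regions of 2000 bp each, as in the (-2000,-1) window above.
expected = expected_matches(2000, len(motif), n_sequences=70)
print(observed, round(expected, 2))  # 2 2.13
```

Even an exact 8-mer is expected about twice by chance in 70 regions of 2 kb. Shorter or degenerate motifs match far more often, so observed counts have to be judged against an expected count.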

Page 65: Case Study I: Two-Sample Analysis

Motif Finding in Co-expressed Genes
2. De novo Stochastic Algorithms

[Diagram labels:]
• 68 TI / 138 TII Affy IDs: marker genes (DE > 4x)
• EZRetrieve, SOURCE…
• 23 / 70 Refseqs: retrieve repeat masked (-2000,-1) bp from TSS
• Genome Resource: Ensembl/EnsMart www.ensembl.org
• Gibbs Sampler, AlignACE, MEME
• Candidate sequence motifs and sites in DE genes
• Computational Verification: Match known TFBS?
• TRANSFAC www.gene-regulation.com
• Biological Verification

Page 66: Case Study I: Two-Sample Analysis

Top 12 motifs by AlignACE (default parameters, background %GC =50)

Type-I Type-II

Page 67: Case Study I: Two-Sample Analysis

Motif Finding in Co-expressed Genes

1. Mapping known motif sites.
   Input: Subsets of sequences + known binding sites.
   Limited by known sites & lots of false positives.

2. De novo motif finding algorithm.
   Input: Subsets of sequences.
   Need a good filter to arrive at a good subset of sequences. Mostly stochastic, so harder to translate results.

3. Regression methods on expression data (REDUCE: Bussemaker et al 2001)
   Input: Expression data + corresponding upstream sequences.
   Usually Y ~ X where Y: expression data and X: words/motifs.

4. Phylogenetic Footprinting/Shadowing (Vista: Loots et al 2002)
   Input: Subset of upstream sequences of orthologous genes.
   Can’t find organism-specific sites (an estimated 32-40% of human sites are not functional in mouse), but this could be compensated for by using various species for resolution.

Page 68: Case Study I: Two-Sample Analysis

Conclusion

[Workflow diagram, annotated for this case study:]
• Biological Question: Alveolar TI vs TII Cells
• Experimental design: Affy arrays
• Microarray experiment
• Preprocessing: Quantile Normalization; RMA Summarization
• Quality Measurement: Pass / Failed
• Expression data matrix (genes by sample/condition)
• Ranking genes for DE: fold-change, moderated-t, lods
• Adjust p-values for multiple testing
• Finding TFBS for co-expressed genes
• Biological Verification & Interpretation: TI TII (D0 or D7), candidate regulator…