microarrays wednesday, march 1, 2006 dr. tim hughes ccbr – 160 college st. – room 1302...

Microarrays
Wednesday, March 1, 2006

Dr. Tim Hughes
CCBR – 160 College St. – Room [email protected]

Outline:
• Microarray experiments
• Normalization
• Different types of microarrays
• Other applications besides expression profiling
• Clustering and interpretation

Suggested reading

Eisen et al., 1998

Hartigan, J.A., Clustering Algorithms, Wiley, New York and London (1975). My understanding is that it is no longer in print, but is available on CD.

Jain et al., ACM Computing Surveys, 31(3) 1999, “Data Clustering: a review”. (http://www.amk.alt-neustadt.at/diplom/papers/Clustering/p264-jain.pdf)

Hegde et al., A concise guide to cDNA microarray analysis. Biotechniques. 2000 Sep;29(3):548-50, 552-4, 556.

Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol. 2000 Apr;12(2):201-5.

www.accessexcellence.org/AB/GG/nucleic.html

Nucleic Acid Hybridization

Microarray expression profiling by 2-color assay (“cDNA arrays”)

Array: PCR products, 6250 yeast ORFs

Hybridized cDNAs: green = control, red = experiment

*Schena et al., 1995

“cDNA microarrays” are essentially dot-blots on glass slides

http://arrayit.com/Products/Printing/Stealth/stealth.html

• This slide was made with 16 pins
• 4.5 mm pin spacing matches 384-well plates (16 x 24)
• Done with robotics
• Slides usually coated with poly-lysine
• Spots are usually 100-150 microns
• Spot spacing is usually 200-300 microns
• Slides are 25 x 75 mm
• Easy to deposit 20K spots/slide

0.45 mm

Common ways to “label” nucleic acids

Random priming of double-stranded DNA: reaction contains labelled nucleotides.

Poly-T primed cDNA synthesis: reaction contains labelled nucleotides.

Direct labelling (fluors only).

Amplification: cDNA synthesis primed with poly-T carrying a T7 promoter, then “second strand” synthesis; the T7 reaction contains labelled nucleotides.

[Diagrams: poly-A mRNA (AAAAAAAA) annealed to poly-T primers (TTTTTTTTTT, with or without a T7 promoter); * marks the label.]

Typical use of cDNA microarrays: “Internal” normalization using two colors

[Figure: cDNA pools from control and treatment (drug, mutation) samples are hybridized together; transcripts x, y and z illustrate the possible outcomes: up, down, unchanged, not present.]

Excitation

Emission

532 nm laser (green) excites Cy3. Cy3 is detected with an emission filter that passes 557-592 nm.

635 nm laser (red) excites Cy5. Cy5 is detected with an emission filter that passes 650-690 nm.

Both are detected by a photomultiplier tube.

http://www.jacksonimmuno.com/2001site/home/catalog/f-cy3-5.htm

Cy3 NHS Ester

Cy5 NHS Ester

http://www.ope-tech.com/doc/Cy5_structure.htm

The primary data: two grayscale TIFF files

http://www.axon.com/GN_GenePix4000.html

Cy3 channel (“green”)

Cy5 channel (“red”)

Image processing and normalization: what is microarray data?

Microarray data is summary information from image files that come out of the scanner.

Image processing: line up grids, flag bad spots, quantitate.
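As a rough illustration of what "quantitate" produces, the sketch below turns one spot's two channel intensities into the log10 ratio and log10 average intensity used in the plots that follow. It assumes background-subtracted intensities are already available; the function name, the use of the geometric mean as the "average", and the example numbers are illustrative assumptions, not part of the lecture.

```python
import math


def spot_summary(cy5, cy3):
    """Return (log10 ratio, log10 average intensity) for one spot.

    cy5, cy3: background-subtracted intensities (hypothetical inputs).
    Returns None when either channel is non-positive; such spots would
    normally be flagged as bad rather than logged.
    """
    if cy5 <= 0 or cy3 <= 0:
        return None
    # Red (experiment) over green (control), as in the two-color assay.
    log_ratio = math.log10(cy5) - math.log10(cy3)
    # Assumption: "average intensity" is taken as the geometric mean.
    log_intensity = 0.5 * (math.log10(cy5) + math.log10(cy3))
    return log_ratio, log_intensity


# Example: a spot ten times brighter in the red channel.
print(spot_summary(1000.0, 100.0))
```

A spot that is unchanged between the two samples gives a log10 ratio near 0 whatever its brightness, which is why the ratio and the intensity are kept as separate coordinates.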

Looking at data from a single experiment

[Figure: two plots of log10(expression ratio) versus log10(intensity), both axes from -2 to 2, with spots at P-value < 0.01 marked. Panels: 3-AT vs. no drug; wild-type vs. wild-type. Slides: 11120c01 - 11121c01; 11857c01 - 11858c01.]

[Figure: two plots of log10(ratio) versus log10(average intensity), both axes from -2 to 2.]

http://www.mathworks.com/access/helpdesk/help/toolbox/curvefit/ch_data7.html

Lowess smoothing: The names "lowess" and "loess" are derived from the term "locally weighted scatter plot smooth," as both methods use locally weighted linear regression to smooth data.

• Find spots
• Manual edit
• Quantitate
• Normalize (“Lowess smoothing”: locally weighted scatterplot smoothing)
• Confirm spots outside envelope
• Save data, images, spot map
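The normalization step can be sketched as follows: fit a smooth trend of log ratio against log intensity, then subtract it so that the bulk of spots is centred on a ratio of zero at every intensity. This is a simplified locally weighted linear regression with a tricube kernel and no robustness iterations. It is not the algorithm of any particular scanner package, and the function names and the default `frac` are assumptions made for the example.

```python
def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def lowess_fit(x, y, frac=0.3):
    """Locally weighted linear regression, evaluated at every x[i].

    frac is the fraction of points used in each local fit (an assumed
    default). No robustness iterations are performed.
    """
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))  # points per local window
    fitted = []
    for xi in x:
        # Bandwidth = distance to the k-th nearest point (kept above zero).
        h = max(sorted(abs(xj - xi) for xj in x)[k - 1], 1e-12)
        w = [tricube((xj - xi) / h) for xj in x]
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a local mean
        fitted.append(my + slope * (xi - mx))
    return fitted


def lowess_normalize(log_intensity, log_ratio, frac=0.3):
    """Subtract the intensity-dependent trend from the log ratios."""
    trend = lowess_fit(log_intensity, log_ratio, frac)
    return [r - t for r, t in zip(log_ratio, trend)]


# Example: a purely intensity-dependent bias is removed entirely.
xs = [i / 10 for i in range(20)]
biased = [0.5 * v + 0.2 for v in xs]
print(max(abs(v) for v in lowess_normalize(xs, biased)))
```

Spots that remain far from zero after this subtraction are the candidates for real expression changes. This sketch is quadratic in the number of spots, so a library implementation would be the practical choice for a full array.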

Selected tricks for processing and normalization

(1) High-pass spatial detrending. See: O. Shai, Q. Morris, and B.J. Frey, (2003) Spatial Bias Removal in Microarray Images, University of Toronto Technical Report PSI-2003-21, http://www.psi.utoronto.ca/~ofer/detrendingReport.pdf

(2) VSN – “Variance Stabilizing Normalization”. See:

Huber, W., Von Heydebreck, A., Sultmann, H., Poustka, A., & Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18, Suppl 1, S96-S104 (2002).


Other types of arrays

Photolithographic arrays (Affymetrix)

Building up oligonucleotides on a surface:

http://www.affymetrix.com/technology/manufacturing/index.affx

Photolithographic arrays (Affymetrix)

aka “GeneChip”

Arrays are typically 25-mers, with “mismatch” control for specificity

Photolithographic arrays (Affymetrix)

Advantages:

Density is limited essentially by the 5 micron resolution of scanners (solution: larger arrays).

Well-developed protocols.

“Industry standard” (largely self-driven).

Disadvantages:

Not all probes work well. Affymetrix has evolved a complicated system to compensate for this, but even “believers” use at least four probes per gene, and usually more.

Single color.

Sample preparation typically requires amplification.

Single supplier; historically intellectual property issues. (i.e. comparisons)

• 25,000 oligos / 1 x 3 inches

• Sequence completely flexible

• 60-mers


Ink-jet arrays (Agilent)

Hughes TR et al. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol. 2001 Apr;19(4):342-7.

Ink-jet arrays generally agree with spotted cDNA arrays

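Yeast IJS array: ~8 oligos per gene; Spo vs. SC

[Figure: scatter plots of ink-jet array measurements against cDNA array measurements. Panels: multiple oligos vs. cDNA array; single oligo vs. cDNA array. Correlations shown: r = 0.96 and r = 0.97. HXT1, HXT3 and HXT4 are labelled.]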

Ink-jet arrays (Agilent)

Advantages:

User-specified sequences; “no questions asked”

Sensitivity and specificity are defined and exceed requirement for most expression profiling applications; no amplification required

Virtually every 60-mer is functional

Data correlates well with spotted cDNA arrays

Disadvantages:

Density currently limited to ~45,000 spots per array.

Single supplier (although a protocol is in press for making your own synthesizer!)

“Maskless” arrays (Nimblegen)

http://www.nimblegen.com/technology/manufacture.html

Advantages:

User-specified sequences.

Density is limited essentially by the 5 micron resolution of scanners.

Disadvantages:

New to arena. Performance in initial publication (Nuwaysir et al., Genome Research, 2002) suggests that sensitivity and specificity may be lower than that of Agilent arrays.

Single supplier – although all the parts are there for academics to build one.

Possible IP issues. Hybs are done in Iceland to bypass Affy IP. Nimblegen web site boasts of new partnership with Affymetrix.

“Maskless” arrays (Nimblegen)

Applications beyond expression profiling

• DNA copy number

• Genotyping

• Protein-DNA associations

• Molecular “Barcoding”

• Protein arrays

• Transformation arrays

Identifying DNA binding sites

Science 2000 Dec 22;290(5500):2306-9 Genome-wide location and function of DNA binding proteins.

Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA.

Analysis of multiple experiments

• Comparisons

• Clustering

• Predicting gene functions

• Finding promoter elements

Comparing data from two experiments

[Figure: two scatter plots of log10(ratio), both axes from -1.5 to 1.5. First panel: cup5 / WT versus vma8 / WT, r = 0.88, with VMA8 and CUP5 labelled. Second panel: cup5 / WT versus mrt4 / WT, r = 0.09, with CUP5 and MRT4 labelled.]

The behavior of two genes over many experiments can be compared in the same fashion

scatter plot of ratios (intensity not displayed)
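The r values quoted on these scatter plots are correlation coefficients between two vectors of log10 ratios. A minimal sketch of the Pearson calculation is given below; the function name and the toy profiles are made up for illustration. The same function works whether the vectors are two experiments measured over many genes or two genes measured over many experiments.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of log ratios."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length vectors with >= 2 values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


# Toy profiles (hypothetical numbers): the second is the first scaled by 2,
# so the correlation is 1 even though the magnitudes differ.
profile_1 = [0.1, 0.5, -0.3, 1.0]
profile_2 = [0.2, 1.0, -0.6, 2.0]
print(pearson_r(profile_1, profile_2))
```

Because correlation ignores overall scale, two mutants with the same pattern of changes but different strengths still score as similar.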

2-D clustering

Step 1: cut experiments and transcripts falling below P-value and ratio thresholds

[Figure: heat map of experiment index versus transcript response index, 44 experiments x 407 genes; color scale from 10-fold repression through 1 to 10-fold induction.]
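One way to picture the thresholding step is sketched below. It keeps only genes, and then only experiments, that contain at least a minimum number of significant measurements. The data layout, the function names and the cut-offs (P < 0.01, 2-fold, at least one hit) are assumptions chosen for the example, not the values used for the figure.

```python
import math


def is_significant(log_ratio, p_value, p_max=0.01, min_fold=2.0):
    """True when a measurement passes both the P-value and ratio cut-offs."""
    return p_value < p_max and abs(log_ratio) >= math.log10(min_fold)


def filter_matrix(data, experiments, min_hits=1):
    """Drop genes and experiments with fewer than min_hits significant values.

    data: {gene: [(log10 ratio, P-value), ...]} with one tuple per
    experiment, in the same order as the experiments list.
    Returns ({gene: [log10 ratio, ...]}, kept experiment names).
    """
    kept_genes = [
        g for g, row in data.items()
        if sum(is_significant(r, p) for r, p in row) >= min_hits
    ]
    kept_cols = [
        j for j in range(len(experiments))
        if sum(is_significant(*data[g][j]) for g in data) >= min_hits
    ]
    matrix = {g: [data[g][j][0] for j in kept_cols] for g in kept_genes}
    return matrix, [experiments[j] for j in kept_cols]


# Hypothetical measurements for three genes in two experiments.
data = {
    "geneA": [(0.5, 0.001), (0.0, 0.5)],
    "geneB": [(0.1, 0.2), (0.05, 0.6)],
    "geneC": [(-0.4, 0.005), (0.02, 0.9)],
}
print(filter_matrix(data, ["exp1", "exp2"]))
```

Only the log ratios are carried forward, since the P-values have done their job once the unresponsive rows and columns are removed.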

2-D clustering

Step 2: cluster experiments and transcripts

[Figure: clustered heat map of experiment index versus transcript response index; color scale from 10-fold repression through 1 to 10-fold induction. Labelled experiment groups: RHO O/X, PKC O/X, ste mutants, treatment with alpha-factor.]

Data from Roberts et al., Science (2000)

There are many types of clustering. One example: K-means (must choose K).

[Figure: K-means example with K = 10; clusters #1, #2 and #3 shown.]

See: Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol. 2000 Apr;12(2):201-5.
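For orientation, here is a bare-bones K-means sketch: pick K starting centroids, assign each expression profile to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. Euclidean distance, random initialization from the data with a fixed seed, and the toy profiles are all assumptions of this example; other distance measures, such as correlation, are also used for expression data.

```python
import random


def kmeans(profiles, k, n_iter=100, seed=0):
    """Cluster equal-length profiles into k groups.

    Returns (labels, centroids). Initial centroids are k profiles drawn
    at random using the given seed.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in profiles
        ]
        if new_labels == labels:
            break  # converged: no profile changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two obvious groups of hypothetical two-experiment profiles.
profiles = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The result can depend on the starting centroids and on K, which is why K-means is usually run several times and with several values of K.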

Basics of clustering freeware: Eisen’s “Cluster” and “Treeview”

Mike Eisen's web site: rana.lbl.gov/EisenSoftware.htm

“Cluster” loads an Excel file (save as tab-delimited text) in the following format:

[Figures: Cluster; Treeview]

(also: “TreeArrange” - http://monod.uwaterloo.ca/downloads/treearrange/)

There are also many commercial programs available.

[Figure: cell schematic showing nucleus, mRNA and protein.]

Microarray expression data:
• Co-regulated groups of genes
• Functional categories
• Predict functions of new genes
• cis, trans regulators

Cluster label | Sample genes
amino acid metabolism | TRP4, HIS3
arginine biosynthesis | ARG1, ARG3
arginine catabolism | CAR1, CAR2
aromatic AA metabolism | ARO9, ARO10
asparagine biosynthesis | ASN1, ASN2
branched chain AA synth | ILV1,2,3,6
lysine biosynthesis | LYS2, LYS9
methionine biosynthesis | MET3,16,28
sulfur AA tnsprt, metab | MUP1, MHT1
adenine biosynthesis | ADE1,4,8
aldehyde metabolism | AAD4,14,16
biotin biosynthesis | BIO3,4
citrate metabolism | CIT1,2
ergosterol biosynthesis | ERG1,5,11
fatty acid biosynthesis | FAS1, FAS2
gluconeogenesis | PGK1, TDH1,2,3
NAD biosynthesis | BNA4,6
one-carbon metabolism | GCV1,2,3
pyridoxine metabolism | SNO1, SNZ1
thiamin biosynthesis 1 | THI5,12
thiamin biosynthesis 2 | THI2,20
hexose transport | HXT4, GSY1
sodium ion transport | ENA1,2,5
polyamine transport | TPO2,3
nucleocytoplasmic transport | KAP123, NUP100
ribosome/RNA biogenesis | MAK16, CBF5
ribosomal proteins | RPS1A, RPL28
translational elongation | TEF1,2
protein folding | SSA1, HSP60
secretion | VTH1, KRE11
protein glycosylation | ALG6, CAX4
vesicle-mediated transport | VPS5, IMH1
proteasome | RPN6, RPT5
vacuole fusion | VTC1,3,4, PHO84
mitoribosome/respiration | MRPL1, MRPS5
Mitochond. electron trans. | ATP1, COX4
iron transport/TCA cycle | FRE1, FET3
Chromatin/transcription | SNF2, CHD1, DOT6
histones | HTA1, HHF1
MCM2/3/6/CDC47 | MCM2,3,6
DNA replication | RFA1, POL12
mitotic cell cycle | SPC110, CIN8
CLB1/CLB6/BBP1 | CLB1,6
cytokinesis | CTS1, EGT2
development | PAM1, GIC2
pheromone response | FUS3, FAR1
conjugation | CIK1, KAR3
sporulation/meiosis | SPO11, SPO19
response to oxidative stress | GDH3, HYR1
stress/heat shock | HSP104, SSA4

Candidate regulator (column entries in slide order):
GCN4; ARG80/81; ARG80/81/UME6/RPD3; ARO80; GCN4/HAP1/HAP2; LEU3, GCN4; LYS14; CBF1, MET28, MET32; MET31, MET32; BAS1, BAS2, GCN4
RTG3; ECM22/UPC2; INO4; GCR1
THI2/THI3; THI2/THI3; GCR1; NRG1, MIG1; HAA1; RRPE-binding factor; PAC/RRPE-binding factors
HAC1, ROX1; RLM1; XBP1
RPN4; PHO4
HAP2/3/4/5; MAC1/RCS1/AFT1/PDR1/3
HIR1, HIR2; ECB; MCB; HCM1; FKH1; ACE2, SWI4
MATALPHA2, STE12; KAR4; NDT80; ROX1, MSN2, MSN4; MSN2, MSN4

Non-overlapping yeast gene expression clusters

[Figure: clustered heat map, 424 experiments; gene groups of 249 genes and 1,226 genes are marked.]

Chua et al., 2004

Analyzing clusters:

amino acid biosynthesis (p < 10^-14)
amino acid metabolism (p < 10^-14)

methionine metabolism (p = 1.07 × 10^-7)

Some web resources for promoter analysis:

YRSA (http://forkhead.cgb.ki.se/YRSA/define1.htm)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/fullanalysis.pl)

**http://area51.med.utoronto.ca/FUNSPEC.html

GO-Biological Process categories

Levels shown: Very Broad, Broad, Mid-level, Narrow

Very Broad: metabolism

Categories shown: eye pigment metabolism, eye morphogenesis, pigment metabolism, striated muscle contraction, ATP biosynthesis, vision, CNS development, insulin secretion

# annotated genes (mouse), in slide order: 163, 137, 21, 36, 25, 33, 34, 1548

development: 2341

GO-Biological Process hierarchy

eye pigment metabolism

eye morphogenesis

pigment metabolism

CNS development

metabolism

development

Other types of categorical annotations:

KEGG, EC numbers (describe biochemical “pathways”)

MIPS, YPD (yeast databases – older than GO)

Results of individual studies (localization, 2-hybrid screens, protein complexes, etc.)

Sequence motifs, structural domains (pfam, SMART)

Other people’s microarray clusters

etc.

**When testing clusters against many different types of categorical annotations, should consider correcting for multiple-testing, and also consider that categories are often not independent
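A common way to score the overlap between a cluster and an annotation category is the hypergeometric tail probability, followed by a correction for the number of categories tested. The sketch below uses Bonferroni correction, which is the simplest choice and is conservative when categories overlap. The function names and the toy numbers are assumptions for illustration, and this is not presented as the method used by any specific web tool.

```python
from math import comb


def enrichment_p(n_genome, n_category, n_cluster, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_genome:   genes considered in total
    n_category: genes annotated to the category
    n_cluster:  genes in the cluster
    n_overlap:  cluster genes that carry the annotation
    """
    total = comb(n_genome, n_cluster)
    upper = min(n_category, n_cluster)
    tail = sum(
        comb(n_category, i) * comb(n_genome - n_category, n_cluster - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy case: 10 genes, 4 in the category, cluster of 3 with 2 annotated.
p = enrichment_p(10, 4, 3, 2)
print(p, bonferroni([p, 0.5]))
```

When many related categories are tested, a raw P-value that looks striking can become unremarkable after correction, which is the point of the caution above.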

[Figure: cell schematic showing nucleus, mRNA and protein.]

Big questions:

To what degree are functional pathways coordinately regulated?

What controls the observed regulations?

Exploring mouse gene expression using Inkjet Oligonucleotide Arrays

• 22,000 oligos / 1 x 3 inches

• Sequence completely flexible

• Mouse “42K” array: NCBI GenomeScan predictions (“XM”) on mouse draft sequence

• Includes:
  25K with cDNA (75% of 18K RefSeq genes)
  30K with cDNA or EST
  12K potential new genes

**Wen Zhang

Exploring mouse gene expression using Inkjet Oligonucleotide Arrays

Collect 55 different mouse tissues from experts:

Janet Rossant
Jane Aubin
Derek van der Kooy
Michael Fehlings
Benoit Bruneau**

**Wen Zhang

Analyze mRNA levels on arrays (1 µg poly-A)

Analysis of 55 mouse tissues: QC

**Malina Bakowski, Blencowe lab

[Figure: QC panels for selected transcripts across tissues, with GAPDH. Transcript classes labelled: Unchar., cDNA, EST, Gene trap, Transcription factor, RNA binding/RS domain. Tissues (labelled on both panels): Testis, Olfactory bulb, Brain, Eye, ES, Skel. Muscle, Liver, Femur, Teeth, Placenta, Prostate, Lymph node, Spleen, Digit, Tongue, Trachea, Large intestine, Colon.]

Description | Accession
Hypothetical protein FLJ20519 | XM_131066.1
Testis nuclear RNA binding ptn (Tenr) | XM_124039.1
DEAD box polypeptide 4 (Ddx4) | XM_127536.1
Deleted in azoospermia-like (Dazl) | XM_123141.1
RIKEN cDNA 1700001N01 | XM_125027.1
LOC235045 | XM_134745.1
Sim. to serine protease inhibitor | XM_144364.1
RIKEN cDNA 1700067I02 | XM_132042.1
LOC245536 (LOC245536), mRNA | XM_159329.1
Hematopoietic cell transcript 1 | XM_125337.1
Chr 7 expressed (D7Wsu180e) | XM_124875.1
Sim. to orphan receptor (LOC215448) | XM_122095.1
Poly(rC) binding ptn. 3 (Pcbp3) | XM_122063.1
Voltage-dep. R-type Ca++ channel alpha-1E | XM_123530.1
Ataxin 2 binding protein 1 | XM_147994.1
Sim. to HuC | XM_134734.1
Ventral neuron-specific ptn 1 NOVA1 | XM_138026.1
Poly(rC) binding ptn 4 (Pcbp4) | XM_125213.1
LOC217874 | XM_127170.1
LOC239368 | XM_139399.1
Zinc finger protein 97 | XM_134010.1
RIKEN cDNA 2400008B06 | XM_134886.1
Metal-response element tx factor 2 (Mtf2) | XM_132195.1
LOC231661 | XM_132381.1
LOC231903 | XM_149717.1
Related to CG7582 (LOC232810) | XM_133152.1
RIKEN cDNA 1300006E06 | XM_128315.1
Sim. to protease (LOC211700) | XM_136425.1
Hypothetical protein FLJ22774 | XM_139234.1
RIKEN cDNA 5430427O21 | XM_132158.1
Nuclear RNA export factor 6 (Nxf6) | XM_142153.1
Sim. to serine protease inhibitor 14 | XM_147352.1
Sim. to serine protease inhibitor 13 | XM_122538.1
Hypothetical ZNF protein KIAA0961 | XM_145503.1
KIAA0215 gene product | XM_135809.1
LOC227582 | XM_149095.1
Sim. to HMG-BOX tx factor BBX | XM_147194.1
LOC214566 | XM_150017.1
FN5 protein (Fn5) | XM_147333.1
LOC229850 | XM_149402.1
LOC229555 | XM_130999.1
Ribonuclease L (Rnasel) | XM_136286.1
(2-5)oligo(A) synthetase 1A | XM_132373.1

Are functional pathways coordinately regulated?

Compiled annotations from 992 GO “Biological process” categories for 7,779 genes on the array(from EBI and MGI/JAX)

(considered only categories with >3 and <500 genes)

**GO evidence codes (and manual inspection) indicate that very few annotations are based purely on expression
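The category size filter described above is simple to express in code. The sketch below assumes annotations are held as a mapping from category name to a set of gene identifiers, and it optionally restricts each category to genes present on the array before counting. The data layout, names and toy categories are hypothetical.

```python
def filter_categories(annotations, on_array=None, min_genes=3, max_genes=500):
    """Keep categories with more than min_genes and fewer than max_genes genes.

    annotations: {category name: set of gene identifiers}
    on_array:    optional set of genes on the array; when given, only
                 those genes are counted and returned.
    """
    kept = {}
    for category, genes in annotations.items():
        members = set(genes) if on_array is None else set(genes) & set(on_array)
        if min_genes < len(members) < max_genes:
            kept[category] = members
    return kept


# Toy annotations: one category is too small, one passes.
toy = {
    "tiny category": {"g1", "g2", "g3"},
    "usable category": {"g1", "g2", "g3", "g4"},
}
print(sorted(filter_categories(toy)))
```

Very small categories give unstable statistics and very large ones are too general to be informative, which is the usual motivation for bounds of this kind.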

Gene expression reflects gene function

[Figure: heat map across 55 mouse tissues/samples with GO categories labelled; color scale: ratio over median, <1, 3, 7, >20.]

Categories labelled:
• Polyamine biosynthesis
• Oxidative phosphorylation
• Muscle contraction
• Epidermal differentiation
• Cell:cell adhesion
• Regulation of neurotransmitter levels
• Synaptic transmission
• Axonogenesis
• RNA splicing
• Cytokinesis
• Microtubule-based movement
• M phase
• Serine biosynthesis
• Pregnancy
• Fertilization
• Bone remodeling
• Skeletal development
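With single-sample measurements across many tissues there is no paired control, so each gene can be expressed relative to its own median across tissues. A minimal sketch of that "ratio over median" calculation follows; the function name and the example intensities are made up.

```python
from statistics import median


def ratio_over_median(levels):
    """Divide one gene's expression levels by their median across tissues.

    levels: positive expression values for a single gene, one per tissue.
    """
    m = median(levels)
    if m <= 0:
        raise ValueError("median must be positive to form a ratio")
    return [value / m for value in levels]


# Hypothetical gene measured in five tissues; the last is tissue-specific.
print(ratio_over_median([1.0, 2.0, 4.0, 8.0, 40.0]))
```

A gene expressed at a similar level everywhere gives ratios near 1 in every tissue, while a tissue-specific gene stands out as a few large ratios against a background near or below 1.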
