Microarrays
Wednesday, March 1, 2006
Dr. Tim Hughes
CCBR – 160 College St. – Room 1302
Outline:
• Microarray experiments
• Normalization
• Different types of microarrays
• Other applications besides expression profiling
• Clustering and interpretation
Suggested reading
Eisen et al., 1998
Hartigan, J.A., Clustering Algorithms, Wiley, New York and London (1975). My understanding is that it is no longer in print, but is available on CD.
Jain et al., ACM Computing Surveys, 31(3) 1999, “Data Clustering: a review”. (http://www.amk.alt-neustadt.at/diplom/papers/Clustering/p264-jain.pdf)
Hegde et al., A concise guide to cDNA microarray analysis. Biotechniques. 2000 Sep;29(3):548-50, 552-4, 556.
Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol. 2000 Apr;12(2):201-5.
www.accessexcellence.org/AB/GG/nucleic.html
Nucleic Acid Hybridization
Microarray expression profiling by 2-color assay (“cDNA arrays”)
Array: PCR products, 6250 yeast ORFs
Hybridized cDNAs: green = control, red = experiment
*Schena et al., 1995
“cDNA microarrays” are essentially dot-blots on glass slides
http://arrayit.com/Products/Printing/Stealth/stealth.html
• This slide was made with 16 pins
• 4.5 mm pin spacing matches 384-well plates (16 x 24)
• Done with robotics
• Slides usually coated with poly-lysine
• Spots are usually 100-150 microns
• Spot spacing is usually 200-300 microns
• Slides are 25 x 75 mm
• Easy to deposit 20K spots/slide
0.45 mm
Common ways to “label” nucleic acids
• Random priming of double-stranded DNA: reaction contains labelled nucleotides
• Poly-T primed cDNA synthesis: reaction contains labelled nucleotides
• Direct labelling (fluors only)
• Amplification: poly-T primer carrying a T7 promoter, “second strand” synthesis, then a T7 reaction that contains labelled nucleotides
[Diagram: poly-A RNA (AAAAAAAA) annealed to poly-T primers (TTTTTTTTTT, with or without a T7 promoter); asterisks (*) mark labelled positions.]
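Typical use of cDNA microarrays: “Internal” normalization using two colors
[Diagram: cDNA pools from a control sample and a treatment sample (drug, mutation), containing transcripts x, y, z in different amounts; each transcript is read out as up, down, unchanged, or not present.]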
Excitation and emission:
532 nm laser (green) excites Cy3. Cy3 is detected with an emission filter that passes 557-592 nm.
635 nm laser (red) excites Cy5. Cy5 is detected with an emission filter that passes 650-690 nm.
Both are detected by a photomultiplier tube.
http://www.jacksonimmuno.com/2001site/home/catalog/f-cy3-5.htm
Cy3 NHS Ester
Cy5 NHS Ester
http://www.ope-tech.com/doc/Cy5_structure.htm
The primary data: two grayscale TIFF files
http://www.axon.com/GN_GenePix4000.html
Cy3 channel (“green”)
Cy5 channel (“red”)
Image processing and normalization: what is microarray data?
Microarray data is summary information from image files that come out of the scanner.
Image processing: line up grids, flag bad spots, quantitate.
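As a rough illustration of what "quantitate" produces, the sketch below turns one spot's Cy5 and Cy3 intensities into the two numbers plotted on the slides that follow: log10(ratio) and log10(average intensity). The function name, the assumption that inputs are already background-subtracted, and the use of the geometric mean as the "average" are illustrative choices, not the scanner software's definitions.

```python
import math

def spot_summary(cy5, cy3):
    """Return (log10 ratio, log10 average intensity) for one spot.

    cy5, cy3: background-subtracted intensities (assumed > 0).
    The "average" is the geometric mean of the two channels, which is
    an assumption made for this sketch.
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive to take logs")
    log_ratio = math.log10(cy5 / cy3)
    log_avg = 0.5 * (math.log10(cy5) + math.log10(cy3))
    return log_ratio, log_avg

# Hypothetical spots: (Cy5 "red" = experiment, Cy3 "green" = control)
spots = {"x": (2000.0, 200.0), "y": (150.0, 1500.0), "z": (400.0, 400.0)}
summary = {name: spot_summary(red, green) for name, (red, green) in spots.items()}
```

A spot with equal signal in both channels (like "z") has a log ratio of 0; induced and repressed transcripts fall above and below that line.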
Looking at data from a single experiment
[Figure: two scatter plots of log10(expression ratio) versus log10(intensity), one for 3-AT vs. no drug and one for wild-type vs. wild-type. Legend: P-value < 0.01. Slides: 11120c01 - 11121c01 and 11857c01 - 11858c01. Both axes run from -2 to 2.]
[Figure: two scatter plots of log10(ratio) versus log10(average intensity); both axes run from -2 to 2.]
http://www.mathworks.com/access/helpdesk/help/toolbox/curvefit/ch_data7.html
Lowess smoothing: The names "lowess" and "loess" are derived from the term "locally weighted scatter plot smooth," as both methods use locally weighted linear regression to smooth data.
• Find spots
• Manual edit
• Quantitate
• Normalize (“Lowess smoothing”: locally weighted scatterplot smoothing)
• Confirm spots outside envelope
• Save data, images, spot map
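The normalization step can be sketched as follows: fit a locally weighted linear regression of log ratio on log intensity, then subtract the fitted trend so that the bulk of spots centers on a log ratio of zero at every intensity. This is a minimal pure-Python sketch, not the method of any particular package; it uses tricube weights and omits the robustness iterations of the full lowess procedure. The function names and the default `frac` are assumptions.

```python
def lowess_fit(x, y, frac=0.4):
    """Locally weighted linear regression, evaluated at every x[i].

    frac is the fraction of points used in each local fit.
    """
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))   # neighbours per local fit
    fitted = []
    for x0 in x:
        # The k-th smallest distance sets the local window half-width.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at the window edge.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(log_intensity, log_ratio, frac=0.4):
    """Subtract the intensity-dependent trend from the log ratios."""
    trend = lowess_fit(log_intensity, log_ratio, frac)
    return [r - t for r, t in zip(log_ratio, trend)]
```

This version costs O(n² log n) because every spot triggers a full sort, so it is only meant to show the idea; real arrays with tens of thousands of spots call for an optimized implementation.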
Selected tricks for processing and normalization
(1) High-pass spatial detrending. See: O. Shai, Q. Morris, and B.J. Frey, (2003) Spatial Bias Removal in Microarray Images, University of Toronto Technical Report PSI-2003-21, http://www.psi.utoronto.ca/~ofer/detrendingReport.pdf
(2) VSN – “Variance Stabilizing Normalization”. See:
Huber, W., Von Heydebreck, A., Sultmann, H., Poustka, A., & Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18, Suppl 1, S96-S104 (2002).
Q. Morris, B. Frey, O. Shai
Other types of arrays
Photolithographic arrays (Affymetrix)
Building up oligonucleotides on a surface:
http://www.affymetrix.com/technology/manufacturing/index.affx
Photolithographic arrays (Affymetrix)
aka “GeneChip”
Arrays are typically 25-mers, with “mismatch” control for specificity
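To make the “mismatch” idea concrete, here is a small sketch of a perfect-match (PM) / mismatch (MM) probe pair. It assumes the usual convention that the MM probe is identical to the PM probe except for the central base (the 13th of 25), which is swapped for its complement. The probe sequence, the function names, and the simple PM minus MM signal are illustrative assumptions, not Affymetrix's actual algorithm.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(perfect_match):
    """Return the MM partner: same sequence, central base complemented."""
    seq = perfect_match.upper()
    mid = len(seq) // 2          # index 12 (the 13th base) for a 25-mer
    return seq[:mid] + COMPLEMENT[seq[mid]] + seq[mid + 1:]

def specific_signal(pm_intensity, mm_intensity):
    """Crude specificity-corrected signal: PM minus MM, floored at zero."""
    return max(pm_intensity - mm_intensity, 0.0)

pm = "ACGTACGTACGTACGTACGTACGTA"   # hypothetical 25-mer
mm = mismatch_probe(pm)
```

The MM probe is meant to capture cross-hybridization, so a probe pair where MM is as bright as PM contributes little or no specific signal.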
Photolithographic arrays (Affymetrix)
Advantages:
Density is limited essentially by the 5 micron resolution of scanners (solution: larger arrays).
Well-developed protocols.
“Industry standard” (largely self-driven).
Disadvantages:
Not all probes work well. Affymetrix has evolved a complicated system to compensate for this, but even “believers” use at least four probes per gene, and usually more.
Single color.
Sample preparation typically requires amplification.
Single supplier; historically intellectual property issues. (i.e. comparisons)
Ink-jet arrays (Agilent)
• 25,000 oligos / 1 x 3 inches
• Sequence completely flexible
• 60-mers
[Diagram: nucleotides (A, C, G, T) deposited base by base to build up oligonucleotides.]
Hughes TR et al. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol. 2001 Apr;19(4):342-7.
Ink-jet arrays generally agree with spotted cDNA arrays
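Yeast IJS array: ~8 oligos per gene; Spo vs. SC
[Figure: two scatter plots comparing ink-jet array data with cDNA array data, one using multiple oligos per gene and one using a single oligo. Correlations shown: r = 0.96 and r = 0.97. Labelled points: HXT1, HXT3, HXT4.]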
Ink-jet arrays (Agilent)
Advantages:
User-specified sequences; “no questions asked”
Sensitivity and specificity are defined and exceed requirement for most expression profiling applications; no amplification required
Virtually every 60-mer is functional
Data correlates well with spotted cDNA arrays
Disadvantages:
Density currently limited to ~45,000 spots per array.
Single supplier (although a protocol is in press for making your own synthesizer!)
“Maskless” arrays (Nimblegen)
http://www.nimblegen.com/technology/manufacture.html
Advantages:
User-specified sequences.
Density is limited essentially by the 5 micron resolution of scanners.
Disadvantages:
New to arena. Performance in initial publication (Nuwaysir et al., Genome Research, 2002) suggests that sensitivity and specificity may be lower than that of Agilent arrays.
Single supplier – although all the parts are there for academics to build one.
Possible IP issues. Hybs are done in Iceland to bypass Affy IP. Nimblegen web site boasts of new partnership with Affymetrix.
Applications beyond expression profiling
• DNA copy number
• Genotyping
• Protein-DNA associations
• Molecular “Barcoding”
• Protein arrays
• Transformation arrays
Identifying DNA binding sites
Science 2000 Dec 22;290(5500):2306-9 Genome-wide location and function of DNA binding proteins.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA.
Analysis of multiple experiments
• Comparisons
• Clustering
• Predicting gene functions
• Finding promoter elements
Comparing data from two experiments
[Figure: scatter plots of log10(ratio) from pairs of experiments; axes run from -1.5 to 1.5. cup5 / WT versus vma8 / WT: r = 0.88 (labelled points: VMA8, CUP5). cup5 / WT versus mrt4 / WT: r = 0.09 (labelled points: CUP5, MRT4).]
The behavior of two genes over many experiments can be compared in the same fashion
scatter plot of ratios (intensity not displayed)
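The r values quoted for these scatter plots are correlation coefficients between two vectors of log ratios. A minimal sketch of the Pearson correlation is given below; the same function applies whether the two vectors are two experiments (compared across genes) or two genes (compared across experiments). The function name is an assumption.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of log ratios."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    if s_aa == 0 or s_bb == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return s_ab / math.sqrt(s_aa * s_bb)
```

Two experiments that perturb the same pathway should give r close to 1, while unrelated perturbations should give r near 0.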
2-D clustering
Step 1: cut experiments and transcripts falling below P-value and ratio thresholds
[Figure: heat map of experiment index versus transcript response index, 44 experiments x 407 genes. Color scale: fold repression to fold induction (-10, -5, -2, 1, 2, 5, 10).]
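Step 1 can be sketched as a simple filter: keep a transcript only if it passes both thresholds in at least some number of experiments, and keep an experiment only if at least some number of transcripts pass in it. The thresholds used here (2-fold, P < 0.01, at least one hit) and all names are assumptions for illustration; the slides do not state the exact cutoffs.

```python
import math

def significant(log_ratio, p_value, min_fold=2.0, max_p=0.01):
    """True if a measurement passes both the P-value and the ratio cutoff."""
    return p_value < max_p and abs(log_ratio) >= math.log10(min_fold)

def cut_matrix(ratios, pvalues, min_hits=1, min_fold=2.0, max_p=0.01):
    """Select genes and experiments worth clustering.

    ratios, pvalues: dict gene -> list of per-experiment values
    (log10 ratios and P-values, in the same experiment order).
    Returns (kept gene names, kept experiment indices).
    """
    genes = list(ratios)
    n_exp = len(ratios[genes[0]]) if genes else 0
    hit = {
        g: [significant(r, p, min_fold, max_p)
            for r, p in zip(ratios[g], pvalues[g])]
        for g in genes
    }
    keep_genes = [g for g in genes if sum(hit[g]) >= min_hits]
    keep_exps = [j for j in range(n_exp)
                 if sum(hit[g][j] for g in genes) >= min_hits]
    return keep_genes, keep_exps
```

Filtering first keeps the clustering from being dominated by transcripts and experiments that contain only noise.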
2-D clustering
Step 2: cluster experiments and transcripts
[Figure: clustered heat map of experiment index versus transcript response index. Color scale: fold repression to fold induction (-10, -5, -2, 1, 2, 5, 10). Labelled experiment groups: RHO O/X, PKC O/X, ste mutants, treatment with alpha-factor.]
Data from Roberts et al., Science (2000)
There are many types of clustering. One example: K-means (must choose K)
[Figure: K-means example with K = 10; panels labelled #1, #2, #3.]
See: Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol. 2000 Apr;12(2):201-5.
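For readers who have not seen K-means, a minimal sketch follows: pick K starting centers, assign every expression profile to its nearest center, move each center to the mean of its members, and repeat until the assignments stop changing. Squared Euclidean distance, the random initialization, and the optional `init` argument are assumptions made for this sketch; other distance measures (such as correlation) are common for expression data.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, n_iter=100, seed=0, init=None):
    """Cluster profiles (list of equal-length lists) into k groups.

    init: optional list of k starting centers; if omitted, k profiles
    are drawn at random using the given seed.
    Returns (labels, centers).
    """
    if init is None:
        centers = [list(p) for p in random.Random(seed).sample(profiles, k)]
    else:
        centers = [list(c) for c in init]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in profiles]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:               # an emptied cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

The result depends on both K and the starting centers, which is why K has to be chosen up front and why it is worth rerunning with different seeds.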
Basics of clustering freeware: Eisen’s “Cluster” and “Treeview”
Mike Eisen's web site: rana.lbl.gov/EisenSoftware.htm
“Cluster” loads an Excel file (save as tab-delimited text) in the following format:
Cluster
Treeview
(also: “TreeArrange” - http://monod.uwaterloo.ca/downloads/treearrange/)
There are also many commercial programs available.
[Diagram: cell schematic (nucleus, mRNA, protein) alongside the analysis flow: microarray expression data; co-regulated groups of genes; functional categories; predict functions of new genes; cis, trans regulators.]
Non-overlapping yeast gene expression clusters (Chua et al., 2004)
[Figure: clustered expression data over 424 experiments; gene axis labels: 249 genes, 1,226 genes.]

Cluster label                   Sample genes
amino acid metabolism           TRP4, HIS3
arginine biosynthesis           ARG1, ARG3
arginine catabolism             CAR1, CAR2
aromatic AA metabolism          ARO9, ARO10
asparagine biosynthesis         ASN1, ASN2
branched chain AA synth         ILV1,2,3,6
lysine biosynthesis             LYS2, LYS9
methionine biosynthesis         MET3,16,28
sulfur AA tnsprt, metab         MUP1, MHT1
adenine biosynthesis            ADE1,4,8
aldehyde metabolism             AAD4,14,16
biotin biosynthesis             BIO3,4
citrate metabolism              CIT1,2
ergosterol biosynthesis         ERG1,5,11
fatty acid biosynthesis         FAS1, FAS2
gluconeogenesis                 PGK1, TDH1,2,3
NAD biosynthesis                BNA4,6
one-carbon metabolism           GCV1,2,3
pyridoxine metabolism           SNO1, SNZ1
thiamin biosynthesis 1          THI5,12
thiamin biosynthesis 2          THI2,20
hexose transport                HXT4, GSY1
sodium ion transport            ENA1,2,5
polyamine transport             TPO2,3
nucleocytoplasmic transport     KAP123, NUP100
ribosome/RNA biogenesis         MAK16, CBF5
ribosomal proteins              RPS1A, RPL28
translational elongation        TEF1,2
protein folding                 SSA1, HSP60
secretion                       VTH1, KRE11
protein glycosylation           ALG6, CAX4
vesicle-mediated transport      VPS5, IMH1
proteasome                      RPN6, RPT5
vacuole fusion                  VTC1,3,4, PHO84
mitoribosome/respiration        MRPL1, MRPS5
Mitochond. electron trans.      ATP1, COX4
iron transport/TCA cycle        FRE1, FET3
Chromatin/transcription         SNF2, CHD1, DOT6
histones                        HTA1, HHF1
MCM2/3/6/CDC47                  MCM2,3,6
DNA replication                 RFA1, POL12
mitotic cell cycle              SPC110, CIN8
CLB1/CLB6/BBP1                  CLB1,6
cytokinesis                     CTS1, EGT2
development                     PAM1, GIC2
pheromone response              FUS3, FAR1
conjugation                     CIK1, KAR3
sporulation/meiosis             SPO11, SPO19
response to oxidative stress    GDH3, HYR1
stress/heat shock               HSP104, SSA4

Candidate regulator (entries in slide order, one group per line; some clusters have no entry, and the flattened text does not preserve which cluster each entry belongs to):
GCN4; ARG80/81; ARG80/81/UME6/RPD3; ARO80; GCN4/HAP1/HAP2; LEU3, GCN4; LYS14; CBF1, MET28, MET32; MET31, MET32; BAS1, BAS2, GCN4
RTG3; ECM22/UPC2; INO4; GCR1
THI2/THI3; THI2/THI3; GCR1; NRG1, MIG1; HAA1; RRPE-binding factor; PAC/RRPE-binding factors
HAC1, ROX1; RLM1; XBP1
RPN4; PHO4
HAP2/3/4/5; MAC1/RCS1/AFT1/PDR1/3
HIR1, HIR2; ECB; MCB; HCM1; FKH1; ACE2, SWI4
MATALPHA2, STE12; KAR4; NDT80; ROX1, MSN2, MSN4; MSN2, MSN4
Analyzing clusters:
amino acid biosynthesis (p < 10^-14)
amino acid metabolism (p < 10^-14)
methionine metabolism (p = 1.07 × 10^-7)
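P-values like these answer the question: if a cluster of this size were drawn at random from the genome, how often would it contain at least this many genes from the category? One common way to compute that is the hypergeometric tail, sketched below. The slides do not say which test produced the numbers above, so treat the choice of test, the function name, and the example counts as assumptions.

```python
from math import comb

def enrichment_p(n_genome, n_category, n_cluster, n_overlap):
    """P(overlap >= n_overlap) when n_cluster genes are drawn at random
    from n_genome genes, n_category of which carry the annotation."""
    upper = min(n_category, n_cluster)
    total = comb(n_genome, n_cluster)
    tail = sum(
        comb(n_category, i) * comb(n_genome - n_category, n_cluster - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 6000 genes, 40 in the category,
# a 30-gene cluster containing 10 of them.
p = enrichment_p(6000, 40, 30, 10)
```

Exact integer arithmetic via `math.comb` avoids underflow in the intermediate terms, which matters when the p-value is very small.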
Some web resources for promoter analysis:
YRSA (http://forkhead.cgb.ki.se/YRSA/define1.htm)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/fullanalysis.pl)
**http://area51.med.utoronto.ca/FUNSPEC.html
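GO-Biological Process categories
[Figure: example GO Biological Process categories arranged by breadth (Very Broad, Broad, Mid-level, Narrow), with the number of annotated genes (mouse) for each. Very Broad: metabolism, development (2341). Other categories shown: eye pigment metabolism, eye morphogenesis, pigment metabolism, striated muscle contraction, ATP biosynthesis, vision, CNS development, insulin secretion. Other gene counts shown, in slide order: 163, 137, 21, 36, 25, 33, 34, 1548.]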
GO-Biological Process hierarchy
[Figure: hierarchy relating the categories eye pigment metabolism, eye morphogenesis, pigment metabolism, CNS development, metabolism, and development.]
Other types of categorical annotations:
KEGG, EC numbers (describe biochemical “pathways”)
MIPS, YPD (yeast databases – older than GO)
Results of individual studies (localization, 2-hybrid screens, protein complexes, etc.)
Sequence motifs, structural domains (pfam, SMART)
Other people’s microarray clusters
etc.
**When testing clusters against many different types of categorical annotations, one should consider correcting for multiple testing, and also consider that categories are often not independent
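Two standard corrections are sketched below as an illustration: Bonferroni (multiply each p-value by the number of tests) and Benjamini-Hochberg (control the false discovery rate). Neither is named in the slides, so the choice is an assumption. Benjamini-Hochberg in this basic form assumes independent or positively dependent tests, so overlapping categories are exactly the situation where its results need some caution.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * (number of tests), capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With hundreds of categories tested against dozens of clusters, an uncorrected p of 0.01 is expected to turn up many times by chance alone, which is why only very small p-values are convincing.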
Big questions:
To what degree are functional pathways coordinately regulated?
What controls the observed regulations?
Exploring mouse gene expression using Inkjet Oligonucleotide Arrays
• 22,000 oligos / 1 x 3 inches
• Sequence completely flexible
• Mouse “42K” array: NCBI GenomeScan predictions (“XM”) on mouse draft sequence
• Includes:
  25K with cDNA (75% of 18K RefSeq genes)
  30K with cDNA or EST
  12K potential new genes
**Wen Zhang
Exploring mouse gene expression using Inkjet Oligonucleotide Arrays
Collect 55 different mouse tissues from experts:
Janet Rossant, Jane Aubin, Derek van der Kooy, Michael Fehlings, Benoit Bruneau**
**Wen Zhang
Analyze mRNA levels on arrays (1 µg poly-A)
Analysis of 55 mouse tissues: QC
**Malina Bakowski, Blencowe lab
[Figure: two panels over the same tissue columns: Testis, Olfactory bulb, Brain, Eye, ES, Skel. Muscle, Liver, Femur, Teeth, Placenta, Prostate, Lymph node, Spleen, Digit, Tongue, Trachea, Large intestine, Colon. Group labels: Unchar., cDNA, EST, Gene trap, Transcription factor, RNA binding/RS domain.]
Description (in slide order):
Hypothetical protein FLJ20519
Testis nuclear RNA binding ptn (Tenr)
DEAD box polypeptide 4 (Ddx4)
Deleted in azoospermia-like (Dazl)
RIKEN cDNA 1700001N01
LOC235045
Sim. to serine protease inhibitor
RIKEN cDNA 1700067I02
LOC245536 (LOC245536), mRNA
Hematopoietic cell transcript 1
Chr 7 expressed (D7Wsu180e)
Sim. to orphan receptor (LOC215448)
Poly(rC) binding ptn. 3 (Pcbp3)
Voltage-dep. R-type Ca++ channel -1E
Ataxin 2 binding protein 1
Sim. to HuC
Ventral neuron-specific ptn 1 NOVA1
Poly(rC) binding ptn 4 (Pcbp4)
LOC217874
LOC239368
Zinc finger protein 97
RIKEN cDNA 2400008B06
Metal-response element tx factor 2 (Mtf2)
LOC231661
LOC231903
Related to CG7582 (LOC232810)
RIKEN cDNA 1300006E06
Sim. to protease (LOC211700)
Hypothetical protein FLJ22774
RIKEN cDNA 5430427O21
Nuclear RNA export factor 6 (Nxf6)
Sim. to serine protease inhibitor 14
Sim. to serine protease inhibitor 13
Hypothetical ZNF protein KIAA0961
KIAA0215 gene product
LOC227582
Sim. to HMG-BOX tx factor BBX
LOC214566
FN5 protein (Fn5)
LOC229850
LOC229555
Ribonuclease L (Rnasel)
(2-5)oligo(A) synthetase 1A

Accession (in slide order):
XM_131066.1, XM_124039.1, XM_127536.1, XM_123141.1, XM_125027.1, XM_134745.1, XM_144364.1, XM_132042.1, XM_159329.1, XM_125337.1,
XM_124875.1, XM_122095.1, XM_122063.1, XM_123530.1, XM_147994.1, XM_134734.1, XM_138026.1, XM_125213.1, XM_127170.1, XM_139399.1,
XM_134010.1, XM_134886.1, XM_132195.1, XM_132381.1, XM_149717.1, XM_133152.1, XM_128315.1, XM_136425.1, XM_139234.1, XM_132158.1,
XM_142153.1, XM_147352.1, XM_122538.1, XM_145503.1, XM_135809.1, XM_149095.1, XM_147194.1, XM_150017.1, XM_147333.1, XM_149402.1,
XM_130999.1, XM_136286.1, XM_132373.1

Also shown: GAPDH
Are functional pathways coordinately regulated?
Compiled annotations from 992 GO “Biological process” categories for 7,779 genes on the array (from EBI and MGI/JAX)
(considered only categories with >3 and <500 genes)
**GO evidence codes (and manual inspection) indicate that very few annotations are based purely on expression
Gene expression reflects gene function
[Figure: expression across 55 mouse tissues/samples for genes grouped by GO category. Color scale: ratio over median (<1, 3, 7, >20). Categories labelled: Polyamine biosynthesis, Oxidative phosphorylation, Muscle contraction, Epidermal differentiation, Cell:cell adhesion, Regulation of neurotransmitter levels, Synaptic transmission, Axonogenesis, RNA splicing, Cytokinesis, Microtubule-based movement, M phase, Serine biosynthesis, Pregnancy, Fertilization, Bone remodeling, Skeletal development.]
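With single-sample tissue measurements there is no paired control channel, so each gene's level in a tissue is expressed relative to that gene's median across all tissues. A minimal sketch of that "ratio over median" calculation is below; the function name and the example values are hypothetical.

```python
from statistics import median

def ratio_over_median(levels):
    """levels: dict tissue -> expression level for one gene (positive).

    Returns dict tissue -> level divided by the gene's median level
    across all tissues.
    """
    med = median(levels.values())
    if med <= 0:
        raise ValueError("median must be positive to form a ratio")
    return {tissue: value / med for tissue, value in levels.items()}

# Hypothetical gene that is strongly testis-enriched.
example = ratio_over_median({"Testis": 200.0, "Brain": 10.0, "Liver": 10.0})
```

A gene expressed at similar levels everywhere has ratios near 1 in all tissues, while a tissue-specific gene stands out with a ratio well above 1 in just a few samples.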