GeneExpression II:
1. Transcription Factor Binding Sites
2. Microarrays
26th May, 2010
Karsten Hokamp, Genetics Department
GeneExpression II 1BI2010
TFBS prediction - Overview
• Introduction
• Methods
• Implementations
• Analyse 2kb upstream of eve
TFBS prediction - Introduction
• TFBS = DNA motifs
– 5–20 bp long
– variable
– multiple occurrences/sites per gene
– combination of activators and repressors
• cis-regulatory regions = clusters of TFBS, found from -20 kb to the first intron
TFBS prediction - Introduction
Example: MSE2 strip for eve (D. melanogaster) (Janssens et al., 2006)
• understand transcriptional regulation
• infer regulatory networks
TFBS prediction - Methods
• De novo motif prediction (overrepresentation)
• Searching for known motifs
• Phylogenetic Footprinting/Shadowing
• Clustering of TFBSs
• Integration of external data sources (co-expression, structure)
TFBS prediction - Overview
[Figure: Hannenhalli (2008, Bioinformatics)]
De novo motif prediction
• Search for over-represented motifs
• Frequency count
• Works well for yeast and prokaryotes
• Not so successful in higher organisms
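The frequency-count idea can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any tool named in these slides: the toy sequences, the word length k and the simple foreground/background ratio with a pseudocount are all assumptions made for the example.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented(foreground, background, k, pseudocount=1.0):
    """Rank k-mers by their frequency in foreground relative to background."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        # The pseudocount avoids division by zero for k-mers absent from the background
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up upstream sequences that share the word "TGACTC"
promoters = ["AATGACTCGG", "CCTGACTCAT", "TGACTCTTAA"]
controls = ["ACGTACGTAC", "GGCCTTAAGG", "ATATCGCGAT"]
ranking = overrepresented(promoters, controls, k=6)
top_kmer = ranking[0][0]
```

With real promoters the background would normally be a large set of unrelated upstream regions, and a proper statistical test would replace the raw ratio.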
Using motif databases
• Search for known motifs
• Position-specific scoring matrix (PSSM) or position weight matrix (PWM)
• Databases:
– TRANSFAC
– JASPAR
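Scanning a sequence with a PWM can be sketched as below. The count matrix is hypothetical (it is not taken from TRANSFAC or JASPAR), and the uniform background of 0.25 per base and the pseudocount of 1 are simplifying assumptions.

```python
import math

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds weights."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of an upper-case A/C/G/T sequence; return (start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

# Hypothetical counts for a 4 bp motif resembling "TGAC"
counts = [
    {"T": 8, "A": 1, "C": 1},
    {"G": 9, "A": 1},
    {"A": 10},
    {"C": 7, "T": 3},
]
pwm = pwm_from_counts(counts)
hits = scan("AATGACTT", pwm)
best_position, best_score = max(hits, key=lambda h: h[1])
```

In practice a score threshold decides which windows are reported as putative sites; choosing it trades sensitivity against false positives.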
Phylogenetic-based methods
• Search for islands of highly conserved regions
• Footprinting: elements conserved across distant species
• Shadowing: elements conserved between closely related species
• Pros: increases specificity
• Cons: conservation is neither sufficient nor necessary
Practical:
• Try some tools on the 2 kb upstream sequence of D. melanogaster eve and compare with published results:
– Alibaba (de novo)
– Match (TRANSFAC)
– MEME (de novo)
– PROMO (TRANSFAC)
– WeederH (phylogenetic footprinting)
Other tools:
• Many more tools available for download:
– Sombrero
– FootPrinter
– PhyloGibbs
• Other Web-tools for groups of co-regulated genes:
– RSAT
– NestedMICA
– WebMOTIFS
TFBS prediction - Conclusion:
• No single tool gives accurate results
• Combination of predictions from multiple tools might increase specificity
• Incorporate additional information for greater precision
Microarrays - Overview
• Introduction
• Data Generation
• Data Characteristics
• Diagnostic Plots
• Preprocessing
• Statistical Analysis
What is a microarray?
• A solid support onto which the sequences from thousands of different genes are immobilized
• Different probe types:
– short oligonucleotides
– long oligonucleotides
– cDNA
• Different array supports:
– glass slide
– nylon membrane
– silicon chip
• Each probe measures the expression of a single transcript
Microarrays – How do they work?
Affymetrix arrays: single colour
[Figure: uninfected cells + infected cells → RNA → reverse transcription, label with dye → cDNA → hybridize → Slide A, Slide B]
Microarrays – How do they work?
Spotted arrays: two colour
[Figure: prepare sample (uninfected cells + infected cells) and prepare microarray → hybridize target to microarray]
Microarray: Subgrids
• One pin per subgrid (printTip group, stratus)
Microarrays – Data Extraction
• How to get data from the slides into the computer?
Data Extraction – Scanning
[Figure: slide → scanner → images (TIFF): channel 1 (green), channel 2 (red), composite (green, yellow, red)]
• Scanner settings:
– laser power
– sensitivity
– focus
Data Extraction – Quantification
• Align grid, tag unreliable spots
• Program assigns numbers representing intensity of spot
• Software:
– ImaGene
– GenePix
– ScanAlyze
– ...
Data file (FG = foreground, BG = background):
Spot ID   FG CH1   BG CH1   FG CH2   BG CH2   FL
GFP       1241     671      6707     713      1
PA0080    570      495      599      384      0
PA0080    691      632      667      651      0
PA0122    703      610      653      619      0
PA0122    708      598      695      602      0
...       ...      ...      ...      ...      ...
Quantification: Intensity Range
• area composed of pixels
• value range: 0 to 2^16 - 1 (0 to 65535)
• saturation possible
• low intensities = noise
Data Generation – Summary
• RNA labelling and hybridization
• Array scanning
• One image per channel
• Load into quantification software
• Flag flawed spots
• Extract values
• Text file with FG and BG intensities (per probe)
Microarrays – Sources of Variation
[Figure: Sample1 mRNA and Sample2 mRNA → RT → Cy3-cDNA and Cy5-cDNA → cDNA array → Cy3 and Cy5 intensity → .tiff image files → raw data file. Systematic experimental error: wavelength dependent, intensity dependent, uneven hybridization, print-tip variations, background variations, image processing algorithm-dependent. Source: www.tigr.org]
Microarrays – Sources of Variation
• Technical:
– labelling
– hybridization
– slide quality
– scanning
– print-tip effect
– quantification
– experimenter
• Biological:
– individual/strain/sample
– environment
– time point
Microarrays – Data Characteristics
• Intensities vs. ratios
• Natural scale vs. log scale
Intensities vs. Ratios
• Intensities:
        ch1     ch2
gene1   517     2100
gene2   3200    13000
gene3   3200    800
gene4   12000   3000
ratio = ch2 / ch1
Intensities vs. Ratios
• Ratios:
        ch1     ch2     ratio
gene1   517     2100    4.06
gene2   3200    13000   4.06
gene3   3200    800     0.25
gene4   12000   3000    0.25
ratio = ch2 / ch1
– ratio > 0
– ratio = 1 if ch1 = ch2
Intensities vs. Ratios
• Ratios
– convey expression changes
– hide base level differences
• But: absolute changes can be important, too!
Graphical Representation: Signal Scatter Plot
[Figure: scatter plot of CH2: Cy5 (y axis) against CH1: Cy3 (x axis), both axes marked at 3000 and 18000, with the line ratio = 1 and the following spots]
        ch1     ch2
spot1   517     2100
spot2   3200    13000
spot3   3200    800
spot4   12000   3000
Graphical Representation: Signal Scatter Plot
[Figure: scatter plot of CH2: Cy5 against CH1: Cy3 with the line ratio = 1; annotation: ~ 10x]
Graphical Representation: Histogram
[Figure: histogram of ratios; x axis: ratios (1 marked), y axis: frequency]
Raw vs. Log ratios
• Log transformation
x = base^y, here x = 2^y
8 = 2^3
0.125 = 2^-3
y undefined for x <= 0
Ratios:
raw     log
0.1     -3.3
0.5     -1
1       0
2       1
10      3.3
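The effect of the log transformation can be checked on the gene1 to gene4 intensities from the earlier slides. The sketch below assumes base 2, as in the table of raw and log ratios; the function names are illustrative.

```python
import math

# (ch1, ch2) intensities from the example slides
intensities = {
    "gene1": (517, 2100),
    "gene2": (3200, 13000),
    "gene3": (3200, 800),
    "gene4": (12000, 3000),
}

def ratio(ch1, ch2):
    """Raw ratio as defined on the slides: ch2 / ch1."""
    return ch2 / ch1

def log_ratio(ch1, ch2):
    """log2 of the raw ratio; undefined if either intensity is <= 0."""
    return math.log2(ch2 / ch1)

ratios = {gene: ratio(*pair) for gene, pair in intensities.items()}
log_ratios = {gene: log_ratio(*pair) for gene, pair in intensities.items()}
```

On the raw scale, four-fold up and four-fold down are about 4.06 and 0.25, which sit at very different distances from 1. On the log2 scale they are roughly +2 and -2, symmetric around 0.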
Log ratios: scatter plot
[Figure: two scatter plots. Raw scale: CH2: Cy5 against CH1: Cy3, with the line ratio = 1. Log scale: CH2: log2(Cy5) against CH1: log2(Cy3), with the line log-ratio = 0]
Log ratios: histogram
[Figure: histograms of ratios (1 marked) and of log-ratios; y axis: frequency]
Microarrays – Data Characteristics
• ratios vs. intensities
– convey expression changes
– hide base level differences
• log ratios vs. raw ratios
– reduce spread
– provide symmetry
Diagnostic plots
• histogram
• scatter plot
• box plot
• MA plot
• chip visualization
Diagnostic plots – Histogram
[Figure: histograms of log(CH1) and log(CH2) frequencies; examples labelled good and bad]
Diagnostic plots – Scatter plot
[Figure: scatter plots; examples labelled o.k. and bad]
Diagnostic plots – MA plot
• Rotate scatter plot by ~ 45 degrees
Diagnostic plots – MA plot
• Mathematically:
M = log2(R) – log2(G)   (M for Minus)
A = 0.5 * ( log2(R) + log2(G) )   (A for Addition)
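A minimal sketch of the M and A calculation for a single spot follows. It assumes R is the Cy5 (CH2) intensity and G the Cy3 (CH1) intensity, in line with ratio = ch2 / ch1 on the earlier slides. The helper for a 2-fold cut-off is an illustrative addition: a 2-fold change in either direction corresponds to |M| >= 1.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)          # log ratio ("Minus")
    a = 0.5 * (math.log2(red) + math.log2(green))  # mean log intensity ("Addition")
    return m, a

def exceeds_two_fold(m):
    """True if the log2 ratio corresponds to at least a 2-fold change."""
    return abs(m) >= 1.0

m, a = ma_values(13000, 3200)  # gene2 from the intensity example
```

Both intensities must be greater than zero, otherwise the logarithm is undefined.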
Diagnostic plots – MA plot
[Figure: MA plot; x axis: A, y axis: M]
2-fold cut-off
Dye Swap
[Figure: Cy3-cDNA and Cy5-cDNA with unequal labeling efficiency; plot of M = log(R/G) against A = ½ log(RG); annotation: strong bias towards Cy3!]
Dye Swap
[Figure: cDNA from uninfected cells + infected cells, hybridized twice (Cy5 vs Cy3 and Cy3 vs Cy5) → merged data set]
Dye Swap
[Figure: plots of M = log(R/G) against A = ½ log(RG) for Cy3-cDNA and Cy5-cDNA with unequal labeling efficiency, with dye swap]
Diagnostic plots – Box plot
[Figure: annotated box plot]
• median
• lower quartile, upper quartile
• inter-quartile range
• whiskers: 1.5 times inter-quartile range
• outliers
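The quantities shown in a box plot can be computed as in the sketch below. It assumes the common convention that values beyond 1.5 times the inter-quartile range from the quartiles are outliers. The "inclusive" quartile method is one of several definitions, and the input values are made up.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, inter-quartile range, whisker limits and outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return {
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "iqr": iqr,
        "whisker_limits": (low_limit, high_limit),
        "outliers": outliers,
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Plotting libraries may use a slightly different quartile definition, so the numbers can differ a little between tools.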
Diagnostic plots – Box plot
[Figure: box plots; examples labelled o.k. and bad]
Diagnostic plots – Box plot (printtip)
Diagnostic plots – Chip visualization
[Figure: chip images; examples labelled good and bad]
Diagnostic plots: Summary
• histogram
– data distribution (intensities, ratios)
• scatter plot
– dye effect, print-tip effect
• box plot
– equal average ratio and distribution, print-tip effect
• MA plot
– dye effect and intensity-dependent ratio
• chip visualization
– spatial bias, scratches, bubbles, smears
Microarrays – Preprocessing
• Flagging
• Background correction
• Normalization
• Flawed slides: discard and repeat
Microarrays – Flagging
• Skip or keep (but warn)
• e.g. skip low intensities and saturated spots
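Flagging can be sketched as a simple per-channel rule. The saturation limit follows from the 16-bit value range (2^16 - 1 = 65535). The minimum signal of 100 above background is a hypothetical threshold chosen only for illustration.

```python
SATURATION = 2**16 - 1  # 65535, the largest value a 16-bit scan can report

def flag_spot(fg, bg, min_signal=100):
    """Classify one channel of one spot as 'saturated', 'low' or 'ok'."""
    if fg >= SATURATION:
        return "saturated"
    if fg - bg < min_signal:
        # Signal barely above background: likely noise
        return "low"
    return "ok"

# FG CH1 and BG CH1 values from the example data file
flags = {
    "GFP": flag_spot(1241, 671),
    "PA0080": flag_spot(570, 495),
}
```

Whether flagged spots are skipped or kept with a warning is a separate decision, as the slide notes.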
Microarrays – Background correction
• Subtract background measurements from foreground intensities
• Brings intensities closer to zero, increases ratios:
– example spot with five-fold upregulation: 500 / 100 = 5
– subtract background (50) from both channels: 450 / 50 = 9
• Additional source of variance!
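The arithmetic of the example spot can be reproduced with a short function. Returning None when a corrected intensity is zero or negative is an assumption made for this sketch; analysis packages handle that case in different ways.

```python
def background_corrected_ratio(fg_red, bg_red, fg_green, bg_green):
    """Ratio of background-subtracted intensities, or None if not defined."""
    red = fg_red - bg_red
    green = fg_green - bg_green
    if red <= 0 or green <= 0:
        # Background at or above foreground: ratio is not meaningful
        return None
    return red / green

uncorrected = background_corrected_ratio(500, 0, 100, 0)
corrected = background_corrected_ratio(500, 50, 100, 50)
```

The same background of 50 removes half of the weaker channel but only a tenth of the stronger one, which is why the ratio grows from 5 to 9.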
Microarrays – Normalization
• Remove effects of intensity, dye bias, spatial bias or print-tip variations:
– Global mean, median
– Loess, lowess
– Print-tip loess
– 2D loess
– Variance stabilization (VSN)
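The simplest method in this list, global median normalization, can be sketched as follows. It assumes that most genes are unchanged, so the median log2 ratio should be 0. The input values are made up. Loess-based methods additionally model the dependence of M on A and are not shown here.

```python
import statistics

def global_median_normalize(m_values):
    """Shift log2 ratios so that their median becomes 0."""
    shift = statistics.median(m_values)
    return [m - shift for m in m_values]

raw_m = [0.5, 0.75, 1.0, 1.25, 3.0]   # log2 ratios with an overall bias of about +1
normalized_m = global_median_normalize(raw_m)
```

A single global shift corrects an overall dye bias but cannot remove intensity-dependent or print-tip effects.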
Microarrays – Normalization
[Figure: MA plots (M against A): raw, global mean, LOESS, printTip LOESS]
Microarrays – Normalization
[Figure: raw, global mean, LOESS, printTip LOESS]
Microarrays – Discard and repeat
• Some slides turn out to be uncorrectable and need to be repeated (unless a sufficient number of replicates remains).
• Remember: bad data in = bad data out!
Microarrays – Statistical Analysis
• Replicates
• Variation
• t-tests
• multiple-testing correction
• gene lists
Statistical Analysis – Replicates
• Two types of repeats
• Technical:
– multiple copies of probes on array
– multiple repeats of hybridization (same RNA)
• Biological:
– multiple hybridizations with RNA from multiple extractions
Need replicates to measure variation!
Statistical Analysis – Variation
• Biological variation different from technical
• Statistically incorrect to mix
• Important consideration for repeats: high confidence in results for
a) one sample/patient/colony
b) group of samples/patients/colonies
Prioritise biological repeats!
Statistical Analysis – t-tests
Different classes of samples:
- find genes that are affected by a treatment
- p-value = degree of evidence
- H0: expression does not change
- t-test requires at least 2 replicates, provides p-value for each gene
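The test statistic for one gene can be sketched with the standard library alone. The version below is Welch's unequal-variance form of the two-sample t statistic, chosen here as an assumption because the slides do not say which variant is meant. The replicate values are made up. Turning t into a p-value requires the t distribution, which a statistics package would provide.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Two-sample t statistic allowing unequal variances.

    Each group needs at least 2 replicates, otherwise the sample
    variance is undefined and statistics.variance raises an error.
    If both groups have zero variance, the division fails.
    """
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

control = [1.0, 2.0, 3.0]   # hypothetical log2 expression values, 3 replicates
treated = [4.0, 5.0, 6.0]
t_statistic = welch_t(control, treated)
```

The larger the absolute value of t, the stronger the evidence against H0 for that gene.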
Statistical Analysis – multiple-testing correction
Carrying out t-tests on 10,000 genes: even if no gene changes, an average of 500 will have p-value <= 0.05
Methods for multiple testing:
- Bonferroni (very strict)
- Benjamini-Hochberg false-discovery rate (FDR)
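Both corrections can be written as short functions that return adjusted p-values. This is a sketch of the standard procedures, not the code of any particular package, and the p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Expected number of genes with p <= 0.05 if none of 10,000 genes changes
expected_false_positives = 10_000 * 0.05

p_values = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(p_values)
fdr = benjamini_hochberg(p_values)
```

Bonferroni controls the chance of even one false positive, which is why it is very strict. Benjamini-Hochberg instead controls the expected proportion of false positives among the genes called significant.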
Statistical Analysis – Gene lists
• List of good candidate genes to follow up
• False positives (FP) vs false negatives (FN)
• Fold-change vs p-value
Choice depends on downstream analysis
Input for downstream analysis: clustering, pathway analysis, enrichment, etc.
Analysis tools
• Stand-alone tools:
– R
– BioConductor
– ArrayNorm
– TM4
– GeneSpring (commercial)
• Web-based tools:
– ArrayPipe
– ExpressYourself
– GenePublisher
– GEPAS
– GeneTraffic (commercial)
Public Repositories
• ArrayExpress
– EBI, MIAME-compliant
• Gene Expression Omnibus (GEO)
– NCBI
– "world's first write-only database"
Summary
• Many sources of variance
• Large numbers of replicates required for reliable results
• Data: be aware of flaws/bias
• Flagging/discarding results in data loss
• Correction often possible but can introduce artifacts
• However: microarrays can still help in making great discoveries!
END