Kjell Petersen
Introduction to Microarray technology
September 2010
microarray.no
Data Preprocessing
The microarray pipeline
Workflow microarray experiment
Problem-driven experimental design
RNA labelling
Microarrays: hybridisation, washing, scanning
Image analysis: gridding, feature extraction
Raw data
Wet-lab experiments
1. Data pre-processing: filtering, normalisation, transformation, missing values, …
Gene expression table
2. Secondary data analysis: differential expression, pattern recognition, functional characterisation, knowledge integration
3. Biological interpretation or discovery
Overview
• Get to know the steps from raw data to the processed gene expression matrix
• These steps
– remove known biases
– make samples more comparable
– reduce noise in the data set
• Filtering
• Log transformation
• Normalization – “straightening things out”
• Missing Value replacement - “filling in the holes”
Data pre-processing
Convert raw intensities to meaningful biological data:
• Summarise image analysis results
• Assess quality of resulting data
• Remove bias from technical sources
[Figure: image analysis result files (one per sample: 1, 2, 3, ..., i) → pre-processing → gene expression table (columns 1, 2, 3, 4, ...)]
Gene expression table
[Figure: matrix with genes as rows and tissues / conditions as columns; each cell holds the normalised signal (i.e. expression level) of one gene in one tissue]
Preprocessing vs quality control
• Array level
– Assess each spot and surroundings
– Make plots of arrays before and after preprocessing
• Experiment level
– Comparing all arrays to identify outliers and batch effects
– Make plots of dataset before and after preprocessing
Filtering
• Used to remove spots that will bias or add random noise to the results
• In particular low-intensity spots (based on signal-to-noise ratio) or bad-quality spots (saturated signal, dust, ...)
• Removing bad-quality spots introduces "missing values" for some of the genes; these can be estimated from the remaining good-quality spots (e.g. LSimpute)
• Can be used as quality control: the percentage of spots lost indicates overall array quality
• Some software allows the use of weights on spots instead of filtering, which leaves no missing values
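The filtering rules above can be sketched in a few lines. This is a minimal illustration, not the method of any particular image analysis package: the foreground/background ratio stands in for a signal-to-noise measure, and the names `filter_spots`, `percent_lost`, the threshold `min_ratio` and the 16-bit saturation ceiling are all assumptions made for the example.

```python
# Assumed ceiling of a 16-bit scanner; real scanners and software may differ.
SATURATION = 65535

def filter_spots(spots, min_ratio=2.0, saturation=SATURATION):
    """Replace weak or saturated spots with None (a missing value).

    spots: list of (foreground, background) intensity pairs (hypothetical layout).
    """
    kept = []
    for foreground, background in spots:
        too_weak = foreground < min_ratio * background
        saturated = foreground >= saturation
        kept.append(None if (too_weak or saturated) else foreground)
    return kept

def percent_lost(filtered):
    """Percentage of spots removed; a rough indicator of array quality."""
    return 100.0 * sum(value is None for value in filtered) / len(filtered)

spots = [(1200, 100), (150, 120), (65535, 200), (800, 90)]
filtered = filter_spots(spots)
lost = percent_lost(filtered)
```

Here the second spot is too close to background and the third is saturated, so both become missing values that a later imputation step would have to fill in.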
Quality measures of signals
• How good are foreground and background measurements?
– Spot size
– Circularity measure
– Uniformity
– Population outlier
– Spot intensity relative to background
• Based on these measurements, one can flag a spot
• Different image analysis software packages use different measures to flag potentially unreliable spots
Categories of spots to filter
• Controls
• Saturated spots
• Poor quality (flags)
• Too weak spots
• Less common to filter per array
• Can also filter genes later in gene expression matrix
– Variance etc
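Filtering genes later, in the gene expression matrix, could look like the sketch below. It assumes the matrix is held as a dictionary from gene name to a list of log-scale values (one per sample); the function name `filter_low_variance` and the threshold are hypothetical choices for illustration.

```python
from statistics import pvariance

def filter_low_variance(matrix, min_variance):
    """Keep genes (rows) whose variance across samples is at least min_variance."""
    return {
        gene: values
        for gene, values in matrix.items()
        if pvariance(values) >= min_variance
    }

# Hypothetical expression values: geneA is nearly flat, geneB varies.
matrix = {
    "geneA": [8.0, 8.1, 7.9],
    "geneB": [5.0, 9.0, 7.0],
}
kept = filter_low_variance(matrix, min_variance=0.5)
```

Only `geneB` survives here, since a gene that barely changes across samples carries little information for downstream analysis.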
Log transformation
• It is preferred to log2-transform both one-channel and two-channel data.
• Theoretical reasons: better foundation for statistics
• Practical reasons:
– Both small and large intensities are visible in the same overview plots (otherwise only a few high-intensity genes tend to dominate)
– For ratio data: yields a scale symmetric about 0
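The symmetry point for ratio data can be checked with a small example. The intensities are made up; `log2_ratio` is a hypothetical helper name, with R and G denoting the red and green channel intensities.

```python
import math

def log2_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)

up = log2_ratio(400, 100)    # 4-fold up: raw ratio 4.0
down = log2_ratio(100, 400)  # 4-fold down: raw ratio 0.25
```

On the raw scale the two changes sit at 4.0 and 0.25, which are not equally far from 1; on the log2 scale they become +2 and -2, equally far from 0.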
Log transformation
• Raw data: variance proportional to signal; data are extreme-value distributed
• Log-transformed data: less variance at higher intensities; the distribution has moved towards a normal distribution (in an ideal world perhaps the data would be normally distributed?)
Normalisation
• Within slide normalisation
– To account for systematic variation in measured dye intensity due to hybridisation preferences
• Between slide normalisation
– To ensure slides are comparable
– Can be performed on both 1 and 2 channel arrays
Normalisation - Median
• Assumption: Changes roughly symmetric
• First panel: smooth density of log2G and log2R.
• Second panel: M vs A plot with median put to zero
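A minimal sketch of median normalisation on an M vs A representation follows. It uses the usual definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2; the function names and the intensities are invented for the example.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) per spot: M is the log ratio, A the mean log intensity."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def median_normalise(m):
    """Shift all log ratios so that their median becomes zero."""
    centre = median(m)
    return [value - centre for value in m]

# Hypothetical intensities in which the red channel is about 2x too bright.
red = [256, 1024, 2048, 512, 4096]
green = [128, 512, 512, 512, 2048]
m, a = ma_values(red, green)
m_norm = median_normalise(m)
```

The shift is one constant for the whole array, which is why the method relies on the assumption that changes are roughly symmetric: most genes are taken to be unchanged, so the median log ratio should be zero.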
Lowess normalisation
• Locally weighted least squares regression
• Assumption: variation in the data is intensity dependent
• Smooths the intensity function
• Typically applied to M-A plots
[Figure: M-A plots before and after lowess normalisation]
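The idea can be sketched as a local linear fit of M on A with tricube weights, followed by subtracting the fitted trend. This is a simplified stand-in written for illustration: it omits the robustness iterations of full lowess, and `lowess_fit`, `frac` and the data are assumptions rather than any package's API.

```python
def lowess_fit(x, y, frac=0.5):
    """Fit a weighted straight line around each x and return the fitted y values.

    frac is the fraction of points used as the local neighbourhood.
    """
    n = len(x)
    k = max(2, int(round(frac * n)))
    fitted = []
    for x0 in x:
        distances = sorted(abs(xi - x0) for xi in x)
        h = distances[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical M-A data with a purely intensity-dependent bias: M = 0.5 * A - 4.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m = [0.5 * ai - 4.0 for ai in a]
trend = lowess_fit(a, m, frac=0.75)
m_norm = [mi - ti for mi, ti in zip(m, trend)]
```

Because the bias in this toy data is entirely a function of A, subtracting the fitted trend leaves log ratios that are essentially zero at every intensity.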
Between array normalisation
• We assume that the intensity distributions across arrays are the same
• This is not always (never?) the case
• Intensity distributions need to be similar for the arrays to be comparable
Quantile normalisation
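One common way to force the arrays onto the same intensity distribution is sketched below: each value is replaced by the mean, across arrays, of the values sharing its rank. The function name and the data are hypothetical, and ties are handled naively.

```python
def quantile_normalise(arrays):
    """arrays: list of equal-length lists, one list of intensities per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean of the values at each rank across arrays.
    reference = [
        sum(values[rank] for values in sorted_arrays) / len(arrays)
        for rank in range(n)
    ]
    normalised = []
    for values in arrays:
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalised.append(out)
    return normalised

# Three hypothetical arrays measuring the same four genes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(arrays)
```

After this step every array contains exactly the same set of values, and only the order of the genes within each array differs.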
Missing Value Replacement
• Find correlated rows and columns
• Borrow information to estimate the missing value
• LSimpute adaptive (Bø et al.)
[Figure: expression matrix with samples as columns and genes as rows]
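The "borrow information from correlated rows" idea can be illustrated with a deliberately simplified sketch: find the complete gene most correlated with the gene that has a hole, then predict the hole by least squares regression. This is not the LSimpute algorithm itself, only an illustration in its spirit; the function name and the data are assumptions.

```python
def impute_from_best_row(matrix, row, col):
    """Estimate matrix[row][col] from the most correlated complete row.

    matrix: list of rows (genes), each a list of values with None for missing.
    Returns None when no usable row is found.
    """
    target = matrix[row]
    shared = [j for j, value in enumerate(target) if value is not None]
    best = None
    for i, other in enumerate(matrix):
        if i == row or any(other[j] is None for j in shared + [col]):
            continue
        x = [other[j] for j in shared]
        y = [target[j] for j in shared]
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        if sxx == 0 or syy == 0:
            continue  # a constant row carries no correlation information
        r = sxy / (sxx * syy) ** 0.5
        if best is None or abs(r) > best[0]:
            slope = sxy / sxx
            best = (abs(r), my + slope * (other[col] - mx))
    return None if best is None else best[1]

# Hypothetical matrix: rows are genes, columns are samples.
matrix = [
    [1.0, 2.0, 3.0, None],  # gene with a missing value in the last sample
    [2.0, 4.0, 6.0, 8.0],   # strongly correlated gene
    [5.0, 1.0, 4.0, 2.0],   # weakly correlated gene
]
estimate = impute_from_best_row(matrix, row=0, col=3)
```

The first gene tracks the second at half its level, so the regression predicts 4.0 for the missing entry.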
Questions before summary?
Summary
• Filtering – remove bad quality spots
• Log 2 transformation – variance stabilisation
• Normalisation – remove systematic variation and make arrays comparable
• Imputation – estimate missing values
• Look at box plots / density plots and scatter plots before and after preprocessing
Questions
Acknowledgements
• Most slides borrowed from
– Christine Stansberg
– Anne-Kristin Stavrum