Data Preprocessing

Kjell Petersen, Introduction to Microarray Technology, September 2010, microarray.no


Page 1: Data Preprocessing

Page 2: The microarray pipeline

Page 3: Workflow microarray experiment

• Problem-driven experimental design
• Wet-lab experiments
– RNA labelling
– Microarrays: hybridisation, washing, scanning
– Image analysis: gridding, feature extraction
• Raw data
• Data pre-processing
– Filtering
– Normalisation
– Transformation
– Missing values, …
• Gene expression table
• Secondary data analysis
– Differential expression
– Pattern recognition
– Functional characterisation
– Knowledge integration
• Biological interpretation or discovery

Page 4: Overview

• Get to know the steps from raw data to the processed gene expression matrix
• These steps
– remove known biases
– make samples more comparable
– reduce noise in the data set
• Filtering
• Log transformation
• Normalisation – “straightening things out”
• Missing value replacement – “filling in the holes”

Page 5: Data pre-processing

Convert raw intensities to meaningful biological data:

• Summarise image analysis results
• Assess quality of resulting data
• Remove bias from technical sources

[Figure: image analysis result files (one per sample: 1, 2, 3, ..., i) pass through pre-processing and are combined into one gene expression table with columns 1, 2, 3, 4, ...]
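As a rough illustration of how per-sample result files end up in one table, here is a minimal sketch. The sample names, gene IDs and the `build_expression_table` helper are hypothetical, and the per-sample values are assumed to be already summarised to one signal per gene.

```python
def build_expression_table(samples):
    """Combine per-sample {gene: signal} dicts into a genes x samples table.

    Genes that were not measured in a sample get None (a missing value).
    """
    genes = sorted({gene for values in samples.values() for gene in values})
    names = list(samples)
    table = {gene: [samples[name].get(gene) for name in names] for gene in genes}
    return names, table


# Hypothetical image analysis results, one dict per sample.
samples = {
    "sample1": {"geneA": 120.0, "geneB": 800.0},
    "sample2": {"geneA": 150.0, "geneB": 760.0, "geneC": 90.0},
}
names, table = build_expression_table(samples)
for gene, row in table.items():
    print(gene, row)
```

Each row is one gene and each column one sample, which is the layout the later steps (normalisation, imputation) work on.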

Page 6: Gene expression table

[Figure: gene expression table with genes as rows and tissues / conditions as columns; each cell holds the normalised signal (i.e. expression level) of one gene in one tissue]

Page 7: Preprocessing vs quality control

• Array level
– Assess each spot and its surroundings
– Make plots of arrays before and after preprocessing
• Experiment level
– Compare all arrays to identify outliers and batch effects
– Make plots of the dataset before and after preprocessing

Page 8: Filtering

Used to remove spots that would bias the results or add random noise to them

In particular low-intensity spots (based on signal-to-noise ratio) and bad-quality spots (saturated signal, dust, ...)

Removing bad-quality spots introduces “missing values” for some of the genes; these can be estimated from the remaining good-quality spots (e.g. LSimpute)

Can be used as quality control: the percentage of spots lost indicates overall array quality

Some software allows weights on spots instead of filtering. This leaves no missing values
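The filtering idea can be sketched as follows. This is only an illustration: the saturation limit, the signal-to-noise cut-off, the use of foreground/background as the signal-to-noise ratio and the background subtraction are all assumptions, not values taken from the slides.

```python
import math

SATURATION = 65535  # assumed 16-bit scanner maximum
MIN_SNR = 2.0       # assumed signal-to-noise cut-off


def filter_spots(spots):
    """Return (signals, percent_lost); rejected spots become NaN.

    Each spot is (foreground, background, flagged_by_image_software).
    """
    signals = []
    for fg, bg, flagged in spots:
        snr = fg / bg if bg > 0 else math.inf
        bad = flagged or fg >= SATURATION or snr < MIN_SNR
        signals.append(math.nan if bad else fg - bg)
    lost = sum(math.isnan(value) for value in signals)
    return signals, 100.0 * lost / len(spots)


spots = [
    (1200.0, 100.0, False),   # good spot
    (150.0, 100.0, False),    # too weak: ratio 1.5 is below the cut-off
    (65535.0, 120.0, False),  # saturated
    (900.0, 90.0, True),      # flagged by the image analysis software
]
signals, percent_lost = filter_spots(spots)
print(signals, percent_lost)
```

The percentage of spots lost doubles as a simple quality indicator for the array, and the NaN entries are the missing values that imputation would later fill in.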

Page 9: Quality measures of signals

• How good are the foreground and background measurements?
– Spot size
– Circularity measure
– Uniformity
– Population outlier
– Spot intensity relative to background
• Based on these measurements, one can flag a spot
• Different image analysis software packages use different measures to flag potentially unreliable spots

Page 10: Categories of spots to filter

• Controls
• Saturated spots
• Poor quality (flags)
• Too weak spots
• Less common to filter per array
• Can also filter genes later in gene expression matrix
– Variance etc
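Filtering genes later in the gene expression matrix could look like the sketch below, which drops genes whose values barely change across samples. The gene names, values and the variance threshold are made up for illustration.

```python
from statistics import variance


def filter_by_variance(table, min_variance):
    """Keep only genes whose sample variance across arrays reaches the threshold."""
    return {
        gene: values
        for gene, values in table.items()
        if variance(values) >= min_variance
    }


# Hypothetical log2 expression values, one list per gene (one value per sample).
table = {
    "geneA": [8.0, 8.1, 7.9, 8.0],  # almost constant
    "geneB": [6.0, 9.0, 5.5, 9.5],  # clearly varying
}
kept = filter_by_variance(table, min_variance=0.5)
print(list(kept))
```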

Page 11: Log transformation

• It is preferred to log2-transform both one-channel and two-channel data.
• Theoretical reasons: better foundation for statistics
• Practical reasons:
– Both small and large intensities are visible in the same overview plots (otherwise only a few high-intensity genes tend to dominate)
– For ratio data: yields a scale symmetric about 0
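A small worked example of the symmetry argument, with made-up intensities: a four-fold increase and a four-fold decrease give raw ratios of 4 and 0.25, but log2 ratios of +2 and -2.

```python
import math


def log2_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)


up = log2_ratio(400.0, 100.0)    # four-fold up: raw ratio 4.0
down = log2_ratio(100.0, 400.0)  # four-fold down: raw ratio 0.25
print(up, down)
```

On the raw scale all down-regulation is squeezed between 0 and 1, while up-regulation is unbounded; on the log2 scale both directions are equally far from 0.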

Page 12: Log transformation

Before log transformation:

• Variance proportional to signal
• Data are extreme value distributed

After log transformation:

• Less variance at higher intensities
• The distribution has moved towards a normal distribution (in an ideal world perhaps the data would be normally distributed?)

Page 13: Normalisation

• Within-slide normalisation
– To account for systematic variation in measured dye intensity due to hybridisation preferences
• Between-slide normalisation
– To ensure slides are comparable
– Can be performed on both one- and two-channel arrays

Page 14: Normalisation - Median

• Assumption: changes are roughly symmetric
• First panel: smoothed density of log2 G and log2 R
• Second panel: M vs A plot with the median set to zero
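Median normalisation can be sketched with the usual M-A definitions, M = log2(R/G) and A = ½·log2(R·G); these definitions are the standard convention rather than something spelled out on the slide, and the intensities are made up.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A): log2 ratio and average log2 intensity per spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def median_normalise(m):
    """Shift all M values so that their median becomes zero."""
    centre = median(m)
    return [value - centre for value in m]


red = [200.0, 400.0, 800.0, 1600.0, 400.0]
green = [100.0, 200.0, 200.0, 400.0, 400.0]
m, a = ma_values(red, green)
m_norm = median_normalise(m)
print(m_norm)
```

The shift is one constant for the whole array, so it corrects an overall dye imbalance but not an intensity-dependent one.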

Page 15: Lowess normalisation

Locally weighted least squares regression

Assumption: variation in the data is intensity dependent.

Smooths the intensity function.

Typically applied to M-A plots

[Figure: M-A plot before and after lowess normalisation]
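To make the idea concrete, here is a simplified sketch of a locally weighted linear fit applied to an M-A plot. It uses tricube weights and a single pass, without the robustness iterations of full lowess; the `frac` default and the test data are assumptions for illustration only.

```python
def lowess_fit(x, y, frac=0.5):
    """Fit a locally weighted straight line at every x and return the fitted y."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours used per local fit
    fitted = []
    for x0 in x:
        # The distance to the k-th nearest point sets the local bandwidth.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        # Tricube weights: close points count most, points beyond h count zero.
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalise(a, m, frac=0.5):
    """Subtract the intensity-dependent trend from M."""
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Made-up data with a purely intensity-dependent bias: M = 0.5 * A - 3.
a = [float(v) for v in range(6, 15)]
m = [0.5 * v - 3.0 for v in a]
m_norm = lowess_normalise(a, m)
print([round(v, 6) for v in m_norm])
```

Unlike the median shift, the correction here depends on A, so a curved or tilted M-A cloud is straightened along the whole intensity range.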

Page 16: Between array normalisation

We assume that the intensity distributions across arrays are the same

This is not always (never?) the case

Intensity distributions need to be similar for the arrays to be comparable

Page 17: Quantile normalisation
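The slide text does not describe the procedure, so the following is a sketch of the commonly used form of quantile normalisation: each array's k-th smallest value is replaced by the mean of the k-th smallest values across all arrays. It assumes arrays of equal length without missing values, and breaks ties by position, which is a simplification.

```python
def quantile_normalise(columns):
    """Give every array (column) the same intensity distribution.

    columns: list of equal-length lists, one per array.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest values across arrays.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by rank
        new_col = [0.0] * n
        for rank, i in enumerate(order):
            new_col[i] = reference[rank]
        result.append(new_col)
    return result


# Made-up log intensities for three arrays and four genes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(arrays)
for col in normalised:
    print(col)
```

After this step the arrays contain exactly the same set of values; only the ranking of genes within each array is kept from the original data.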

Page 18: Missing Value Replacement

• Find correlated rows and columns
• Borrow information to estimate the missing value
• LSimpute adaptive (Bø et al)

[Figure: gene expression matrix with samples as columns and genes as rows]
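The idea of borrowing information from similar rows can be illustrated with a simple nearest-neighbour sketch. This is not LSimpute, which uses least squares regression; it is a swapped-in, simpler technique, and the matrix values and `k` are made up.

```python
import math


def impute_knn(matrix, k=2):
    """Fill None entries with the mean of the k most similar genes (rows).

    Similarity is the root mean squared difference over the samples
    (columns) that both genes have observed.
    """
    def distance(row_a, row_b):
        pairs = [
            (a, b) for a, b in zip(row_a, row_b)
            if a is not None and b is not None
        ]
        if not pairs:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only genes that have a value in this sample can lend information.
            candidates = [
                (distance(row, other), other[j])
                for r, other in enumerate(matrix)
                if r != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Rows are genes, columns are samples; None marks a filtered spot.
matrix = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [9.0, 8.0, 7.0],
]
filled = impute_knn(matrix, k=2)
print(filled[0])
```

The dissimilar fourth gene is ignored, so the estimate comes only from the two genes whose profiles resemble the incomplete one.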

Page 19: Questions before summary?

Page 20: Summary

• Filtering – remove bad quality spots
• Log 2 transformation – variance stabilisation
• Normalisation – remove systematic variation and make arrays comparable
• Imputation – estimate missing values
• Look at box plots / density plots and scatter plots before and after preprocessing

Page 21: Questions

Page 22: Acknowledgements

• Most slides borrowed from
– Christine Stansberg
– Anne-Kristin Stavrum