microarray and ngs data analysis with chipster€¦ · 2. static images produced by analysis tools...
TRANSCRIPT
28.-30.10.2014
Eija Korpelainen, Massimiliano Gentile
Microarray and NGS data analysis with
Chipster
Course outline
Introduction to Chipster
Microarray data analysis
• Importing microarray data to Chipster
• Normalization
• Quality control (also clustering is covered here)
• Filtering
• Statistical testing
• Pathway analysis
• Saving and sharing workflows
Next generation sequencing data analysis (RNA-seq)
• Quality control
• Preprocessing
• Alignment
• Differential expression analysis
Introduction to Chipster
Provides an easy access to over 300 analysis tools
• No programming or command line experience required
Free, open source software
What can I do with Chipster?
• analyze and integrate high-throughput data
• visualize data efficiently
• share analysis sessions
• save and share automatic workflows
Chipster
Interactive visualizations
Analysis tool overview
140 NGS tools for
• RNA-seq
• miRNA-seq
• exome/genome-seq
• ChIP-seq
• FAIRE/DNase-seq
• MeDIP-seq
• CNA-seq
• Metagenomics (16S rRNA)
140 microarray tools for
• gene expression
• miRNA expression
• protein expression
• aCGH
• SNP
• integration of different data
60 tools for sequence analysis
• BLAST, EMBOSS, MAFFT
• Phylip
Tools for QC, processing and mapping
FastQC
PRINSEQ
FastX
TagCleaner
Trimmomatic
Bowtie
TopHat
BWA
Picard
SAMtools
BEDTools
RNA-seq tools
Quality control
• RseQC
Counting
• HTSeq
• eXpress
Transcript discovery
• Cufflinks
Differential expression
• edgeR
• DESeq
• Cuffdiff
• DEXSeq
Pathway analysis
• ConsensusPathDB
miRNA-seq tools
Differential expression
• edgeR
• DESeq
Retrieve target genes
• PicTar
• miRBase
• TargetScan
• miRanda
Pathway analysis for targets
• GO
• KEGG
Correlate miRNA and target expression
Exome/genome-seq tools
Variant calling
• Samtools
Variant filtering
• VCFtools
Variant annotation
• AnnotateVariant (Bioconductor)
ChIP-seq and DNase-seq tools
Peak detection
• MACS
• F-seq
Peak filtering
• P-value, no of reads, length
Detect motifs, match to JASPAR
• MotIV, rGADEM
• Dimont
Retrieve nearby genes
Pathway analysis
• GO, ConsensusPathDB
MeDIP-seq tools
Detect methylation, compare two conditions
• MEDIPS
CNA-seq tools
Count reads in bins
• Correct for GC content
Segment and call CNA
• Filter for mappability
• Plot profiles
Group comparisons
Clustering
Detect genes in CNA
GO enrichment
Integrate with expression
Metagenomics / 16 S rRNA tools
Taxonomy assignment with Mothur package
• Align reads to 16 S rRNA template
• Filter alignment for empty columns
• Keep unique aligned reads
• Precluster aligned reads
• Remove chimeric reads
• Classify reads to taxonomic units
Statistical analyses using R
• Compare diversity or abundance between groups using several
ANOVA-type of analyses
Technical aspects
Client-server system
• Enough CPU and memory for large analysis jobs
• Centralized maintenance
Easy to install
• Client uses Java Web Start
• Server available as a virtual machine
Acknowledgements to users and contibutors
Chipster: General functionality
Select data
Select tool category
Select tool (set parameters if necessary) and click run
View results
Mode of operation
Analysis history is saved automatically -you can add tool source code to reports if needed
Task manager
You can run many analysis jobs at the same time
Use Task manager to
• view status
• cancel jobs
• view time
• view parameters
Analysis sessions
In order to continue your work later, you have to save the
analysis session.
Saving the session will save all the files and their
relationships. The session is packed into a single .zip file and
saved on your computer (in the next Chipster version you can
also save it on the server).
Session files allow you to continue the work on another
computer, or share it with a colleague.
You can have multiple analysis sessions saved separately, and
combine them later if needed.
Workflow panel
Shows the relationships of the files
You can move the boxes around, and zoom in and out.
Several files can be selected by keeping the Ctrl key down
Right clicking on the data file allows you to
• Save an individual result file (”Export”)
• Delete
• Link to another data file
• Save workflow
Workflow – reusing and sharing your
analysis pipeline
You can save your analysis steps as a reusable automatic
”macro”, which you can apply to another dataset
When you save a workflow, all the analysis steps and their
parameters are saved as a script file, which you can share with
other users
Saving and using workflows
Select the starting point for your
workflow
Select ”Workflow/ Save starting
from selected”
Save the workflow file on your
computer with a meaningful name
• Don’t change the ending (.bsh)
To run a workflow, select
• Workflow->Open and run
• Workflow->Run recent (if you
saved the workflow recently).
Problems? Send us a support request -request includes the error message and link to analysis
session (optional)
Data visualizations
Visualizing the data
Data visualization panel
• Maximize and redraw for better viewing
• Detach = open in a separate window, allows you to view several images at the same time
Two types of visualizations
1. Interactive visualizations produced by the client program
• Select the visualization method from the pulldown menu
• Save by right clicking on the image
2. Static images produced by analysis tools
• Select from Analysis tools/ Visualisation
• View by double clicking on the image file
• Save by right clicking on the file name and choosing ”Export”
Interactive visualizations by the client
Spreadsheet
Histogram
Venn diagram
Scatterplot
3D scatterplot
Volcano plot
Expression profiles
Clustered profiles
Hierarchical clustering
SOM clustering
Genome browser
Available actions:
• Select genes and create a gene list
• Change titles, colors etc
• Zoom in/out
Static images produced by R/Bioconductor
Box plot
Histogram
Heatmap
Idiogram
Chromosomal position
Correlogram
Dendrogram
NMDS plot
QC stats plot
RNA degradation plot
K-means clustering
SOM-clustering
etc
Importing microarray data to Chipster
Importing raw data
Affymetrix CEL-files are recognized by Chipster automatically
You can import Illumina GenomeStudio files to Chipster as is, if all the
samples are in one file
• Need columns AVG, BEAD_STDERR, Avg_NBEADS and DetectionPval
• Note: Use lumi normalization for data imported this way
You can import any tab delimited files (e.g. Agilent) using the Import tool
Import tool
Step 1: Define title row, header and footer
Import tool
Step 2: Define columns
Which columns to mark in Import tool?
http://chipster.csc.fi/manual/import-help.html
Agilent
• Identifier (ProbeName)
• Sample (rMeanSignal or rMedianSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal or gMedianSignal)
• Control background (gBGMedianSignal)
• Flag (Control type)
Illumina BeadStudio version 3 file and GenomeStudio files
• Identifier (ProbeID)
• Sample (text “AVG”)
Illumina BeadStudio version 1-2 file
• Identifier (TargetID)
• Sample (text “AVG”)
1-color 2-color
Importing normalized data
The data should be tab delimited and preferably log-transformed
• If your data is not log-transformed, you can transform it with the tool
“Change interpretation”
Bring the data file in using the Import tool. Mark the identifier column
and all the sample columns.
Run the tool Normalize / Process prenormalized. This
• Converts data to Chipster format by adding ”chip.” to expression
column names
• Creates the phenodata file. Indicate chiptype using names given at
http://chipster.csc.fi/manual/supported-chips.html
Exercise 1. Import Illumina data in two ways
Import Illumina data directly and normalize with the lumi tool
• Select File / Import files.
• Select the file IlluminaHuman6v1_BS1.tsv.
• In the Import files -window choose the action "Import directly“
• Select the file and the tool Normalization/ Illumina – lumi pipeline.
Set the chiptype parameter to Human and click Run.
Import the same file using the Import tool and normalize
• In the Import files -window choose the action “Use Import tool”
• Click the Mark header –button and paint the header rows.
• Click the Mark title row –button and click on the title row of the data.
• Click Next. Click the Identifier –button and click on the TargetID
column. Click the Sample –button and click on all the AVG columns.
• Select the 8 files and tool Normalization/ Illumina. Set parameters: • Illumina software version = BeadStudio1
• identifier type = TargetID and chiptype = Human-6v1.
Microarray data analysis flow chart
Normalization
Normalization methods for different arrays
Affymetrix
• Background correction + expression estimation +
summarization
• RMA (Robust Multichip Averaging) uses only PM probes,
fits a model to them, and gives out expression values after
quantile normalization and median polishing
Agilent
• Background correction + averaging duplicate spots +
normalization
Illumina
• Background correction (in GenomeStudio) + normalization
NOTE: After normalization the expression values are in log2-scale
• Hence a fold change of 2 means 4-fold up, -2 means 4-fold
down, etc
Normalization of Affymetrix data
Normalization = background correction + expression estimation + summarization
Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but takes also GC% content into account
• MAS5 is the older Affymetrix method, Plier is a newer one
• Li-Wong is the method implemented in dChip
Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier (the other methods log2-transform the data,
which corrects for the same phenomenon)
Custom chiptype
• Because some of the Affymetrix probe-to-transcript mappings are not correct,
probes have been remapped in the Bioconductor project. To use these
remappings (alt CDF environments), select the matching chiptype from the
Custom chiptype menu.
Normalization of Agilent data
Background treatment often generates negative values, which are
coded as missing values after log2-transformation.
• Usual subtract option does this
• Using normexp + offset 50 will not generate negative values,
and it gives good estimates (the best method reported)
Loess removes curvature from the data (recommended)
Before After
Agilent normalization parameters in Chipster
Background correction
• Background treatment
None, Subtract, Edwards, Normexp
• Background offset
0 or 50
Normalize chips
• None, median, loess
Normalize genes
• None, scale (to median), quantile
• not needed for statistical analysis
Chiptype
• You must give this information in order to use annotation-based
tools later
Illumina normalization in Chipster: 2 options
Normalization / Illumina
• Normalization method None, scale, quantile, vsn (variance stabilizing normalization)
• Illumina software version GenomeStudio or BeadStudio3, BeadStudio2, BeadStudio1
• Chiptype
• Identifier type Target ID, Probe ID (for BeadStudio version 3 data and newer)
Normalization / Illumina - Lumi pipeline
• Transformation none, vst (variance stabilizing transformation), log2
• Normalize chips none, rsn (robust spline normalization), loess, quantile, vsn
• Chiptype human, mouse, rat
• Background correction none, bgAdjust.Affy
Quantile normalization procedure
Sample A Sample B Sample C
Gene 1 20 10 350
Gene 2 100 500 200
Gene 3 300 400 30
Sample A Sample B Sample C Median
Quantile 1 20 10 30 20
Quantile 2 100 400 200 200
Quantile 3 300 500 350 350
Sample A Sample B Sample C Median
Quantile 1 20 20 20 20
Quantile 2 200 200 200 200
Quantile 3 350 350 350 350
Sample A Sample B Sample C
Gene 1 20 20 350
Gene 2 200 350 200
Gene 3 350 200 20
1. Raw data 2. Rank data within sample and calculate median intensity for each row
3. Replace the raw data of each row with its median (or mean) intensity 4. Restore the original gene order
Checking normalization
Exercise 2: Normalize Illumina data
Import folder IlluminaTeratospermiaHuman6v1_BS1 using the Import
tool like in exercise 1.
In the workflow view, click on the box ”13 files” to select all of them
Select the tool Normalization / Illumina. Set parameters so that
• Illumina software version = BeadStudio1
• identifier type = TargetID
• chiptype = Human-6v1
Repeat the run as before, but change
• Normalization method = none
• This ”mock-normalization” gives you unnormalized data that you can use
as a comparison when looking at the normalization effect later in the QC
exercise.
Phenodata: Describing the experimental
setup
Phenodata file
Experimental setup is described with a phenodata file, which is created during
normalization
Fill in the group column with numbers describing your experimental groups
• e.g. 1 = control sample, 2 = cancer sample
• necessary for the statistical tests to work
• note that you can sort a column by clicking on its title
How to describe pairing, replicates, time, etc?
You can add columns to the phenodata file
Time
• Use either real time values or recode with group codes
Replicates
• All the replicates are coded with the same number
Pairing
• Pairs are coded using the same number for each pair
Phenodata for prenormalized data
If you bring in previously created normalized data and phenodata:
• Choose ”import directly” in the Import tool
• Right click on normalized data, choose ”Link to phenodata”
If you brought in normalized data and need to create phenodata for it:
• Use Import tool to bring the data in
• Use the tool Normalize / Process prenormalized to create
phenodata
• Remember to give the chiptype
• Fill in the group column
Exercise 3: Describe the experiment
Double click the phenodata file of the real normalization
In the phenodata editor, enter 1 in the group column for the control
samples and 2 for the teratospermia affected samples.
For the interest of visualizations later on, give shorter names for the
samples in the Description column
• Name the teratospermia samples t1, t2,….t8
• Name the controls samples c1, c2 ,..c5
Quality control at array level
Quality control tools in Chipster
Quality control category (array level QC)
• Affymetrix basic (RNA degradation + Affy QC)
• Affymetrix RLE and NUSE fit a model to expression values
• Illumina (density plot + boxplot)
• Agilent 2-color (MA-plot + density plot + boxplot)
• Agilent 1-color (density plot + boxplot)
Visualization category (experiment level QC)
• Dendrogram
• Correlogram
Statistics category (experiment level QC)
• Non-metric multidimensional scaling (NMDS)
• Principal components analysis (PCA)
QC for Affymetrix data
Affymetrix QC metrix
RNA degradation
Spike-in controls linearity
RLE (relative log expression)
NUSE (normalized unscaled standard error plot)
Affymetrix QC metrix
Blue area shows where scaling
factors are less than 3-fold of the
mean.
•If the scaling factors or ratios fall
within this region (1.25-fold for
GADPH), they are colored blue,
otherwise red
average background on the chip
probesets with present flag
scaling factors for the chips
beta-actin 3':5' ratio
GADPH 3':5' ratio
Affymetrix spike-ins and RNA degradation
Spike-in linearity
RNA degradation plot
Affymetrix: RLE and NUSE
RLE (relative log expression)
NUSE (normalized unscaled standard error)
Agilent / Illumina: density plot and box plot
Scatter plot of log intensity ratios M=log2(R/G) versus average log intensities
A = log2 (R*G), where R and G are the intensities for the sample and control,
respectively
M is a mnemonic for minus, as M = log R – log G
A is mnemonic for add, as A = (log R + log G) / 2
Agilent QC: MA-plot
Exercise 4: Illumina array level quality control
Run Quality control / Illumina for the normalized data
Repeat this for the ”mock-normalized” data and compare the results
(use the Detach button to view the images side by side). Can you see
the effect of normalization?
Quality control at experiment level
(inc. clustering theory)
Principal component analysis (PCA)
Goal
Method
Project a high dimensional space into a lower dimensional space
Compute a variance-covariance
matrix for all variables (genes)
The first principal component is
the linear combination of variables
that maximizes the variance
The linear combination,
orthogonal to the first, that
maximizes variance is the second
principal component. Etc.
Z
Y X
Z-Y
Z-X
Y-X
Explains most of the
variability in the shape
of the pen
X is the first principal
component of the pen
PCA illustration
X-Y
Z
Y X
Z-Y
Z-X
Y-X
Explains most of the
remaining variability in
the shape of the pen
Y is the second principal
component of the pen
PCA illustration, continued
Non-metric multidimensional scaling (NMDS)
Goal
Method
Project a high dimensional space into a lower dimensional space
Compute a distance matrix for all
variables (genes)
Define number of dimensions of
reduced space
Construct the dimensions as to
maximise the similarity of
distances between the high and
lower dimensional space
Compared to PCA: Allows choice
of distance metric
better agreement with
clustering methods
Dendrogram and correlogram
Dendrogram
Correlogram
Hierarchical clustering parameters
Distance method
- Euclidean
- Pearson correlation
- Spearman correlation
- Manhattan correlation
Drawing method
- Single linkage
- Average linkage
- Complete linkage
Pearson correlation Euclidean distance
Hierarchical clustering distance methods
One can either calculate the distance between two pairs of data
sets (e.g. samples) or the similarity between them
Distance methods can yield very different results
Correlations are sensitive to outliers (use Spearman)!
Hierarchical clustering drawing methods
single linkage average linkage complete linkage
Hierarchical clustering (euclidean distance)
calculate
distance
matrix
gene 1 gene 2 gene 3 gene 4
gene 1 0gene 2 2 0gene 3 8 7 0gene 4 10 12 4 0
calculate averages of
most similar
calculate averages of
most similar
gene 1,2 gene 3 gene 4
gene 1,2 0gene 3 7.5 0gene 4 11 4 0
gene 1,2 gene 3,4
gene 1,2 0gene 3,4 9.25 0
calculate
distance
matrix
gene 1 gene 2 gene 3 gene 4
gene 1 0gene 2 2 0gene 3 8 7 0gene 4 10 12 4 0
gene 1,2 gene 3 gene 4
gene 1,2 0gene 3 7.5 0gene 4 11 4 0
gene 1,2 gene 3,4
gene 1,2 0gene 3,4 9.25 0
1 2 3 4
calculate averages of
most similar
calculate averages of
most similar
Dendrogram
Hierarchical clustering (avg. linkage)
1 2 3 4 5 1 2 3 4 5
1 2 3 4 5 1 2 3 4 5
When assessing similarity, look at the branching pattern instead of
sample order
Exercise 5: Experiment level quality control
Run Statistics / NMDS for the normalized data
• Do the groups separate along the first dimension?
Run Statistics / PCA on the normalized data.
• View pca.tsv as ”3D scatter plot for PCA”. Can you see 2 groups?
• Check in variance.tsv how much variance the first principal component
explains?
Run Visualization / Dendrogram for the normalized data
• Do the groups separate well?
Save the analysis session with name sessionTeratospermia.zip
Filtering
Independent filtering
Independent filtering does no take into account the group
information
Removing probes for genes that don’t have any chance of
being differentially expressed
• Not expressed
• Not changing
Reduces the severity of multiple testing correction
• Some controversy on whether filtering should be used or not…
Filtering tools in Chipster
Filter by standard deviation (SD)
• Select the percentage of genes to be filtered out
Filter by coefficient of variation (CV = SD / mean)
• Select the percentage of genes to be filtered out
Filter by flag
• Flag value and number of arrays
Filter by expression
• Select the upper and lower cut-offs
• Select the number of chips required to fulfil this rule
• Select whether to return genes inside or outside the range
Filter by interquartile range (IQR)
• Select the IQR
Venn diagram Select 2-3 datasets and the visualization method Venn diagram
• Can use Venn also for filtering with your own gene list
identifier list
Exercise 6: Filtering
Select the normalized data and play with different filters. In order to
compare the results, set the cutoffs so that you get approximately
the same amount of filtered genes (for example 0.9 for SD and CV,
and 1.1 for IQR)
• Preprocessing / Filter by SD
• Preprocessing / Filter by CV
• Preprocessing / Filter by IQR
Select the result files and compare them using the interactive Venn
diagram visualization
• Save the genes specific to SD filter to a new list
• Save the genes specific to CV filter to a new list
• View both as expression profiles. Is there a difference in
expression levels of the two sets?
Statistical testing
Statistical analysis of microarray data: Why?
Distinguish biological and measurement variation from
treatment effect
Generalisation of results
replication
estimation of uncertainty (variability)
representative sample
statistical inference
sampling inference
Expression
Comparing means (1-2 groups)
student’s t
Comparing means (>2 groups)
1-way ANOVA
Comparing means (multifactor)
2-way ANOVA
Expression
Group A B
Parametric methods Non-parametric methods
Comparing ranks (2 groups)
Comparing ranks (>2 groups)
Kruskal-Wallis
Mann-Whitney
Ranks
group A group B
1 6
2 7
3 8
4 9
5 10
Parametric statistics
0 200 400 600 800 1000 1200
0.0
00
0.0
02
0.0
04
0.0
06
0.0
08
dnorm
(x, m
ean =
250, sd
= 5
0)
x1 x2
s1
s2
H0 : A = B , A - B = 0
H1 : A B
2
2
2
1
2
1
21
n
s
n
s
xxt
Type 1 error, a
Type 2 error, b
Power = 1 - b
Comparing ranks (2 groups)
Comparing ranks (>2 groups)
Kruskal-Wallis
B
Mann-Whitney
Expression
Group
A
Non-parametric statistics
Ranks
group A group B
1 4
2 6
3 7
5 9
8 10
111
2112
)1(** R
nnnnU
222
2122
)1(** R
nnnnU
Benefits
Non-parametric compared to parametric tests
Drawbacks
Do not make any assumptions on data distribution
robust to outliers
allows for cross-experiment comparisons
Lower power than parametric counterpart
Granular distribution of calculated statistic
many genes get the same rank
requires at least 6 samples / group
Multiple testing problem
Problem
1 gene, a = 0.05
false positive incidence = 1 / 20
30 000 genes, a = 0.05
false positive incidence = 1500
Solution
Bonferroni
Holm (step down)
Westfall & Young
Benjamini & Hochberg
more false negatives
more false positives
Multiple testing correction methods
Bonferroni
corrected p-value = p * n
a = a / n
correction
raw p
n
1
Holm (step down)
rank p-values
highest ranked p-value = p * n
next lower ranked p-value = p * (n-1)
.
.
lowest ranked p-value = p (n-(n-1)) = p
correction
raw p
n
1
Multiple testing correction methods II
Benjamini & Hochberg rank p-values from smallest to largest
largest p-value remains unaltered
second largest p-value = p * n / (n-1)
third largest p-value = p * n / (n-2)
.
.
smallest p-value = p * n / (n-n+1) = p * n
raw p
n
1
correction
Comparison of multiple testing correction methods
-3 -2 -1 0 1 2
1 e
-04
1 e
-02
1 e
+0
0
No correction
Log2 (ratio)
-lo
g2
(p
-va
lue
)
-3 -2 -1 0 1 2
1 e
-04
1 e
-02
1 e
+0
0
Bonferroni
Log2 (ratio)
-lo
g2
(p
-va
lue
)
-3 -2 -1 0 1 2
1 e
-04
1 e
-02
1 e
+0
0
FDR
Log2 (ratio)-l
og
2 (
p-v
alu
e)
Exercise 7: Statistical testing
Run different two group tests
• Select the file sd-filter.tsv and run Statistics / Two group test with the default parameter setting
• Repeat the run but change the parameter ”test” to t-test and next to Mann-Whitney. Rename the results to t.tsv and MW.tsv, respectively.
Compare the results with a Venn diagram
• Which method seems most powerful?
• Select the genes common to all three datasets and create a new dataset
Exercise 8: Viewing statistical testing results
View the Empirical Bayes result as an interactive volcano plot
• Select the two-sample.tsv and visualization method volcano plot
Filter based on fold change
• Select the file two-sample.tsv and run the tool Preprocessing / Filter using a column value so that you keep genes that have log2 fold change higher than +/-3 (select FC column, cutoff = 3, and ”outside”). View the result file as an expression profile.
More accurate estimates of effect size and variability
How to improve statistical power?
Variance shrinking
(Empirical Bayes)
Partitioning variability
(ANOVA, linear modeling)
Increase number of replicates (biological)
Randomization
Blocking
Experimentally
Analytically
Improving power with variance shrinking
Concept
Borrow information from other genes with similar expression level and form
a pooled error estimate
How ?
model the error-intensity dependence based on replicate to replicate
comparisons
use a smoothing function to estimate the error for any given intensity
calculate a weighted average between observed gene specific variance
and model-derived variance (pooling)
incorporate the pooled variance estimate in the statistical test (usually t-
or F-test)
Linear modeling: Pairing
unpaired Experimental
Group 1
Experimental
Group 2
2 3
2 4
3 2
1 3
2 3
0.8 0.8
Mean
sd
paired Experimental
Group 1
Experimental
Group 2
Difference
2 3 1
2 3 1
3 4 1
1 2 1
One sample
T-test
Linear modeling: Multiple factors
1 factor
2 factors Experimental
Group 1
Experimental
Group 2
Males
2
3
1
6
7
5
Mean 2 6
Females
8
9
7
4
5
3
Mean 8 4
Experimental
Group 1
Experimental
Group 2
2 5
9 7
1 3
7 5
8 4
3 6
5 5 Mean
Linear modeling: Interaction effect
Group mean
Group
Males Females
exp. group 1
exp. group 2
2
4
6
8
10
Linear modeling in Chipster
Linear modeling: Setting up the model
Annotation
Annotation
Gene annotation = information about biological function, pathway
involvement, chromosal location etc
Annotation information is collected from different biological databases
to a single database by the Bioconductor project
Annotation information is required by certain analysis tools
(annotation, GO/KEGG enrichment, promoter analysis, chromosomal
plots)
• These tools don’t work for those chiptypes which don’t have
Bioconductor annotation packages
Alternative CDF environments for Affymetrix
CDF is a file that links individual probes to gene transcripts
Affymetrix default annotation uses old CDF files that map many
probes to wrong genes
Alternative CDFs fix this problem
In Chipster selecting ”custom chiptype” in Affymetrix
normalization takes altCDFs to use
For more information see
• Dai et al, (2005) Nuc Acids Res, 33(20):e175: Evolving
gene/transcript definitions significantly alter the interpretation of
GeneChip data
• http://brainarray.mbni.med.umich.edu/Brainarray/Database/Cust
omCDF/genomic_curated_CDF.asp
Also a problem with Illumina
Probes are remapped in the R/Bioconductor project
Chipster uses remapped probes
For more information see
• Barbosa-Morais NL, Dunning MJ, Samarajiwa SA, Darot JFJ,
Ritchie ME, Lynch AG, Tavaré S. "A re-annotation pipeline for
Illumina BeadArrays: improving the interpretation of gene
expression data". Nucleic Acids Research, 2009 Nov 18,
doi:10.1093/nar/gkp942
• https://prod.bioinformatics.northwestern.edu/nuID/
Exercise 9: Annotation
Annotate genes
• Select the file column-value-filter.tsv
• Run Annotation / Illumina gene list
• Open the result file annotations.html and click the links in the
gene and pathway columns to read more about one of the
genes
Pathway analysis
Pathway analysis – why?
Statistical tests can yield thousands of differentially expressed
genes
It is difficult to make ”biological” sense out of the result list
Looking at the bigger picture can be helpful, e.g. which
pathways are differentially expressed between the
experimental groups
Databases such as KEGG, GO, Reactome and
ConsensusPathDB provide grouping of genes to pathways,
biological processes, molecular functions, etc
Two approaches to pathway analysis
• Gene set enrichment analysis
• Gene set test
Approach I: Gene set enrichment analysis
Apoptosis (200 genes)
Our list (50 genes)
Genome (30,000 genes)
Our list and apoptosis (10 genes)
H0 : = 30 >> 1 50 10 ____
_______
30000 200 ____
1. Perform a statistical test to find differentially expressed genes
2. Check if the list of differentially expressed genes is ”enriched”
for some pathways
Approach II: Gene set test
1. Do NOT perform
differential gene
expression analysis
2. Group genes to pathways
and perform differential
expression analysis for
the whole pathway
Advantages
• More sensitive than single
gene tests
• Reduced number of tests
-> less multiple testing
correction -> increased
power
ConsensusPathDB
One-stop shop: Integrates pathway information from 32
databases covering
• biochemical pathways
• protein-protein, genetic, metabolic, signaling, gene regulatory
and drug-target interactions
Developed by Ralf Herwig’s group at the Max-Planck Institute
in Berlin
ConsensusPathDB over-representation analysis tool is
integrated in Chipster
• runs on the MPI server in Berlin
GO (Gene Ontology)
Controlled vocabulary of terms for describing gene product
characteristics
3 ontologies
• Biological process
• Molecular function
• Cellular component
Hierarchical structure
KEGG
Kyoto Encyclopedia for Genes and Genomes
Collection of pathway maps representing molecular
interaction and reaction networks for
• Metabolism
• cellular processes
• diseases, etc
Exercise 10: Gene set enrichment analysis
Identify over-represented GO terms
• Select the two-sample.tsv file
• Run Pathways / Hypergeometric test for GO
Extract genes for a specific GO term
• Open the hypergeo.tsv file and copy the GO id number for the top term.
• Select two-sample.tsv and run tool Utilities / Extract genes from GO,
pasting the GO id into the parameter field.
• Open the extracted-from-GO.tsv in Spreadsheet and Expression
profile view.
Identify over-represented ConsensusPathDB pathways
• Select two-sample.tsv and run tool and Pathways / Hypergeometric
test for ConsensusPathDB.
• Click on the hyperlinks in the cpdb.html tile to get more info on a
particular pathway.
Exercise 11: Gene set test
Identify differentially expressed KEGG pathways
• Select the normalized.tsv file and Pathways / Gene set test. Set
the number of groups to visualize = 4
• Explore the results both in tabular fromat and graphically in the
global-test-result-table.tsv and multtest.png files, respectively.
Exercise 12: Saving a workflow
Prune your workflow if necessary (remove cyclic structures)
Select the file normalized.tsv and click on the Workflow / Save
starting from selected. Give your workflow a meaningful name
and save it.
Illumina data analysis summary
Normalization
• lumi method
Quality control at array level: are there outlier arrays?
• density graph and boxplot
Quality control at experiment level: do the sample groups
separate?
• PCA, NMDS, dendrogram
Independent filtering of genes
• e.g. 50% based on coefficient of variation
• Depends on the statistical test to be used later
Statistical testing
• Empirical Bayes method (two group test / linear modeling)
Annotation, pathway analysis, promoter analysis, clustering,
classification…
Next generation sequencing data analysis
(RNA-seq)
Outline
1. Introduction to RNA-seq
2. RNA-seq data analysis
• Quality control, preprocessing
• Alignment (=mapping) to reference genome
• Manipulation of alignment files
• Alignment level quality control
• Visualization of alignments in genome browser
• Quantitation
• Differential expression analysis
Introduction to RNA-seq
What can I measure with RNA-seq?
Which genes are over-expressed (or under-expressed) in
patients vs. healthy controls?
Which genes are correlated with disease progression?
Which genes undergo isoform switching in disease?
Find new genes and isoforms
Etc etc
Is RNA-seq better than microarrays?
+ Wider detection range
+ Can detect new genes and isoforms
- Data analysis is not as established as for microarrays
- Data is voluminous
Typical steps in RNA-seq
http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/RNA-seq.html
RNA-seq data analysis
Gene A Gene B
Align reads to reference genome
Match alignment positions with known gene positions
Count how many reads each gene has
A = 6 B = 11
RNA-seq data analysis: file formats
Gene A Gene B
Fastq (raw reads)
Fasta (ref. genome)
GTF (ref. annotation)
BAM (aligned reads)
A = 6 B = 11 tsv (count table)
RNA-seq data analysis workflow
Quality control, preprocessing
(FastQC, PRINSEQ, Trimmomatic)
Align reads to reference
(TopHat, STAR)
RNA-seq data analysis workflow
De novo assembly
(Trinity, Velvet+Oases)
Align reads to transcripts
(Bowtie)
Reference based
assembly to detect new
transcripts and isoforms
(Cufflinks)
Quantitation
(HTSeq, eXpress, etc)
reference
Annotation
(Blast2GO)
Differential expression analysis
(edgeR, DESeq, Cuffdiff)
reads
Quality control, preprocessing
(FastQC, PRINSEQ, Trimmomatic)
Align reads to reference
(TopHat, STAR)
RNA-seq data analysis workflow today
De novo assembly
(Trinity, Velvet+Oases)
Align reads to transcripts
(Bowtie)
Reference based
assembly to detect new
transcripts and isoforms
(Cufflinks)
Quantitation
(HTSeq, eXpress, etc)
reference
Annotation
(Blast2GO)
Differential expression analysis
(edgeR, DESeq, Cuffdiff)
reads
Things to take into account
Non-uniform coverage along transcripts
• Biases introduced in library construction and sequencing
• polyA capture and polyT priming can cause 3’ bias
• random primers can cause sequence-specific bias
• GC-rich and GC-poor regions can be under-sampled
• Genomic regions have different mappabilities (uniqueness)
Longer transcripts give more counts
RNA composition effect due to sampling:
Quality control of raw reads
What and why?
Potential problems
• low confidence bases, Ns
• sequence specific bias, GC bias
• adapters
• sequence contamination
• …
Knowing about potential problems in your data allows you to
correct for them before you spend a lot of time on analysis
take them into account when interpreting results
Software packages for quality control
FastQC
FastX
PRINSEQ
TagCleaner
...
Raw reads: FASTQ file format
Four lines per read:
• Line 1 begins with a '@' character and is followed by a sequence identifier.
• Line 2 is the sequence.
• Line 3 begins with a '+' character and can be followed by the sequence identifier.
• Line 4 encodes the quality values for the sequence, encoded with a single ASCII
character for brevity.
• Example:
@read name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+ read name
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
http://en.wikipedia.org/wiki/FASTQ_format
Base qualities
If the quality of a base is 30, the probability that it is wrong is 0.001.
• Phred quality score Q = -10 * log10 (probability that the base is wrong)
T C A G T A C T C G
40 40 40 40 40 40 40 40 37 35
Encoded as ASCII characters so that 33 is added to the Phred score
• This ”Sanger” encoding is used by Illumina 1.8+, 454 and SOLiD
• Note that older Illumina data uses different encoding
• Illumina1.3: add 64 to Phred
• Illumina 1.5-1.7: add 64 to Phred, ASCII 66 ”B” means that the whole read segment
has low quality
Base quality encoding systems
http://en.wikipedia.org/wiki/FASTQ_format
Per position base quality (FastQC)
good
ok
bad
Per position base quality (FastQC)
Per position sequence content (FastQC)
Sequence specific bias: Correct sequence but biased
location, typical for Illumina RNA-seq data
Per position sequence content (FastQC)
Preprocessing: Filtering and trimming low
quality reads
Filtering vs trimming
Filtering removes the entire read
Trimming removes only the bad quality bases
• It can remove the entire read, if all bases are bad
Trimming makes reads shorter
• This might not be optimal for some applications
Paired end data: the matching order of the reads in the two files
has to be preserved
• If a read is removed, its pair has to removed as well
What base quality threshold should be used?
No consensus yet
Trade-off between having good quality reads and having enough
sequence
Software packages for preprocessing
FastX
PRINSEQ
TagCleaner
Trimmomatic
Cutadapt
TrimGalore!
...
PRINSEQ filtering possibilities in Chipster
Base quality scores
• Minimum quality score per base
• Mean read quality
Ambiguous bases
• Maximum count/ percentage of Ns that a read is allowed to have
Low complexity
• DUST (score > 7), entropy (score < 70)
Length
• Minimum length of a read
Duplicates
• Exact, reverse complement, or 5’/3’ duplicates
Copes with paired end data
PRINSEQ trimming possibilities in Chipster
Trim based on quality scores
• Minimum quality, look one base at a time
• Minimum (mean) quality in a sliding window
• From 3’ or 5’ end
Trim x bases from left/ right
Trim to length x
Trim polyA/T tails
• Minimum number of A/Ts
• From left or right
Copes with paired end data
We want to find genes which are differentially expressed between two human cell lines, h1-hESC and GM12878
Illumina data from the ENCODE project
• 75 b single-end reads
• For the interest of time, only a small subset of reads is used
• No replicates (Note: you should always have at least 3 replicates!)
Data set for exercises 1-9
Aligning reads to genome/transcriptome
Alignment to reference genome/transcriptome
Goal is to find out where a read originated from
• Challenge: variants, sequencing errors, repetitive sequence
Mapping to
• transcriptome allows you to count hits to known transcripts
• genome allows you to find new genes and transcripts
Many organisms have introns, so RNA-seq reads map to
genome non-contiguously spliced alignments needed
• Difficult because sequence signals at splice sites are limited
and introns can be thousands of bases long
Tens of aligners are available
http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Splice-aware aligners
TopHat (uses Bowtie)
STAR
GSNAP
RUM
MapSplice
...
Nature methods 2013 (10:1185)
Mapping quality
Confidence in read’s point of origin
Depends on many things, including
• uniqueness of the aligned region in the genome
• length of alignment
• number of mismatches and gaps
Expressed in Phred scores, like base qualities
• Q = -10 * log10 (probability that mapping location is wrong)
Bowtie2
Fast and memory efficient aligner
Can make gapped alignments (= can handle indels)
Cannot make spliced alignments, but is used by TopHat2 which can
Two alignment modes:
• End-to-end
• Read is aligned over its entire length
• Maximum alignment score = 0, deduct penalty for each mismatch (less for low
quality base), N, gap opening and gap extension
• Local
• Read ends don’t need to align, if this maximizes the alignment score
• Add bonus to alignment score for each match
Reference (genome) is indexed to speed up the alignment process
TopHat2
Relatively fast and memory efficient spliced aligner
Performs several alignment steps
Uses Bowtie2 end-to-end mode for aligning
• Low tolerance for mismatches
If annotation (GTF file) is available, builds a virtual transcriptome
and aligns reads to that first
Kim et al, Genome Biology 2013
TopHat2 spliced alignment steps
Kim et al, Genome Biology 2013
File format for aligned reads: BAM/SAM
SAM (Sequence Alignment/Map) is a tab-delimited text file. BAM is a
binary form of SAM.
Optional header (lines starting with @)
One line for each alignment, with 11 mandatory fields:
• read name, flag, reference name, position, mapping quality, CIGAR,
mate name, mate position, fragment length, sequence, base qualities
• CIGAR reports match (M), insertion (I), deletion (D), intron (N), etc
Example:
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
• The corresponding alignment
Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
r001 TTAGATAAAGGATA*CTG
BAM file (.bam) and index file (.bai)
BAM files can be sorted by chromosomal coordinates and
indexed for efficient retrieval of reads for a given region.
The index file must have a matching name. (e.g. reads.bam
and reads.bam.bai)
Genome browser requires both BAM and the index file.
The alignment tools in Chipster automatically produce sorted
and indexed BAMs.
When you import BAM files, Chipster asks if you would like to
preproces them (convert SAM to BAM, sort and index BAM).
Manipulating BAM files (SAMtools, Picard)
Convert SAM to BAM, sort and index BAM
• ”Preprocessing” when importing SAM/BAM, runs on your computer.
• The tool available in the ”Utilities” category runs on the server.
Index BAM
Statistics for BAM
• How many reads align to the different chromosomes.
Count alignments in BAM
• How many alignments does the BAM contain.
• Includes an optional mapping quality filter.
Retrieve alignments for a given chromosome/region
• Makes a subset of BAM, e.g. chr1:100-1000, inc quality filter.
Create consensus sequence from BAM
Region file formats: BED
5 obligatory columns: chr, start, end, name, score
0-based, like BAM
Region file formats: GFF/GTF
9 obligatory columns: chr, source, name, start, end, score,
strand, frame, attribute
1-based
Quality control of aligned reads
Quality metrics for aligned reads
How many reads mapped to the reference? How many of them
mapped uniquely?
How many pairs mapped? How many mapped concordantly?
Mapping quality distribution?
Saturation of sequencing depth
• Would more sequencing detect more genes and splice junctions?
Read distribution between different genomic features
• Exonic, intronic, intergenic regions
• Coding, 3’ and 5’ UTR exons
• Protein coding genes, pseudogenes, rRNA, miRNA, etc
Coverage uniformity along transcripts
Quality control programs for aligned reads
RseQC (available in Chipster)
RNA-seqQC
Qualimap
Picards’s CollectRnaSeqMetrics
Visualization of reads and results in
genomic context
Software packages for visualization
Chipster genome browser
IGV
UCSC genome browser
....
Differences in memory consumption, interactivity,
annotations, navigation,...
Chipster Genome Browser
Integrated with Chipster analysis environment
Automatic sorting and indexing of BAM, BED and GTF files
Automatic coverage calculation (total and strand-specific)
Zoom in to nucleotide level
Highlight variants
Jump to locations using BED, GTF and tsv files
View details of selected BED and GTF features
Several views (reads, coverage profile, density graph)
Quantitation
Software for counting aligned reads per
genomic features (genes/exons/transcripts)
HTSeq
Cufflinks
BEDTools
Qualimap
eXpress
...
HTSeq count
Given a BAM file and a list of genomic features (e.g. genes),
counts how many reads map to each feature.
• For RNA-seq the features are typically genes, where each gene is
considered as the union of all its exons.
• Also exons can be considered as features, e.g., in order to check
for alternative splicing.
Features need to be supplied in GTF file
• Note that GTF and BAM must use the same chromosome naming
3 modes to handle reads which overlap several genes
• Union (default)
• Intersection-strict
• Intersection-nonempty
HTSeq count modes
Estimating gene expression at gene level - the isoform switching problem
Trapnell et al. Nature Biotechnology 2013
Differential expression analysis
Things to take into account
Biological replicates are important!
Normalization is required in order to compare expression
between samples
• Different library sizes
• RNA composition bias caused by sampling approach
Model has to account for overdispersion in biological
replicates negative binomial distribution
Raw counts are needed to assess measurement precision
• Counts are the ”the units of evidence” for expression
Multiple testing problem
Software packages for DE analysis
edgeR
DESeq
DEXSeq
Cuffdiff
Ballgown
SAMseq
NOIseq
Limma + voom, limma + vst
...
Comments from comparisons
”Methods based on negative binomial modeling have
improved specificity and sensitivities as well as good
control of false positive errors”
”Cuffdiff performance has reduced sensitivity and
specificity. We postulate that the source of this is related to
the normalization procedure that attempts to account for
both alternative isoform expression and length of
transcripts”
Differential gene expression analysis
Normalization
Dispersion estimation
Log fold change estimation
Statistical testing
Filtering
Multiple testing correction
Differential expression analysis:
Normalization
Normalization
For comparing gene expression between genes within a sample,
normalize for
• Gene length
• Gene GC content
For comparing gene expression between samples, normalize for
• Library size (number of reads obtained)
• RNA composition effect
“FPKM and TC are ineffective and should be definitely
abandoned in the context of differential analysis”
“In the presence of high count genes, only DESeq and
TMM (edgeR) are able to maintain a reasonable false
positive rate without any loss of power”
RPKM and FPKM
Reads (or fragments) per kilobase per million mapped reads.
• 20 kb transcript has 400 counts, library size is 20 million reads
RPKM = (400/20) / 20 = 1
• 0.5 kb transcript has 10 counts, library size is 20 million reads
RPKM = (10/0.5) / 20 = 1
Normalizes for gene length and library size
Can be used only for reporting expression values, not for testing
differential expression
• Raw counts are needed to assess the measurement precision
correctly
Normalization by edgeR and DESeq
Aim to make normalized counts for non-differentially
expressed genes similar between samples
• Do not aim to adjust count distributions between samples
Assume that
• Most genes are not differentially expressed
• Differentially expressed genes are divided equally between
up- and down-regulation
Do not transform data, but use normalization factors within
statistical testing
Normalization by edgeR and DESeq – how?
DESeq(2)
• Take geometric mean of gene’s counts across all samples
• Divide gene’s counts in a sample by the geometric mean
• Take median of these ratios sample’s normalization factor
(applied to read counts)
edgeR
• Select as reference the sample whose upper quartile is closest to
the mean upper quartile
• Log ratio of gene’s counts in sample vs reference M value
• Take weighted trimmed mean of M-values (TMM) normalization
factor (applied to library sizes) • Trim: Exclude genes with high counts or large differences in expression
• Weights are from the delta method on binomial data
Differential expression analysis:
Dispersion estimation
Dispersion
When comparing gene’s expression levels between groups, it
is important to know also its within-group variability
Dispersion = (BCV)2
• BCV = gene’s biological coefficient of variation
• E.g. if gene’s expression typically differs from replicate to
replicate by 20% (so BCV = 0.2), then this gene’s dispersion is
0.22 = 0.04
Note that the variance seen in counts is a sum of 2 things:
• Sample-to-sample variation (dispersion)
• Uncertainty in measuring expression by counting reads
How to estimate dispersion reliably?
RNA-seq experiments typically have only a few replicates
it is difficult to estimate within-group variance reliably
Solution: pool information across genes which are expressed
at similar level
Different approaches
• edgeR
• DESeq
• DESeq2
Dispersion estimation by edgeR
Estimates common dispersion for all genes using a
conditional maximum likelyhood approach
Trended dispersion: takes binned common dispersion and
abundance, and fits a curve though these binned values
Tagwise dispersion: uses
empirical Bayes strategy to
shrink gene-wise dispersions
towards the common/trended
one using a weighted
likelyhood approach genes
that are consistent between
replicates are ranked more
highly
Dispersion estimation by DESeq
Models the observed mean-variance relationship for the genes
using either parametric or local regression
User can choose to use the fitted values always, or only when
they are higher than the genewise value
Dispersion estimation by DESeq2
Estimates genewise dispersions using maximum likelyhood
Fits a curve to capture the dependence of these estimates on
the average expression strength
Uses eBayes approach to shrink genewise values towards the
curve.
Shrinkage depends on
• Sample size
Outliers are not shrunk
Differential expression analysis:
Statistical testing
Statistical testing
DESeq and edgeR
• Two group comparisons
• Exact test for negative binomial distribution.
• Multifactor experiments
• Generalized linear model (GLM) likelyhood ratio test. GLM is an
extension of linear models to non-normally distributed response data.
DESeq2
• Shrinks log fold change estimates toward zero using an
eBayes method
• Shrinkage is stronger when counts are low, dispersion is high, or there
are only a few samples
• GLM based Wald test for significance
• Shrunken estimate of log fold change is divided by its standard error and
the resulting z statistic is compared to a standard normal distribution
Filtering
Filter out genes which have little chance of showing
evidence for significant differential expression
• genes which are not expressed
• genes which are expressed at very low level (low counts are
unreliable)
Reduces the severity of multiple testing adjustment
Should be independent
• do not use information on what group the sample belongs to
DESeq2 selects filtering threshold automatically
edgeR result table
logFC = log2 fold change
logCPM = the average log2 counts per million
Pvalue = the raw p-value
FDR = false discovery rate (Benjamini-Hochberg adjustment
for multiple testing)
DESeq result table
baseMean = mean of the counts divided by the size factors
for both conditions
baseMeanA = mean of the counts divided by the size factors
for condition A
baseMeanB= mean of the counts divided by the size factors
for condition B
fold change = the ratio meanB/meanA
log2FoldChange = log2 of the fold change
pval = the raw p-value
padj = Benjamini-Hochberg adjusted p-value
DESeq2 result table
baseMean = mean of counts (divided by size factors) taken
over all samples
log2FoldChange = log2 fold change (FC = ratio meanB/meanA)
lfcSE = standard error of log2 fold change
stat = Wald statistic
pvalue = raw p-value
padj = Benjamini-Hochberg adjusted p-value
Data exploration using MDS plot
edgeR outputs multidimensional scaling (MDS) plot which
shows the relative similarities between samples
Allows you to see if replicates are consistent and if you can
expect to find differentially expressed genes
Distances correspond to the logFC or biological coefficient
of variation (BCV) between each pair of samples
• Calculated using 500 most heterogenous genes (that have
largest tagwise dispersion treating all libraries as one group)
MDS plot by edgeR
Drosophila data from RNAi knock-down of pasilla gene
• 4 untreated samples
• 2 sequenced single end
• 2 sequenced paired end
• 3 samples treated with RNAi
• 1 sequenced single end
• 2 sequenced paired end
Data set for exercises 10-14