microarray and ngs data analysis with chipster€¦ · 2. static images produced by analysis tools...

209
28.-30.10.2014 Eija Korpelainen, Massimiliano Gentile [email protected] Microarray and NGS data analysis with Chipster

Upload: others

Post on 20-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

28.-30.10.2014

Eija Korpelainen, Massimiliano Gentile

[email protected]

Microarray and NGS data analysis with

Chipster

Page 2: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Course outline

Introduction to Chipster

Microarray data analysis

• Importing microarray data to Chipster

• Normalization

• Quality control (also clustering is covered here)

• Filtering

• Statistical testing

• Pathway analysis

• Saving and sharing workflows

Next generation sequencing data analysis (RNA-seq)

• Quality control

• Preprocessing

• Alignment

• Differential expression analysis

Page 3: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Introduction to Chipster

Page 4: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Provides an easy access to over 300 analysis tools

• No programming or command line experience required

Free, open source software

What can I do with Chipster?

• analyze and integrate high-throughput data

• visualize data efficiently

• share analysis sessions

• save and share automatic workflows

Chipster

Page 5: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 6: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 7: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Interactive visualizations

Page 8: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Analysis tool overview

140 NGS tools for

• RNA-seq

• miRNA-seq

• exome/genome-seq

• ChIP-seq

• FAIRE/DNase-seq

• MeDIP-seq

• CNA-seq

• Metagenomics (16S rRNA)

140 microarray tools for

• gene expression

• miRNA expression

• protein expression

• aCGH

• SNP

• integration of different data

60 tools for sequence analysis

• BLAST, EMBOSS, MAFFT

• Phylip

Page 9: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Tools for QC, processing and mapping

FastQC

PRINSEQ

FastX

TagCleaner

Trimmomatic

Bowtie

TopHat

BWA

Picard

SAMtools

BEDTools

Page 10: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

RNA-seq tools

Quality control

• RseQC

Counting

• HTSeq

• eXpress

Transcript discovery

• Cufflinks

Differential expression

• edgeR

• DESeq

• Cuffdiff

• DEXSeq

Pathway analysis

• ConsensusPathDB

Page 11: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

miRNA-seq tools

Differential expression

• edgeR

• DESeq

Retrieve target genes

• PicTar

• miRBase

• TargetScan

• miRanda

Pathway analysis for targets

• GO

• KEGG

Correlate miRNA and target expression

Page 12: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exome/genome-seq tools

Variant calling

• Samtools

Variant filtering

• VCFtools

Variant annotation

• AnnotateVariant (Bioconductor)

Page 13: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

ChIP-seq and DNase-seq tools

Peak detection

• MACS

• F-seq

Peak filtering

• P-value, no of reads, length

Detect motifs, match to JASPAR

• MotIV, rGADEM

• Dimont

Retrieve nearby genes

Pathway analysis

• GO, ConsensusPathDB

Page 14: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

MeDIP-seq tools

Detect methylation, compare two conditions

• MEDIPS

Page 15: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

CNA-seq tools

Count reads in bins

• Correct for GC content

Segment and call CNA

• Filter for mappability

• Plot profiles

Group comparisons

Clustering

Detect genes in CNA

GO enrichment

Integrate with expression

Page 16: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Metagenomics / 16 S rRNA tools

Taxonomy assignment with Mothur package

• Align reads to 16 S rRNA template

• Filter alignment for empty columns

• Keep unique aligned reads

• Precluster aligned reads

• Remove chimeric reads

• Classify reads to taxonomic units

Statistical analyses using R

• Compare diversity or abundance between groups using several

ANOVA-type of analyses

Page 17: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Technical aspects

Client-server system

• Enough CPU and memory for large analysis jobs

• Centralized maintenance

Easy to install

• Client uses Java Web Start

• Server available as a virtual machine

Page 18: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Acknowledgements to users and contibutors

Page 19: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

More info

[email protected]

http://chipster.csc.fi

Page 20: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Chipster: General functionality

Page 21: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Select data

Select tool category

Select tool (set parameters if necessary) and click run

View results

Mode of operation

Page 22: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Analysis history is saved automatically -you can add tool source code to reports if needed

Page 23: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Task manager

You can run many analysis jobs at the same time

Use Task manager to

• view status

• cancel jobs

• view time

• view parameters

Page 24: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Analysis sessions

In order to continue your work later, you have to save the

analysis session.

Saving the session will save all the files and their

relationships. The session is packed into a single .zip file and

saved on your computer (in the next Chipster version you can

also save it on the server).

Session files allow you to continue the work on another

computer, or share it with a colleague.

You can have multiple analysis sessions saved separately, and

combine them later if needed.

Page 25: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Workflow panel

Shows the relationships of the files

You can move the boxes around, and zoom in and out.

Several files can be selected by keeping the Ctrl key down

Right clicking on the data file allows you to

• Save an individual result file (”Export”)

• Delete

• Link to another data file

• Save workflow

Page 26: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Workflow – reusing and sharing your

analysis pipeline

You can save your analysis steps as a reusable automatic

”macro”, which you can apply to another dataset

When you save a workflow, all the analysis steps and their

parameters are saved as a script file, which you can share with

other users

Page 27: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Saving and using workflows

Select the starting point for your

workflow

Select ”Workflow/ Save starting

from selected”

Save the workflow file on your

computer with a meaningful name

• Don’t change the ending (.bsh)

To run a workflow, select

• Workflow->Open and run

• Workflow->Run recent (if you

saved the workflow recently).

Page 28: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Problems? Send us a support request -request includes the error message and link to analysis

session (optional)

Page 29: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Data visualizations

Page 30: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Visualizing the data

Data visualization panel

• Maximize and redraw for better viewing

• Detach = open in a separate window, allows you to view several images at the same time

Two types of visualizations

1. Interactive visualizations produced by the client program

• Select the visualization method from the pulldown menu

• Save by right clicking on the image

2. Static images produced by analysis tools

• Select from Analysis tools/ Visualisation

• View by double clicking on the image file

• Save by right clicking on the file name and choosing ”Export”

Page 31: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Interactive visualizations by the client

Spreadsheet

Histogram

Venn diagram

Scatterplot

3D scatterplot

Volcano plot

Expression profiles

Clustered profiles

Hierarchical clustering

SOM clustering

Genome browser

Available actions:

• Select genes and create a gene list

• Change titles, colors etc

• Zoom in/out

Page 32: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 33: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 34: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Static images produced by R/Bioconductor

Box plot

Histogram

Heatmap

Idiogram

Chromosomal position

Correlogram

Dendrogram

NMDS plot

QC stats plot

RNA degradation plot

K-means clustering

SOM-clustering

etc

Page 35: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 36: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Importing microarray data to Chipster

Page 37: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Importing raw data

Affymetrix CEL-files are recognized by Chipster automatically

You can import Illumina GenomeStudio files to Chipster as is, if all the

samples are in one file

• Need columns AVG, BEAD_STDERR, Avg_NBEADS and DetectionPval

• Note: Use lumi normalization for data imported this way

You can import any tab delimited files (e.g. Agilent) using the Import tool

Page 38: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Import tool

Step 1: Define title row, header and footer

Page 39: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Import tool

Step 2: Define columns

Page 40: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Which columns to mark in Import tool?

http://chipster.csc.fi/manual/import-help.html

Agilent

• Identifier (ProbeName)

• Sample (rMeanSignal or rMedianSignal)

• Sample background (rBGMedianSignal)

• Control (gMeanSignal or gMedianSignal)

• Control background (gBGMedianSignal)

• Flag (Control type)

Illumina BeadStudio version 3 file and GenomeStudio files

• Identifier (ProbeID)

• Sample (text “AVG”)

Illumina BeadStudio version 1-2 file

• Identifier (TargetID)

• Sample (text “AVG”)

1-color 2-color

Page 41: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Importing normalized data

The data should be tab delimited and preferably log-transformed

• If your data is not log-transformed, you can transform it with the tool

“Change interpretation”

Bring the data file in using the Import tool. Mark the identifier column

and all the sample columns.

Run the tool Normalize / Process prenormalized. This

• Converts data to Chipster format by adding ”chip.” to expression

column names

• Creates the phenodata file. Indicate chiptype using names given at

http://chipster.csc.fi/manual/supported-chips.html

Page 42: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 1. Import Illumina data in two ways

Import Illumina data directly and normalize with the lumi tool

• Select File / Import files.

• Select the file IlluminaHuman6v1_BS1.tsv.

• In the Import files -window choose the action "Import directly“

• Select the file and the tool Normalization/ Illumina – lumi pipeline.

Set the chiptype parameter to Human and click Run.

Import the same file using the Import tool and normalize

• In the Import files -window choose the action “Use Import tool”

• Click the Mark header –button and paint the header rows.

• Click the Mark title row –button and click on the title row of the data.

• Click Next. Click the Identifier –button and click on the TargetID

column. Click the Sample –button and click on all the AVG columns.

• Select the 8 files and tool Normalization/ Illumina. Set parameters: • Illumina software version = BeadStudio1

• identifier type = TargetID and chiptype = Human-6v1.

Page 43: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Microarray data analysis flow chart

Page 44: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization

Page 45: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization methods for different arrays

Affymetrix

• Background correction + expression estimation +

summarization

• RMA (Robust Multichip Averaging) uses only PM probes,

fits a model to them, and gives out expression values after

quantile normalization and median polishing

Agilent

• Background correction + averaging duplicate spots +

normalization

Illumina

• Background correction (in GenomeStudio) + normalization

NOTE: After normalization the expression values are in log2-scale

• Hence a fold change of 2 means 4-fold up, -2 means 4-fold

down, etc

Page 46: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization of Affymetrix data

Normalization = background correction + expression estimation + summarization

Methods: MAS5, Plier, RMA, GCRMA, Li-Wong

• RMA is the default, and works rather nicely if you have more than a few chips

• GCRMA is similar to RMA, but takes also GC% content into account

• MAS5 is the older Affymetrix method, Plier is a newer one

• Li-Wong is the method implemented in dChip

Variance stabilization makes the variance over all the chips similar

• Works only with MAS5 and Plier (the other methods log2-transform the data,

which corrects for the same phenomenon)

Custom chiptype

• Because some of the Affymetrix probe-to-transcript mappings are not correct,

probes have been remapped in the Bioconductor project. To use these

remappings (alt CDF environments), select the matching chiptype from the

Custom chiptype menu.

Page 47: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization of Agilent data

Background treatment often generates negative values, which are

coded as missing values after log2-transformation.

• Usual subtract option does this

• Using normexp + offset 50 will not generate negative values,

and it gives good estimates (the best method reported)

Loess removes curvature from the data (recommended)

Before After

Page 48: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Agilent normalization parameters in Chipster

Background correction

• Background treatment

None, Subtract, Edwards, Normexp

• Background offset

0 or 50

Normalize chips

• None, median, loess

Normalize genes

• None, scale (to median), quantile

• not needed for statistical analysis

Chiptype

• You must give this information in order to use annotation-based

tools later

Page 49: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Illumina normalization in Chipster: 2 options

Normalization / Illumina

• Normalization method None, scale, quantile, vsn (variance stabilizing normalization)

• Illumina software version GenomeStudio or BeadStudio3, BeadStudio2, BeadStudio1

• Chiptype

• Identifier type Target ID, Probe ID (for BeadStudio version 3 data and newer)

Normalization / Illumina - Lumi pipeline

• Transformation none, vst (variance stabilizing transformation), log2

• Normalize chips none, rsn (robust spline normalization), loess, quantile, vsn

• Chiptype human, mouse, rat

• Background correction none, bgAdjust.Affy

Page 50: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quantile normalization procedure

Sample A Sample B Sample C

Gene 1 20 10 350

Gene 2 100 500 200

Gene 3 300 400 30

Sample A Sample B Sample C Median

Quantile 1 20 10 30 20

Quantile 2 100 400 200 200

Quantile 3 300 500 350 350

Sample A Sample B Sample C Median

Quantile 1 20 20 20 20

Quantile 2 200 200 200 200

Quantile 3 350 350 350 350

Sample A Sample B Sample C

Gene 1 20 20 350

Gene 2 200 350 200

Gene 3 350 200 20

1. Raw data 2. Rank data within sample and calculate median intensity for each row

3. Replace the raw data of each row with its median (or mean) intensity 4. Restore the original gene order

Page 51: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Checking normalization

Page 52: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 2: Normalize Illumina data

Import folder IlluminaTeratospermiaHuman6v1_BS1 using the Import

tool like in exercise 1.

In the workflow view, click on the box ”13 files” to select all of them

Select the tool Normalization / Illumina. Set parameters so that

• Illumina software version = BeadStudio1

• identifier type = TargetID

• chiptype = Human-6v1

Repeat the run as before, but change

• Normalization method = none

• This ”mock-normalization” gives you unnormalized data that you can use

as a comparison when looking at the normalization effect later in the QC

exercise.

Page 53: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Phenodata: Describing the experimental

setup

Page 54: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Phenodata file

Experimental setup is described with a phenodata file, which is created during

normalization

Fill in the group column with numbers describing your experimental groups

• e.g. 1 = control sample, 2 = cancer sample

• necessary for the statistical tests to work

• note that you can sort a column by clicking on its title

Page 55: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

How to describe pairing, replicates, time, etc?

You can add columns to the phenodata file

Time

• Use either real time values or recode with group codes

Replicates

• All the replicates are coded with the same number

Pairing

• Pairs are coded using the same number for each pair

Page 56: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Phenodata for prenormalized data

If you bring in previously created normalized data and phenodata:

• Choose ”import directly” in the Import tool

• Right click on normalized data, choose ”Link to phenodata”

If you brought in normalized data and need to create phenodata for it:

• Use Import tool to bring the data in

• Use the tool Normalize / Process prenormalized to create

phenodata

• Remember to give the chiptype

• Fill in the group column

Page 57: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 3: Describe the experiment

Double click the phenodata file of the real normalization

In the phenodata editor, enter 1 in the group column for the control

samples and 2 for the teratospermia affected samples.

For the interest of visualizations later on, give shorter names for the

samples in the Description column

• Name the teratospermia samples t1, t2,….t8

• Name the controls samples c1, c2 ,..c5

Page 58: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control at array level

Page 59: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control tools in Chipster

Quality control category (array level QC)

• Affymetrix basic (RNA degradation + Affy QC)

• Affymetrix RLE and NUSE fit a model to expression values

• Illumina (density plot + boxplot)

• Agilent 2-color (MA-plot + density plot + boxplot)

• Agilent 1-color (density plot + boxplot)

Visualization category (experiment level QC)

• Dendrogram

• Correlogram

Statistics category (experiment level QC)

• Non-metric multidimensional scaling (NMDS)

• Principal components analysis (PCA)

Page 60: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

QC for Affymetrix data

Affymetrix QC metrix

RNA degradation

Spike-in controls linearity

RLE (relative log expression)

NUSE (normalized unscaled standard error plot)

Page 61: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Affymetrix QC metrix

Blue area shows where scaling

factors are less than 3-fold of the

mean.

•If the scaling factors or ratios fall

within this region (1.25-fold for

GADPH), they are colored blue,

otherwise red

average background on the chip

probesets with present flag

scaling factors for the chips

beta-actin 3':5' ratio

GADPH 3':5' ratio

Page 62: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Affymetrix spike-ins and RNA degradation

Spike-in linearity

RNA degradation plot

Page 63: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Affymetrix: RLE and NUSE

RLE (relative log expression)

NUSE (normalized unscaled standard error)

Page 64: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Agilent / Illumina: density plot and box plot

Page 65: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Scatter plot of log intensity ratios M=log2(R/G) versus average log intensities

A = log2 (R*G), where R and G are the intensities for the sample and control,

respectively

M is a mnemonic for minus, as M = log R – log G

A is mnemonic for add, as A = (log R + log G) / 2

Agilent QC: MA-plot

Page 66: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 4: Illumina array level quality control

Run Quality control / Illumina for the normalized data

Repeat this for the ”mock-normalized” data and compare the results

(use the Detach button to view the images side by side). Can you see

the effect of normalization?

Page 67: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control at experiment level

(inc. clustering theory)

Page 68: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Principal component analysis (PCA)

Goal

Method

Project a high dimensional space into a lower dimensional space

Compute a variance-covariance

matrix for all variables (genes)

The first principal component is

the linear combination of variables

that maximizes the variance

The linear combination,

orthogonal to the first, that

maximizes variance is the second

principal component. Etc.

Page 69: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Z

Y X

Z-Y

Z-X

Y-X

Explains most of the

variability in the shape

of the pen

X is the first principal

component of the pen

PCA illustration

X-Y

Page 70: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Z

Y X

Z-Y

Z-X

Y-X

Explains most of the

remaining variability in

the shape of the pen

Y is the second principal

component of the pen

PCA illustration, continued

Page 71: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Non-metric multidimensional scaling (NMDS)

Goal

Method

Project a high dimensional space into a lower dimensional space

Compute a distance matrix for all

variables (genes)

Define number of dimensions of

reduced space

Construct the dimensions as to

maximise the similarity of

distances between the high and

lower dimensional space

Compared to PCA: Allows choice

of distance metric

better agreement with

clustering methods

Page 72: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Dendrogram and correlogram

Dendrogram

Correlogram

Page 73: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Hierarchical clustering parameters

Distance method

- Euclidean

- Pearson correlation

- Spearman correlation

- Manhattan correlation

Drawing method

- Single linkage

- Average linkage

- Complete linkage

Page 74: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Pearson correlation Euclidean distance

Hierarchical clustering distance methods

One can either calculate the distance between two pairs of data

sets (e.g. samples) or the similarity between them

Page 75: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Distance methods can yield very different results

Page 76: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Correlations are sensitive to outliers (use Spearman)!

Page 77: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Hierarchical clustering drawing methods

single linkage average linkage complete linkage

Page 78: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Hierarchical clustering (euclidean distance)

calculate

distance

matrix

gene 1 gene 2 gene 3 gene 4

gene 1 0gene 2 2 0gene 3 8 7 0gene 4 10 12 4 0

calculate averages of

most similar

calculate averages of

most similar

gene 1,2 gene 3 gene 4

gene 1,2 0gene 3 7.5 0gene 4 11 4 0

gene 1,2 gene 3,4

gene 1,2 0gene 3,4 9.25 0

Page 79: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

calculate

distance

matrix

gene 1 gene 2 gene 3 gene 4

gene 1 0gene 2 2 0gene 3 8 7 0gene 4 10 12 4 0

gene 1,2 gene 3 gene 4

gene 1,2 0gene 3 7.5 0gene 4 11 4 0

gene 1,2 gene 3,4

gene 1,2 0gene 3,4 9.25 0

1 2 3 4

calculate averages of

most similar

calculate averages of

most similar

Dendrogram

Hierarchical clustering (avg. linkage)

Page 80: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

1 2 3 4 5 1 2 3 4 5

1 2 3 4 5 1 2 3 4 5

When assessing similarity, look at the branching pattern instead of

sample order

Page 81: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 5: Experiment level quality control

Run Statistics / NMDS for the normalized data

• Do the groups separate along the first dimension?

Run Statistics / PCA on the normalized data.

• View pca.tsv as ”3D scatter plot for PCA”. Can you see 2 groups?

• Check in variance.tsv how much variance the first principal component

explains?

Run Visualization / Dendrogram for the normalized data

• Do the groups separate well?

Save the analysis session with name sessionTeratospermia.zip

Page 82: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Filtering

Page 83: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Independent filtering

Independent filtering does no take into account the group

information

Removing probes for genes that don’t have any chance of

being differentially expressed

• Not expressed

• Not changing

Reduces the severity of multiple testing correction

• Some controversy on whether filtering should be used or not…

Page 84: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Filtering tools in Chipster

Filter by standard deviation (SD)

• Select the percentage of genes to be filtered out

Filter by coefficient of variation (CV = SD / mean)

• Select the percentage of genes to be filtered out

Filter by flag

• Flag value and number of arrays

Filter by expression

• Select the upper and lower cut-offs

• Select the number of chips required to fulfil this rule

• Select whether to return genes inside or outside the range

Filter by interquartile range (IQR)

• Select the IQR

Page 85: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Venn diagram Select 2-3 datasets and the visualization method Venn diagram

• Can use Venn also for filtering with your own gene list

identifier list

Page 86: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 6: Filtering

Select the normalized data and play with different filters. In order to

compare the results, set the cutoffs so that you get approximately

the same amount of filtered genes (for example 0.9 for SD and CV,

and 1.1 for IQR)

• Preprocessing / Filter by SD

• Preprocessing / Filter by CV

• Preprocessing / Filter by IQR

Select the result files and compare them using the interactive Venn

diagram visualization

• Save the genes specific to SD filter to a new list

• Save the genes specific to CV filter to a new list

• View both as expression profiles. Is there a difference in

expression levels of the two sets?

Page 87: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Statistical testing

Page 88: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Statistical analysis of microarray data: Why?

Distinguish biological and measurement variation from

treatment effect

Generalisation of results

replication

estimation of uncertainty (variability)

representative sample

statistical inference

sampling inference

Page 89: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Expression

Comparing means (1-2 groups)

student’s t

Comparing means (>2 groups)

1-way ANOVA

Comparing means (multifactor)

2-way ANOVA

Expression

Group A B

Parametric methods Non-parametric methods

Comparing ranks (2 groups)

Comparing ranks (>2 groups)

Kruskal-Wallis

Mann-Whitney

Ranks

group A group B

1 6

2 7

3 8

4 9

5 10

Page 90: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Parametric statistics

0 200 400 600 800 1000 1200

0.0

00

0.0

02

0.0

04

0.0

06

0.0

08

dnorm

(x, m

ean =

250, sd

= 5

0)

x1 x2

s1

s2

H0 : A = B , A - B = 0

H1 : A B

2

2

2

1

2

1

21

n

s

n

s

xxt

Type 1 error, a

Type 2 error, b

Power = 1 - b

Page 91: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Comparing ranks (2 groups)

Comparing ranks (>2 groups)

Kruskal-Wallis

B

Mann-Whitney

Expression

Group

A

Non-parametric statistics

Ranks

group A group B

1 4

2 6

3 7

5 9

8 10

111

2112

)1(** R

nnnnU

222

2122

)1(** R

nnnnU

Page 92: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Benefits

Non-parametric compared to parametric tests

Drawbacks

Do not make any assumptions on data distribution

robust to outliers

allows for cross-experiment comparisons

Lower power than parametric counterpart

Granular distribution of calculated statistic

many genes get the same rank

requires at least 6 samples / group

Page 93: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Multiple testing problem

Problem

1 gene, a = 0.05

false positive incidence = 1 / 20

30 000 genes, a = 0.05

false positive incidence = 1500

Solution

Bonferroni

Holm (step down)

Westfall & Young

Benjamini & Hochberg

more false negatives

more false positives

Page 94: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Multiple testing correction methods

Bonferroni

corrected p-value = p * n

a = a / n

correction

raw p

n

1

Holm (step down)

rank p-values

highest ranked p-value = p * n

next lower ranked p-value = p * (n-1)

.

.

lowest ranked p-value = p (n-(n-1)) = p

correction

raw p

n

1

Page 95: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Multiple testing correction methods II

Benjamini & Hochberg rank p-values from smallest to largest

largest p-value remains unaltered

second largest p-value = p * n / (n-1)

third largest p-value = p * n / (n-2)

.

.

smallest p-value = p * n / (n-n+1) = p * n

raw p

n

1

correction

Page 96: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Comparison of multiple testing correction methods

-3 -2 -1 0 1 2

1 e

-04

1 e

-02

1 e

+0

0

No correction

Log2 (ratio)

-lo

g2

(p

-va

lue

)

-3 -2 -1 0 1 2

1 e

-04

1 e

-02

1 e

+0

0

Bonferroni

Log2 (ratio)

-lo

g2

(p

-va

lue

)

-3 -2 -1 0 1 2

1 e

-04

1 e

-02

1 e

+0

0

FDR

Log2 (ratio)-l

og

2 (

p-v

alu

e)

Page 97: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 7: Statistical testing

Run different two group tests

• Select the file sd-filter.tsv and run Statistics / Two group test with the default parameter setting

• Repeat the run but change the parameter ”test” to t-test and next to Mann-Whitney. Rename the results to t.tsv and MW.tsv, respectively.

Compare the results with a Venn diagram

• Which method seems most powerful?

• Select the genes common to all three datasets and create a new dataset

Page 98: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 8: Viewing statistical testing results

View the Empirical Bayes result as an interactive volcano plot

• Select the two-sample.tsv and visualization method volcano plot

Filter based on fold change

• Select the file two-sample.tsv and run the tool Preprocessing / Filter using a column value so that you keep genes that have log2 fold change higher than +/-3 (select FC column, cutoff = 3, and ”outside”). View the result file as an expression profile.

Page 99: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

More accurate estimates of effect size and variability

How to improve statistical power?

Variance shrinking

(Empirical Bayes)

Partitioning variability

(ANOVA, linear modeling)

Increase number of replicates (biological)

Randomization

Blocking

Experimentally

Analytically

Page 100: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Improving power with variance shrinking

Concept

Borrow information from other genes with similar expression level and form

a pooled error estimate

How ?

model the error-intensity dependence based on replicate to replicate

comparisons

use a smoothing function to estimate the error for any given intensity

calculate a weighted average between observed gene specific variance

and model-derived variance (pooling)

incorporate the pooled variance estimate in the statistical test (usually t-

or F-test)

Page 101: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Linear modeling: Pairing

unpaired Experimental

Group 1

Experimental

Group 2

2 3

2 4

3 2

1 3

2 3

0.8 0.8

Mean

sd

paired Experimental

Group 1

Experimental

Group 2

Difference

2 3 1

2 3 1

3 4 1

1 2 1

One sample

T-test

Page 102: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Linear modeling: Multiple factors

1 factor

2 factors Experimental

Group 1

Experimental

Group 2

Males

2

3

1

6

7

5

Mean 2 6

Females

8

9

7

4

5

3

Mean 8 4

Experimental

Group 1

Experimental

Group 2

2 5

9 7

1 3

7 5

8 4

3 6

5 5 Mean

Page 103: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Linear modeling: Interaction effect

Group mean

Group

Males Females

exp. group 1

exp. group 2

2

4

6

8

10

Page 104: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Linear modeling in Chipster

Page 105: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Linear modeling: Setting up the model

Page 106: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Annotation

Page 107: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Annotation

Gene annotation = information about biological function, pathway

involvement, chromosal location etc

Annotation information is collected from different biological databases

to a single database by the Bioconductor project

Annotation information is required by certain analysis tools

(annotation, GO/KEGG enrichment, promoter analysis, chromosomal

plots)

• These tools don’t work for those chiptypes which don’t have

Bioconductor annotation packages

Page 108: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 109: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Alternative CDF environments for Affymetrix

CDF is a file that links individual probes to gene transcripts

Affymetrix default annotation uses old CDF files that map many

probes to wrong genes

Alternative CDFs fix this problem

In Chipster selecting ”custom chiptype” in Affymetrix

normalization takes altCDFs to use

For more information see

• Dai et al, (2005) Nuc Acids Res, 33(20):e175: Evolving

gene/transcript definitions significantly alter the interpretation of

GeneChip data

• http://brainarray.mbni.med.umich.edu/Brainarray/Database/Cust

omCDF/genomic_curated_CDF.asp

Page 110: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Also a problem with Illumina

Probes are remapped in the R/Bioconductor project

Chipster uses remapped probes

For more information see

• Barbosa-Morais NL, Dunning MJ, Samarajiwa SA, Darot JFJ,

Ritchie ME, Lynch AG, Tavaré S. "A re-annotation pipeline for

Illumina BeadArrays: improving the interpretation of gene

expression data". Nucleic Acids Research, 2009 Nov 18,

doi:10.1093/nar/gkp942

• https://prod.bioinformatics.northwestern.edu/nuID/

Page 111: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 9: Annotation

Annotate genes

• Select the file column-value-filter.tsv

• Run Annotation / Illumina gene list

• Open the result file annotations.html and click the links in the

gene and pathway columns to read more about one of the

genes

Page 112: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Pathway analysis

Page 113: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Pathway analysis – why?

Statistical tests can yield thousands of differentially expressed

genes

It is difficult to make ”biological” sense out of the result list

Looking at the bigger picture can be helpful, e.g. which

pathways are differentially expressed between the

experimental groups

Databases such as KEGG, GO, Reactome and

ConsensusPathDB provide grouping of genes to pathways,

biological processes, molecular functions, etc

Two approaches to pathway analysis

• Gene set enrichment analysis

• Gene set test

Page 114: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Approach I: Gene set enrichment analysis

Apoptosis (200 genes)

Our list (50 genes)

Genome (30,000 genes)

Our list and apoptosis (10 genes)

H0 : = 30 >> 1 50 10 ____

_______

30000 200 ____

1. Perform a statistical test to find differentially expressed genes

2. Check if the list of differentially expressed genes is ”enriched”

for some pathways

Page 115: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Approach II: Gene set test

1. Do NOT perform

differential gene

expression analysis

2. Group genes to pathways

and perform differential

expression analysis for

the whole pathway

Advantages

• More sensitive than single

gene tests

• Reduced number of tests

-> less multiple testing

correction -> increased

power

Page 116: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

ConsensusPathDB

One-stop shop: Integrates pathway information from 32

databases covering

• biochemical pathways

• protein-protein, genetic, metabolic, signaling, gene regulatory

and drug-target interactions

Developed by Ralf Herwig’s group at the Max-Planck Institute

in Berlin

ConsensusPathDB over-representation analysis tool is

integrated in Chipster

• runs on the MPI server in Berlin

Page 117: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

GO (Gene Ontology)

Controlled vocabulary of terms for describing gene product

characteristics

3 ontologies

• Biological process

• Molecular function

• Cellular component

Hierarchical structure

Page 118: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

KEGG

Kyoto Encyclopedia for Genes and Genomes

Collection of pathway maps representing molecular

interaction and reaction networks for

• Metabolism

• cellular processes

• diseases, etc

Page 119: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 10: Gene set enrichment analysis

Identify over-represented GO terms

• Select the two-sample.tsv file

• Run Pathways / Hypergeometric test for GO

Extract genes for a specific GO term

• Open the hypergeo.tsv file and copy the GO id number for the top term.

• Select two-sample.tsv and run tool Utilities / Extract genes from GO,

pasting the GO id into the parameter field.

• Open the extracted-from-GO.tsv in Spreadsheet and Expression

profile view.

Identify over-represented ConsensusPathDB pathways

• Select two-sample.tsv and run tool and Pathways / Hypergeometric

test for ConsensusPathDB.

• Click on the hyperlinks in the cpdb.html tile to get more info on a

particular pathway.

Page 120: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 11: Gene set test

Identify differentially expressed KEGG pathways

• Select the normalized.tsv file and Pathways / Gene set test. Set

the number of groups to visualize = 4

• Explore the results both in tabular fromat and graphically in the

global-test-result-table.tsv and multtest.png files, respectively.

Page 121: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Exercise 12: Saving a workflow

Prune your workflow if necessary (remove cyclic structures)

Select the file normalized.tsv and click on the Workflow / Save

starting from selected. Give your workflow a meaningful name

and save it.

Page 122: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Illumina data analysis summary

Normalization

• lumi method

Quality control at array level: are there outlier arrays?

• density graph and boxplot

Quality control at experiment level: do the sample groups

separate?

• PCA, NMDS, dendrogram

Independent filtering of genes

• e.g. 50% based on coefficient of variation

• Depends on the statistical test to be used later

Statistical testing

• Empirical Bayes method (two group test / linear modeling)

Annotation, pathway analysis, promoter analysis, clustering,

classification…

Page 123: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Next generation sequencing data analysis

(RNA-seq)

Page 124: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Outline

1. Introduction to RNA-seq

2. RNA-seq data analysis

• Quality control, preprocessing

• Alignment (=mapping) to reference genome

• Manipulation of alignment files

• Alignment level quality control

• Visualization of alignments in genome browser

• Quantitation

• Differential expression analysis

Page 125: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Introduction to RNA-seq

Page 126: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

What can I measure with RNA-seq?

Which genes are over-expressed (or under-expressed) in

patients vs. healthy controls?

Which genes are correlated with disease progression?

Which genes undergo isoform switching in disease?

Find new genes and isoforms

Etc etc

Page 127: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Is RNA-seq better than microarrays?

+ Wider detection range

+ Can detect new genes and isoforms

- Data analysis is not as established as for microarrays

- Data is voluminous

Page 128: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Typical steps in RNA-seq

http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/RNA-seq.html

Page 129: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

RNA-seq data analysis

Gene A Gene B

Align reads to reference genome

Match alignment positions with known gene positions

Count how many reads each gene has

A = 6 B = 11

Page 130: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

RNA-seq data analysis: file formats

Gene A Gene B

Fastq (raw reads)

Fasta (ref. genome)

GTF (ref. annotation)

BAM (aligned reads)

A = 6 B = 11 tsv (count table)

Page 131: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 132: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

RNA-seq data analysis workflow

Page 133: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control, preprocessing

(FastQC, PRINSEQ, Trimmomatic)

Align reads to reference

(TopHat, STAR)

RNA-seq data analysis workflow

De novo assembly

(Trinity, Velvet+Oases)

Align reads to transcripts

(Bowtie)

Reference based

assembly to detect new

transcripts and isoforms

(Cufflinks)

Quantitation

(HTSeq, eXpress, etc)

reference

Annotation

(Blast2GO)

Differential expression analysis

(edgeR, DESeq, Cuffdiff)

reads

Page 134: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control, preprocessing

(FastQC, PRINSEQ, Trimmomatic)

Align reads to reference

(TopHat, STAR)

RNA-seq data analysis workflow today

De novo assembly

(Trinity, Velvet+Oases)

Align reads to transcripts

(Bowtie)

Reference based

assembly to detect new

transcripts and isoforms

(Cufflinks)

Quantitation

(HTSeq, eXpress, etc)

reference

Annotation

(Blast2GO)

Differential expression analysis

(edgeR, DESeq, Cuffdiff)

reads

Page 135: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Things to take into account

Non-uniform coverage along transcripts

• Biases introduced in library construction and sequencing

• polyA capture and polyT priming can cause 3’ bias

• random primers can cause sequence-specific bias

• GC-rich and GC-poor regions can be under-sampled

• Genomic regions have different mappabilities (uniqueness)

Longer transcripts give more counts

RNA composition effect due to sampling:

Page 136: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control of raw reads

Page 137: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

What and why?

Potential problems

• low confidence bases, Ns

• sequence specific bias, GC bias

• adapters

• sequence contamination

• …

Knowing about potential problems in your data allows you to

correct for them before you spend a lot of time on analysis

take them into account when interpreting results

Page 138: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Software packages for quality control

FastQC

FastX

PRINSEQ

TagCleaner

...

Page 139: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Raw reads: FASTQ file format

Four lines per read:

• Line 1 begins with a '@' character and is followed by a sequence identifier.

• Line 2 is the sequence.

• Line 3 begins with a '+' character and can be followed by the sequence identifier.

• Line 4 encodes the quality values for the sequence, encoded with a single ASCII

character for brevity.

• Example:

@read name

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+ read name

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

http://en.wikipedia.org/wiki/FASTQ_format

Page 140: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Base qualities

If the quality of a base is 30, the probability that it is wrong is 0.001.

• Phred quality score Q = -10 * log10 (probability that the base is wrong)

T C A G T A C T C G

40 40 40 40 40 40 40 40 37 35

Encoded as ASCII characters so that 33 is added to the Phred score

• This ”Sanger” encoding is used by Illumina 1.8+, 454 and SOLiD

• Note that older Illumina data uses different encoding

• Illumina1.3: add 64 to Phred

• Illumina 1.5-1.7: add 64 to Phred, ASCII 66 ”B” means that the whole read segment

has low quality

Page 141: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Base quality encoding systems

http://en.wikipedia.org/wiki/FASTQ_format

Page 142: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Per position base quality (FastQC)

good

ok

bad

Page 143: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Per position base quality (FastQC)

Page 144: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Per position sequence content (FastQC)

Page 145: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Sequence specific bias: Correct sequence but biased

location, typical for Illumina RNA-seq data

Per position sequence content (FastQC)

Page 146: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Preprocessing: Filtering and trimming low

quality reads

Page 147: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Filtering vs trimming

Filtering removes the entire read

Trimming removes only the bad quality bases

• It can remove the entire read, if all bases are bad

Trimming makes reads shorter

• This might not be optimal for some applications

Paired end data: the matching order of the reads in the two files

has to be preserved

• If a read is removed, its pair has to removed as well

Page 148: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

What base quality threshold should be used?

No consensus yet

Trade-off between having good quality reads and having enough

sequence

Page 149: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Software packages for preprocessing

FastX

PRINSEQ

TagCleaner

Trimmomatic

Cutadapt

TrimGalore!

...

Page 150: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

PRINSEQ filtering possibilities in Chipster

Base quality scores

• Minimum quality score per base

• Mean read quality

Ambiguous bases

• Maximum count/ percentage of Ns that a read is allowed to have

Low complexity

• DUST (score > 7), entropy (score < 70)

Length

• Minimum length of a read

Duplicates

• Exact, reverse complement, or 5’/3’ duplicates

Copes with paired end data

Page 151: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

PRINSEQ trimming possibilities in Chipster

Trim based on quality scores

• Minimum quality, look one base at a time

• Minimum (mean) quality in a sliding window

• From 3’ or 5’ end

Trim x bases from left/ right

Trim to length x

Trim polyA/T tails

• Minimum number of A/Ts

• From left or right

Copes with paired end data

Page 152: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

We want to find genes which are differentially expressed between two human cell lines, h1-hESC and GM12878

Illumina data from the ENCODE project

• 75 b single-end reads

• For the interest of time, only a small subset of reads is used

• No replicates (Note: you should always have at least 3 replicates!)

Data set for exercises 1-9

Page 153: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Aligning reads to genome/transcriptome

Page 154: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Alignment to reference genome/transcriptome

Goal is to find out where a read originated from

• Challenge: variants, sequencing errors, repetitive sequence

Mapping to

• transcriptome allows you to count hits to known transcripts

• genome allows you to find new genes and transcripts

Many organisms have introns, so RNA-seq reads map to

genome non-contiguously spliced alignments needed

• Difficult because sequence signals at splice sites are limited

and introns can be thousands of bases long

Page 155: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Tens of aligners are available

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Page 156: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Splice-aware aligners

TopHat (uses Bowtie)

STAR

GSNAP

RUM

MapSplice

...

Nature methods 2013 (10:1185)

Page 157: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Mapping quality

Confidence in read’s point of origin

Depends on many things, including

• uniqueness of the aligned region in the genome

• length of alignment

• number of mismatches and gaps

Expressed in Phred scores, like base qualities

• Q = -10 * log10 (probability that mapping location is wrong)

Page 158: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Bowtie2

Fast and memory efficient aligner

Can make gapped alignments (= can handle indels)

Cannot make spliced alignments, but is used by TopHat2 which can

Two alignment modes:

• End-to-end

• Read is aligned over its entire length

• Maximum alignment score = 0, deduct penalty for each mismatch (less for low

quality base), N, gap opening and gap extension

• Local

• Read ends don’t need to align, if this maximizes the alignment score

• Add bonus to alignment score for each match

Reference (genome) is indexed to speed up the alignment process

Page 159: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

TopHat2

Relatively fast and memory efficient spliced aligner

Performs several alignment steps

Uses Bowtie2 end-to-end mode for aligning

• Low tolerance for mismatches

If annotation (GTF file) is available, builds a virtual transcriptome

and aligns reads to that first

Kim et al, Genome Biology 2013

Page 160: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

TopHat2 spliced alignment steps

Kim et al, Genome Biology 2013

Page 161: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 162: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

File format for aligned reads: BAM/SAM

SAM (Sequence Alignment/Map) is a tab-delimited text file. BAM is a

binary form of SAM.

Optional header (lines starting with @)

One line for each alignment, with 11 mandatory fields:

• read name, flag, reference name, position, mapping quality, CIGAR,

mate name, mate position, fragment length, sequence, base qualities

• CIGAR reports match (M), insertion (I), deletion (D), intron (N), etc

Example:

@HD VN:1.3 SO:coordinate

@SQ SN:ref LN:45

r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

• The corresponding alignment

Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT

r001 TTAGATAAAGGATA*CTG

Page 163: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 164: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

BAM file (.bam) and index file (.bai)

BAM files can be sorted by chromosomal coordinates and

indexed for efficient retrieval of reads for a given region.

The index file must have a matching name. (e.g. reads.bam

and reads.bam.bai)

Genome browser requires both BAM and the index file.

The alignment tools in Chipster automatically produce sorted

and indexed BAMs.

When you import BAM files, Chipster asks if you would like to

preproces them (convert SAM to BAM, sort and index BAM).

Page 165: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Manipulating BAM files (SAMtools, Picard)

Convert SAM to BAM, sort and index BAM

• ”Preprocessing” when importing SAM/BAM, runs on your computer.

• The tool available in the ”Utilities” category runs on the server.

Index BAM

Statistics for BAM

• How many reads align to the different chromosomes.

Count alignments in BAM

• How many alignments does the BAM contain.

• Includes an optional mapping quality filter.

Retrieve alignments for a given chromosome/region

• Makes a subset of BAM, e.g. chr1:100-1000, inc quality filter.

Create consensus sequence from BAM

Page 166: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Region file formats: BED

5 obligatory columns: chr, start, end, name, score

0-based, like BAM

Page 167: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Region file formats: GFF/GTF

9 obligatory columns: chr, source, name, start, end, score,

strand, frame, attribute

1-based

Page 168: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control of aligned reads

Page 169: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality metrics for aligned reads

How many reads mapped to the reference? How many of them

mapped uniquely?

How many pairs mapped? How many mapped concordantly?

Mapping quality distribution?

Saturation of sequencing depth

• Would more sequencing detect more genes and splice junctions?

Read distribution between different genomic features

• Exonic, intronic, intergenic regions

• Coding, 3’ and 5’ UTR exons

• Protein coding genes, pseudogenes, rRNA, miRNA, etc

Coverage uniformity along transcripts

Page 170: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quality control programs for aligned reads

RseQC (available in Chipster)

RNA-seqQC

Qualimap

Picards’s CollectRnaSeqMetrics

Page 171: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Visualization of reads and results in

genomic context

Page 172: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Software packages for visualization

Chipster genome browser

IGV

UCSC genome browser

....

Differences in memory consumption, interactivity,

annotations, navigation,...

Page 173: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Chipster Genome Browser

Integrated with Chipster analysis environment

Automatic sorting and indexing of BAM, BED and GTF files

Automatic coverage calculation (total and strand-specific)

Zoom in to nucleotide level

Highlight variants

Jump to locations using BED, GTF and tsv files

View details of selected BED and GTF features

Several views (reads, coverage profile, density graph)

Page 174: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 175: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 176: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 177: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Quantitation

Page 178: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Software for counting aligned reads per

genomic features (genes/exons/transcripts)

HTSeq

Cufflinks

BEDTools

Qualimap

eXpress

...

Page 179: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

HTSeq count

Given a BAM file and a list of genomic features (e.g. genes),

counts how many reads map to each feature.

• For RNA-seq the features are typically genes, where each gene is

considered as the union of all its exons.

• Also exons can be considered as features, e.g., in order to check

for alternative splicing.

Features need to be supplied in GTF file

• Note that GTF and BAM must use the same chromosome naming

3 modes to handle reads which overlap several genes

• Union (default)

• Intersection-strict

• Intersection-nonempty

Page 180: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

HTSeq count modes

Page 181: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Estimating gene expression at gene level - the isoform switching problem

Trapnell et al. Nature Biotechnology 2013

Page 182: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Differential expression analysis

Page 183: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Things to take into account

Biological replicates are important!

Normalization is required in order to compare expression

between samples

• Different library sizes

• RNA composition bias caused by sampling approach

Model has to account for overdispersion in biological

replicates negative binomial distribution

Raw counts are needed to assess measurement precision

• Counts are the ”the units of evidence” for expression

Multiple testing problem

Page 184: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 185: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Software packages for DE analysis

edgeR

DESeq

DEXSeq

Cuffdiff

Ballgown

SAMseq

NOIseq

Limma + voom, limma + vst

...

Page 186: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,
Page 187: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Comments from comparisons

”Methods based on negative binomial modeling have

improved specificity and sensitivities as well as good

control of false positive errors”

”Cuffdiff performance has reduced sensitivity and

specificity. We postulate that the source of this is related to

the normalization procedure that attempts to account for

both alternative isoform expression and length of

transcripts”

Page 188: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Differential gene expression analysis

Normalization

Dispersion estimation

Log fold change estimation

Statistical testing

Filtering

Multiple testing correction

Page 189: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Differential expression analysis:

Normalization

Page 190: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization

For comparing gene expression between genes within a sample,

normalize for

• Gene length

• Gene GC content

For comparing gene expression between samples, normalize for

• Library size (number of reads obtained)

• RNA composition effect

Page 191: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

“FPKM and TC are ineffective and should be definitely

abandoned in the context of differential analysis”

“In the presence of high count genes, only DESeq and

TMM (edgeR) are able to maintain a reasonable false

positive rate without any loss of power”

Page 192: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

RPKM and FPKM

Reads (or fragments) per kilobase per million mapped reads.

• 20 kb transcript has 400 counts, library size is 20 million reads

RPKM = (400/20) / 20 = 1

• 0.5 kb transcript has 10 counts, library size is 20 million reads

RPKM = (10/0.5) / 20 = 1

Normalizes for gene length and library size

Can be used only for reporting expression values, not for testing

differential expression

• Raw counts are needed to assess the measurement precision

correctly

Page 193: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization by edgeR and DESeq

Aim to make normalized counts for non-differentially

expressed genes similar between samples

• Do not aim to adjust count distributions between samples

Assume that

• Most genes are not differentially expressed

• Differentially expressed genes are divided equally between

up- and down-regulation

Do not transform data, but use normalization factors within

statistical testing

Page 194: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Normalization by edgeR and DESeq – how?

DESeq(2)

• Take geometric mean of gene’s counts across all samples

• Divide gene’s counts in a sample by the geometric mean

• Take median of these ratios sample’s normalization factor

(applied to read counts)

edgeR

• Select as reference the sample whose upper quartile is closest to

the mean upper quartile

• Log ratio of gene’s counts in sample vs reference M value

• Take weighted trimmed mean of M-values (TMM) normalization

factor (applied to library sizes) • Trim: Exclude genes with high counts or large differences in expression

• Weights are from the delta method on binomial data

Page 195: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Differential expression analysis:

Dispersion estimation

Page 196: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Dispersion

When comparing gene’s expression levels between groups, it

is important to know also its within-group variability

Dispersion = (BCV)2

• BCV = gene’s biological coefficient of variation

• E.g. if gene’s expression typically differs from replicate to

replicate by 20% (so BCV = 0.2), then this gene’s dispersion is

0.22 = 0.04

Note that the variance seen in counts is a sum of 2 things:

• Sample-to-sample variation (dispersion)

• Uncertainty in measuring expression by counting reads

Page 197: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

How to estimate dispersion reliably?

RNA-seq experiments typically have only a few replicates

it is difficult to estimate within-group variance reliably

Solution: pool information across genes which are expressed

at similar level

Different approaches

• edgeR

• DESeq

• DESeq2

Page 198: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Dispersion estimation by edgeR

Estimates common dispersion for all genes using a

conditional maximum likelyhood approach

Trended dispersion: takes binned common dispersion and

abundance, and fits a curve though these binned values

Tagwise dispersion: uses

empirical Bayes strategy to

shrink gene-wise dispersions

towards the common/trended

one using a weighted

likelyhood approach genes

that are consistent between

replicates are ranked more

highly

Page 199: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Dispersion estimation by DESeq

Models the observed mean-variance relationship for the genes

using either parametric or local regression

User can choose to use the fitted values always, or only when

they are higher than the genewise value

Page 200: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Dispersion estimation by DESeq2

Estimates genewise dispersions using maximum likelyhood

Fits a curve to capture the dependence of these estimates on

the average expression strength

Uses eBayes approach to shrink genewise values towards the

curve.

Shrinkage depends on

• Sample size

Outliers are not shrunk

Page 201: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Differential expression analysis:

Statistical testing

Page 202: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Statistical testing

DESeq and edgeR

• Two group comparisons

• Exact test for negative binomial distribution.

• Multifactor experiments

• Generalized linear model (GLM) likelyhood ratio test. GLM is an

extension of linear models to non-normally distributed response data.

DESeq2

• Shrinks log fold change estimates toward zero using an

eBayes method

• Shrinkage is stronger when counts are low, dispersion is high, or there

are only a few samples

• GLM based Wald test for significance

• Shrunken estimate of log fold change is divided by its standard error and

the resulting z statistic is compared to a standard normal distribution

Page 203: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Filtering

Filter out genes which have little chance of showing

evidence for significant differential expression

• genes which are not expressed

• genes which are expressed at very low level (low counts are

unreliable)

Reduces the severity of multiple testing adjustment

Should be independent

• do not use information on what group the sample belongs to

DESeq2 selects filtering threshold automatically

Page 204: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

edgeR result table

logFC = log2 fold change

logCPM = the average log2 counts per million

Pvalue = the raw p-value

FDR = false discovery rate (Benjamini-Hochberg adjustment

for multiple testing)

Page 205: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

DESeq result table

baseMean = mean of the counts divided by the size factors

for both conditions

baseMeanA = mean of the counts divided by the size factors

for condition A

baseMeanB= mean of the counts divided by the size factors

for condition B

fold change = the ratio meanB/meanA

log2FoldChange = log2 of the fold change

pval = the raw p-value

padj = Benjamini-Hochberg adjusted p-value

Page 206: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

DESeq2 result table

baseMean = mean of counts (divided by size factors) taken

over all samples

log2FoldChange = log2 fold change (FC = ratio meanB/meanA)

lfcSE = standard error of log2 fold change

stat = Wald statistic

pvalue = raw p-value

padj = Benjamini-Hochberg adjusted p-value

Page 207: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Data exploration using MDS plot

edgeR outputs multidimensional scaling (MDS) plot which

shows the relative similarities between samples

Allows you to see if replicates are consistent and if you can

expect to find differentially expressed genes

Distances correspond to the logFC or biological coefficient

of variation (BCV) between each pair of samples

• Calculated using 500 most heterogenous genes (that have

largest tagwise dispersion treating all libraries as one group)

Page 208: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

MDS plot by edgeR

Page 209: Microarray and NGS data analysis with Chipster€¦ · 2. Static images produced by analysis tools ... • You must give this information in order to use annotation-based ... none,

Drosophila data from RNAi knock-down of pasilla gene

• 4 untreated samples

• 2 sequenced single end

• 2 sequenced paired end

• 3 samples treated with RNAi

• 1 sequenced single end

• 2 sequenced paired end

Data set for exercises 10-14