rna-seq data analysis - dkfz · pdf file1 rna-seq data analysis rna-seq data analysis 1....

1

RNA-seq data analysis

RNA-seq data analysis1. Introduction to RNA-seq2. Quality control, preprocessing3. Alignment to reference4. Quantitation5. Differential expression analysis

• Normalization• Dispersion estimation• Statistical testing

2

Introduction to RNA-seq

RNA-seq for transcriptome analysis

What genes are expressed / differentially expressed?Advantages if compared to microarrays• Able to detect novel transcripts and isoforms• Increased dynamic range• High reproducibility, low background• Detection of gene fusions and SNPs

But• Based on sampling transcript’s expression signal depends on

expression level of other transcripts• Microarrays are based on hybridization probe intensities are independent of

each other

3

Typical steps in RNA-seq

http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/RNA-seq.html

Things to take into account

Unequal coverage along transcript• depends on library construction procedure

Sequencing bias• GC-rich and GC-poor regions can be under-sampled• Not so random priming

Longer transcripts give more countsRNA composition effect due to sampling:

4

Quality control, preprocessing(FastQC, PRINSEQ,)

Align reads to reference(TopHat, Star)

RNA-seq data analysis workflow

De novo assembly(Trinity, Velvet+Oases)

Align reads to transcripts

Reference based assembly to detect new transcripts and isoforms(Cufflinks)

Quantitation(HTSeq, Cuffdiff, etc)

reference

reads

Annotation(Blast2GO)

Differential expression analysis(edgeR, DESeq, Cuffdiff)

Quality control, preprocessing(FastQC, PRINSEQ,)

Align reads to reference(TopHat, Star)

RNA-seq data analysis today

De novo assembly(Trinity, Velvet+Oases)

Align reads to transcripts

Reference based assembly to detect new transcripts and isoforms(Cufflinks)

Quantitation(HTSeq, Cuffdiff, etc)

reference

reads

Annotation(Blast2GO)

Differential expression analysis(edgeR, DESeq, Cuffdiff)

5

Quality control

What and why?

Potential problems

• low-quality sequences

• sequencing artifacts

• sequence contamination

• Examples: low confidence bases, Ns, duplicate reads,

sequence bias, adapters, another organism…

Knowing about potential problems in your data allows you to

correct for them before you spend a lot of time on analysis

take them into account when interpreting results

6

Software packages for quality control

FastQC

FastX

PRINSEQ

TagCleaner

....

Things to check

Quality • Per base quality• Per read quality

Composition• Per base composition• GC content and profile

Contaminant identification• Overrepresented sequences and k-mers• Duplicate levels

7

Base qualities

Phred quality score Q = -10 * log10 (probability that the base is wrong)

Encoded as ASCII characters so that 33 is added to the Phred score• This ”Sanger” encoding is used by Illumina 1.8+, 454 and SOLiD • Note: older Illumina encoding added 64 to Phred

Base qualities

8

Per base sequence quality plot (FastQC)

good

so and so

bad

Quality drops suddenly

Problem in the flow cell -> trim the sequences

9

Per base sequence content (FastQC)

Base composition bias

Correct sequence but biased location, typical for Illumina RNA-seq data

10

Base composition bias II

Library has a restriction site at the frontA single sequence makes up of 20% of the library

Human data for 2 cell lines (h1-hESC and GM12878) from the ENCODE project• 76 b single-end reads, no replicatesDrosophila data from experiment studying the role of the pasilla splicing factor using RNAi• 4 untreated samples (2 sequenced single end and 2 paired end)• 3 treated samples (1 sequenced single end and 2 paired end)

Data used in exercises

11

Exercise 1: Quality controlImport file h1-hESC_RNAseq.fastq. View it as text. Run the tool Quality control / Read quality with FastQC. • how many reads are there and how long are they? • what quality encoding is used? • is the base quality good all along the reads?

Preprocessing: Filtering and trimming low quality reads

12

Filter and trim low quality reads: FastX

Filter sequences based on quality• Decide what is the minimum quality value allowed (e.g. 20)• Decide what percentage of bases in a read should have at least

this quality

Trim a certain number of bases from all reads• Decide how many bases should be trimmed and from which end

Filter reads for adapters, ambiguous nucleotides (N) and length• Give the adapter sequence and decide what is the minimum

allowed sequence length after clipping

Filter low quality reads: PRINSEQFilter sequences based on quality scores• Min/ max quality score per base, mean of quality scores

Filter for low complexity• DUST (score 1-100, > 7 means low complexity)• Entropy (score 1-100, < 70 means low complexity)

Filter for Ns• Maximum count/ percentage of Ns that a read is allowed to have

Filter for length • Min/ max length of a read

Filter for duplicates• Exact, reverse complement, or 5’/3’ duplicates

Filter for several criteria• All above, and possibility to get filtered pairs for paired end data

13

Trim low quality reads: PRINSEQ

Trim based on quality scores• Minimum or mean quality score• In a sliding window• From 3’ or 5’ end

Trim polyA/T tails• Minimum number of A/Ts• From left or right

Trim based on several criteria• All above• Trim x bases from left/ right• Trim to length x

Exercise 2: Trim and filter readsTrim reads from 3’ end based on quality• Select the file h1-hESC_RNAseq.fastq and the tool Utilities /

Trim reads for several criteria. Set parameters so that 3’ ends are trimmed when the mean quality in a 3 base window is less than 20.

• Inspect the log file and figure out how many reads were discarded.

Select the file trimmed.fastq and run the tool Quality control / Read quality with PRINSEQ. • Select the result file reads-stats.html and visualization method

Open in external web browser. Does the quality look better? How is the read length distribution now?

Filter out reads that are shorter than 50 bases.• Select trimmed.fastq and the tool Filtering / Filter reads for

length. How many reads get discarded?

14

Aligning (=mapping) reads to reference

Splice-aware aligners

TopHat (uses Bowtie)GSNAPStarRUMMapSplice

15

Bowtie2 conceptsEnd-to-end alignment mode• Read is aligned over its entire length• Maximum alignment score = 0, deduct penalty for each mismatch

(less for low quality base), N, gap opening and gap extension

Local alignment mode• Read ends don’t need to align, if this maximizes the alignment score• Add bonus to alignment score for each match

Differences to Bowtie1: Can handle indels, Ns and pairs where one mate doesn’t align

TopH

at2

wor

kflo

w

16

Mapping qualityConfidence in read’s point of originDepends on • length of alignment• number of mismatches and gaps• uniqueness of the aligned region in the genome

Expressed in Phred scores, like base qualities• Q = -10 * log10 (probability that read was mapped to a wrong

location)• Marked in numbers

Exercise 3: Align reads to reference genomeSelect accepted.fastq and run the tool Alignment / Bowtie2 for single end reads. • how many alignments are there and how many of them have

mapping quality higher than 1 (use the tool Utilities / Count alignments in BAM)

Visualize alignment in genome browser• Select the BAM file and visualization method Genome browser• Select the human genome and set the location to gene CNN2

and click Go.• Untick the box “reads” and set the coverage scale to 1000.• Is the coverage even along the transcript? Can you find SNPs?

17

Quantitation

Software for counting aligned reads per genomic features (genes/exons/transcripts)

HTSeqCuffdiffBEDTools

18

HTSeq countGiven a BAM file and a list of genomic features, counts how many reads map to each feature.• For RNA-seq the features are typically genes, where each

gene is considered as the union of all its exons. • Also exons can be considered as features, e.g., in order to

check for alternative splicing.

Features need to be supplied in GTF file• Note that the chromosome naming has to be the same as in

BAM

3 modes to handle reads which overlap several genes• Union• Intersection-strict• Intersection-nonempty

HTSeq count modes

19

Exercise 4: Count reads per genes

Select the BAM file and run RNA-seq / Map aligned reads to genes with HTSeq so that the minimum alignment quality is 1.

Select htseq-counts.tsv and run Filtering / Filter table by column value.• how many genes have more than 10 counts?

Differential expression analysis

20

Things to take into accountNormalization is required in order to compare expression

between samples

• Different library sizes

• RNA composition bias caused by sampling approach

Model has to account for overdispersion in biological

replicates negative binomial distribution

Raw counts are needed to assess measurement precision

Multiple testing problem

Software packages for DE analysis

edgeR

DESeq

DEXSeq

Cuffdiff

BaySeq

SAMseq

NOIseq

Limma + voom, limma + vst

...

21

Comments from comparisons”Methods based on negative binomial modeling have

improved specificity and sensitivities as well as good

control of false positive errors”

”Cuffdiff performance has reduced sensitivity and

specificity. We postulate that the source of this is related to

the normalization procedure that attempts to account for

both alternative isoform expression and length of

transcripts”

22

Differential expression analysis:Normalization

NormalizationFor comparing gene expression within sample, normalize for

• Gene length

• Gene GC content

For comparing gene expression between samples, normalize for

• Library size (number of reads obtained)

• RNA composition effect

23

“FPKM and TC are ineffective and should be definitely abandoned in the context of differential analysis”

“In the presence of high count genes, only DESeq and TMM (edgeR) are able to maintain a reasonable false positive rate without any loss of power”

RPKM and FPKM Reads/fragments per kilobase per million mapped reads

• Example: 2000 bp transcript has 600 counts, library size is

30 million reads RPKM = (600/2) / 30 = 10

Normalizes for gene length and library size

Can be used only for reporting expression values, not for

testing differential expression

• Raw counts are needed to assess the measurement

precision correctly

24

Normalization by edgeR and DESeqAim to make normalized counts for non-differentially

expressed genes similar between samples

• Do not aim to adjust count distributions between samples

Assume that

• Most genes are not differentially expressed

• Differentially expressed genes are divided equally between

up- and down-regulation

Do not transform data, but use normalization factors within

statistical testing

Normalization by edgeR and DESeq – how?DESeq• Take geometric mean of gene’s counts across all samples• Divide gene’s counts in a sample by the geometric mean• Take median of these ratios sample’s normalization factor

(applied to read counts)

edgeR• Select as reference the sample whose upper quartile is closest to

the mean upper quartile• Log ratio of gene’s counts in sample vs reference M value• Take weighted trimmed mean of M-values (TMM) normalization

factor (applied to library sizes)• Trim: Exclude genes with high counts or large differences in expression• Weights are from the delta method on binomial data

25

Filtering

FilteringFilter out genes which have little chance of showing significant evidence for differential expression• E.g. genes which are not expressed in either experimental

group

Reduces the severity of multiple testing adjustment

Should be independent of experimental setup• do not use information on what group the sample belongs to

26

Differential expression analysis:Dispersion estimation

DispersionDispersion = (BCV)2

• BCV = gene’s biological coefficient of variation• E.g. if gene’s expression typically differs from replicate to

replicate by 20%, this gene’s dispersion is 0.22 = 0.04

Note that the variance seen in counts is a sum of 2 things:• Sample-to-sample variation (dispersion)• Uncertainty in measuring expression by counting reads

27

Dispersion estimation by edgeR and DESeqDESeq• Models the observed mean-variance relationship for the genes

using either parametric or local regression• User can choose to use the fitted values always, or only when

they are higher than the genewise value

edgeR• Estimates common dispersion for all genes using a conditional

maximum likelyhood approach • Trended dispersion: takes binned common dispersion and

abundance, and fits a curve though these binned values• Tagwise dispersion: uses empirical Bayes strategy to shrink

gene-wise dispersions towards the common one using a weighted likelyhood approach genes that are consistent between replicates are ranked more highly

28

Data exploration using MDS plotedgeR outputs multidimensional scaling (MDS) plot which shows the relative similarities between samples

Allows you to see if replicates are consistent and if you can expect to find differentially expressed genes

Distances correspond to the biological coefficient of variation between each pair of samples• Calculated using 500 most heterogenous genes (that have

largest tagwise dispersion treating all libraries as one group)

29

MDS plot by edgeR

Differential expression analysis:Statistical testing

30

Statistical testing by DESeq and edgeRTwo group comparisons• Exact test for negative binomial distribution

Multifactor experiments• Generalized linear model (GLM) likelyhood ratio test

• GLM = extension of linear models to nonnormally distributed response data

Exercise 5: Analyze differential expression with edgeR classic

Select File / Open session and the file pasilla.zip. Inspect the phenodata. • This is a two-group comparison with 7 samples. Some samples

were sequenced single end, some paired end.Select the file counts.tsv and run the tool RNA-seq / Differential expression using edgeR.• do the samples separate according to group or something else

in the MDS plot? • how many differentially expressed genes do you get? • is common dispersion a good approximation of genewise

dispersions, judged by the dispersion plot? • how big fold change is required for DE genes (see MA plot)? • how does the p-value distribution look like in the p-value plot?

31

Exercise 6: Include filtering in edgeR classic

Repeat the run so that you analyze only genes which are expressed in at least 3 samples.• how does the number of DE genes change and why? • did the p-value distribution change? • did the MDS plot change?

Exercise 7: Include read type info in edgeR

Select the file counts.tsv and run the tool RNA-seq / Differential expression using edgeR for multivariate experiments so that you• analyze only genes which are expressed in at least 3 samples and• incorporate the readtype info to analysis: Add readtype column as

main effect 2.

Filter the file edger-glm.tsv using Filtering / Filter table by column value (column = PValue.group, cutoff = 0.05)• how many genes were analyzed and how many are DE?

rna-seq data analysis - dkfz · pdf file1 rna-seq data analysis rna-seq data analysis 1....

Documents