RNA-seq data analysis
1. Introduction to RNA-seq
2. Quality control, preprocessing
3. Alignment to reference
4. Quantitation
5. Differential expression analysis
   • Normalization
   • Dispersion estimation
   • Statistical testing
Introduction to RNA-seq
RNA-seq for transcriptome analysis
What genes are expressed / differentially expressed?
Advantages compared to microarrays
• Able to detect novel transcripts and isoforms
• Increased dynamic range
• High reproducibility, low background
• Detection of gene fusions and SNPs
But
• Based on sampling -> a transcript’s expression signal depends on the expression level of other transcripts
• Microarrays are based on hybridization -> probe intensities are independent of each other
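The sampling effect can be illustrated with a small arithmetic sketch. It assumes reads are drawn in proportion to transcript abundance and ignores transcript length; the gene names and abundance values are hypothetical.

```python
def expected_counts(abundances, depth):
    """Expected read counts when `depth` reads are sampled in proportion to abundance."""
    total = sum(abundances.values())
    return {gene: depth * a / total for gene, a in abundances.items()}

# Hypothetical abundances (arbitrary units); only geneC changes between conditions.
control = {"geneA": 100, "geneB": 100, "geneC": 100}
treated = {"geneA": 100, "geneB": 100, "geneC": 800}

depth = 3000  # same sequencing depth for both libraries
counts_control = expected_counts(control, depth)
counts_treated = expected_counts(treated, depth)

print(counts_control)  # geneA and geneB each get a third of the reads
print(counts_treated)  # geneA and geneB get fewer reads although their abundance is unchanged
```

geneA and geneB lose reads only because geneC takes a larger share of the fixed number of reads, which is the dependence between transcripts described above.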
Typical steps in RNA-seq
http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/RNA-seq.html
Things to take into account
Unequal coverage along transcript
• depends on library construction procedure
Sequencing bias
• GC-rich and GC-poor regions can be under-sampled
• Not so random priming
Longer transcripts give more counts
RNA composition effect due to sampling
RNA-seq data analysis workflow
[Workflow diagram. Inputs: reads, reference. Steps and example tools:]
• Quality control, preprocessing (FastQC, PRINSEQ)
• Align reads to reference (TopHat, STAR)
• De novo assembly (Trinity, Velvet+Oases)
• Align reads to transcripts
• Reference-based assembly to detect new transcripts and isoforms (Cufflinks)
• Quantitation (HTSeq, Cuffdiff, etc.)
• Annotation (Blast2GO)
• Differential expression analysis (edgeR, DESeq, Cuffdiff)
RNA-seq data analysis today
[Workflow diagram repeated from the previous slide]
Quality control
What and why?
Potential problems
• low-quality sequences
• sequencing artifacts
• sequence contamination
• Examples: low confidence bases, Ns, duplicate reads, sequence bias, adapters, another organism…
Knowing about potential problems in your data allows you to
• correct for them before you spend a lot of time on analysis
• take them into account when interpreting results
Software packages for quality control
FastQC
FastX
PRINSEQ
TagCleaner
....
Things to check
Quality
• Per base quality
• Per read quality
Composition
• Per base composition
• GC content and profile
Contaminant identification
• Overrepresented sequences and k-mers
• Duplicate levels
Base qualities
Phred quality score Q = -10 * log10 (probability that the base is wrong)
Encoded as ASCII characters so that 33 is added to the Phred score
• This "Sanger" encoding is used by Illumina 1.8+, 454 and SOLiD
• Note: older Illumina encoding added 64 to Phred
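As a minimal sketch of the encoding described above, the following decodes a FASTQ quality string and converts Phred scores back to error probabilities. The function names and the example quality string are illustrative only.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (offset 33 = Sanger encoding)."""
    return [ord(ch) - offset for ch in quality_string]

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p): probability that the base call is wrong."""
    return 10 ** (-q / 10)

quals = decode_qualities("II5+!")  # hypothetical quality string
for q in quals:
    print(q, phred_to_error_prob(q))
```

For data in the older Illumina encoding, the same function would be called with offset=64.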
Per base sequence quality plot (FastQC)
[Example plots labelled: good, so and so, bad]
Quality drops suddenly
Problem in the flow cell -> trim the sequences
Per base sequence content (FastQC)
Base composition bias
Correct sequence but biased location, typical for Illumina RNA-seq data
Base composition bias II
Library has a restriction site at the front
A single sequence makes up 20% of the library
Data used in exercises
Human data for 2 cell lines (h1-hESC and GM12878) from the ENCODE project
• 76 b single-end reads, no replicates
Drosophila data from an experiment studying the role of the pasilla splicing factor using RNAi
• 4 untreated samples (2 sequenced single end and 2 paired end)
• 3 treated samples (1 sequenced single end and 2 paired end)
Exercise 1: Quality control
Import file h1-hESC_RNAseq.fastq. View it as text. Run the tool Quality control / Read quality with FastQC.
• how many reads are there and how long are they?
• what quality encoding is used?
• is the base quality good all along the reads?
Preprocessing: Filtering and trimming low quality reads
Filter and trim low quality reads: FastX
Filter sequences based on quality
• Decide what is the minimum quality value allowed (e.g. 20)
• Decide what percentage of bases in a read should have at least this quality
Trim a certain number of bases from all reads
• Decide how many bases should be trimmed and from which end
Filter reads for adapters, ambiguous nucleotides (N) and length
• Give the adapter sequence and decide what is the minimum allowed sequence length after clipping
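The quality filter described above can be sketched as follows. This is not FastX code, only an illustration of the two decisions (minimum quality and percentage of bases); the function name and the default of 90 percent are assumptions.

```python
def passes_quality_filter(qualities, min_quality=20, min_percent=90):
    """Keep a read if at least `min_percent` of its bases have quality >= `min_quality`."""
    if not qualities:
        return False
    good = sum(1 for q in qualities if q >= min_quality)
    return 100 * good / len(qualities) >= min_percent

read_ok = [30] * 9 + [10]       # 90% of bases reach quality 20
read_bad = [30] * 8 + [10] * 2  # only 80% of bases reach quality 20

print(passes_quality_filter(read_ok))
print(passes_quality_filter(read_bad))
```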
Filter low quality reads: PRINSEQ
Filter sequences based on quality scores
• Min/max quality score per base, mean of quality scores
Filter for low complexity
• DUST (score 1-100, > 7 means low complexity)
• Entropy (score 1-100, < 70 means low complexity)
Filter for Ns
• Maximum count/percentage of Ns that a read is allowed to have
Filter for length
• Min/max length of a read
Filter for duplicates
• Exact, reverse complement, or 5’/3’ duplicates
Filter for several criteria
• All above, and possibility to get filtered pairs for paired end data
Trim low quality reads: PRINSEQ
Trim based on quality scores
• Minimum or mean quality score
• In a sliding window
• From 3’ or 5’ end
Trim polyA/T tails
• Minimum number of A/Ts
• From left or right
Trim based on several criteria
• All above
• Trim x bases from left/right
• Trim to length x
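Sliding-window trimming from the 3’ end can be sketched as below. This is an illustrative implementation, not PRINSEQ's own code: it assumes the window is moved one base at a time and that trimming stops as soon as the window mean reaches the threshold.

```python
def trim_3prime(sequence, qualities, threshold=20, window=3):
    """Remove bases from the 3' end while the mean quality of the last `window` bases is below `threshold`."""
    end = len(sequence)
    while end > 0:
        start = max(0, end - window)
        win = qualities[start:end]
        if sum(win) / len(win) >= threshold:
            break
        end -= 1  # drop one base from the 3' end and look again
    return sequence[:end], qualities[:end]

seq, quals = trim_3prime("ACGTACGT", [30, 30, 30, 30, 30, 10, 10, 10])
print(seq, quals)
```

A read whose quality is low throughout is trimmed to length zero, so a length filter is normally applied afterwards.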
Exercise 2: Trim and filter reads
Trim reads from 3’ end based on quality
• Select the file h1-hESC_RNAseq.fastq and the tool Utilities / Trim reads for several criteria. Set parameters so that 3’ ends are trimmed when the mean quality in a 3 base window is less than 20.
• Inspect the log file and figure out how many reads were discarded.
Select the file trimmed.fastq and run the tool Quality control / Read quality with PRINSEQ.
• Select the result file reads-stats.html and visualization method Open in external web browser. Does the quality look better? How is the read length distribution now?
Filter out reads that are shorter than 50 bases.
• Select trimmed.fastq and the tool Filtering / Filter reads for length. How many reads get discarded?
Aligning (=mapping) reads to reference
Splice-aware aligners
• TopHat (uses Bowtie)
• GSNAP
• STAR
• RUM
• MapSplice
Bowtie2 concepts
End-to-end alignment mode
• Read is aligned over its entire length
• Maximum alignment score = 0, deduct penalty for each mismatch (less for low quality base), N, gap opening and gap extension
Local alignment mode
• Read ends don’t need to align, if this maximizes the alignment score
• Add bonus to alignment score for each match
Differences to Bowtie1: Can handle indels, Ns and pairs where one mate doesn’t align
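The scoring idea can be sketched as follows. The penalty values are assumed to be the Bowtie2 defaults (options --mp 6,2, --np 1, --rdg/--rfg 5,3 and --ma 2 in local mode) and should be checked against the Bowtie2 manual for the version in use; the function names are illustrative.

```python
import math

# Assumed Bowtie2 default scoring parameters (verify against the manual).
MISMATCH_MAX, MISMATCH_MIN = 6, 2
N_PENALTY = 1
GAP_OPEN, GAP_EXTEND = 5, 3
LOCAL_MATCH_BONUS = 2

def mismatch_penalty(q):
    """Mismatch penalty is smaller for low-quality bases (Phred score q)."""
    return MISMATCH_MIN + math.floor((MISMATCH_MAX - MISMATCH_MIN) * min(q, 40) / 40)

def gap_penalty(length):
    """A gap of `length` bases costs one opening plus `length` extensions."""
    return GAP_OPEN + length * GAP_EXTEND

def end_to_end_score(mismatch_quals, n_positions=0, gap_lengths=()):
    """Start from the maximum score 0 and deduct all penalties."""
    score = 0
    score -= sum(mismatch_penalty(q) for q in mismatch_quals)
    score -= n_positions * N_PENALTY
    score -= sum(gap_penalty(g) for g in gap_lengths)
    return score

def local_score(n_matches, mismatch_quals, n_positions=0, gap_lengths=()):
    """Local mode adds a bonus per matching base on top of the same penalties."""
    return n_matches * LOCAL_MATCH_BONUS + end_to_end_score(mismatch_quals, n_positions, gap_lengths)

# Two mismatches (qualities 40 and 10), one N and one gap of length 2.
print(end_to_end_score([40, 10], n_positions=1, gap_lengths=[2]))
print(local_score(50, [40]))
```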
TopHat2 workflow
Mapping quality
Confidence in read’s point of origin
Depends on
• length of alignment
• number of mismatches and gaps
• uniqueness of the aligned region in the genome
Expressed in Phred scores, like base qualities
• Q = -10 * log10 (probability that read was mapped to a wrong location)
• Marked in numbers
Exercise 3: Align reads to reference genome
Select accepted.fastq and run the tool Alignment / Bowtie2 for single end reads.
• how many alignments are there and how many of them have mapping quality higher than 1 (use the tool Utilities / Count alignments in BAM)
Visualize alignment in genome browser
• Select the BAM file and visualization method Genome browser
• Select the human genome and set the location to gene CNN2 and click Go.
• Untick the box “reads” and set the coverage scale to 1000.
• Is the coverage even along the transcript? Can you find SNPs?
Quantitation
Software for counting aligned reads per genomic features (genes/exons/transcripts)
• HTSeq
• Cuffdiff
• BEDTools
HTSeq count
Given a BAM file and a list of genomic features, counts how many reads map to each feature.
• For RNA-seq the features are typically genes, where each gene is considered as the union of all its exons.
• Also exons can be considered as features, e.g., in order to check for alternative splicing.
Features need to be supplied in GTF file
• Note that the chromosome naming has to be the same as in BAM
3 modes to handle reads which overlap several genes
• Union
• Intersection-strict
• Intersection-nonempty
HTSeq count modes
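The three modes can be sketched as set operations. This is an illustration, not HTSeq's own code: it assumes each aligned position of a read has a set of overlapping gene IDs, and that a read is counted only when the combined set contains exactly one gene. The labels "no_feature" and "ambiguous" and the function name are illustrative.

```python
def assign_read(feature_sets, mode="union"):
    """feature_sets: one set of gene IDs per aligned position of the read."""
    if mode == "union":
        genes = set().union(*feature_sets)
    elif mode == "intersection-strict":
        genes = set.intersection(*feature_sets) if feature_sets else set()
    elif mode == "intersection-nonempty":
        non_empty = [s for s in feature_sets if s]
        genes = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if len(genes) == 1:
        return next(iter(genes))
    return "no_feature" if not genes else "ambiguous"

modes = ["union", "intersection-strict", "intersection-nonempty"]
partly_outside = [{"geneA"}, {"geneA"}, set()]  # read extends beyond geneA
nested_overlap = [{"geneA"}, {"geneA", "geneB"}]  # part of the read also overlaps geneB

for name, read in [("partly_outside", partly_outside), ("nested_overlap", nested_overlap)]:
    print(name, [assign_read(read, m) for m in modes])
```

The modes differ only in how strictly they treat positions that overlap no gene or more than one gene.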
Exercise 4: Count reads per gene
Select the BAM file and run RNA-seq / Map aligned reads to genes with HTSeq so that the minimum alignment quality is 1.
Select htseq-counts.tsv and run Filtering / Filter table by column value.
• how many genes have more than 10 counts?
Differential expression analysis
Things to take into account
Normalization is required in order to compare expression between samples
• Different library sizes
• RNA composition bias caused by sampling approach
Model has to account for overdispersion in biological replicates -> negative binomial distribution
Raw counts are needed to assess measurement precision
Multiple testing problem
Software packages for DE analysis
edgeR
DESeq
DEXSeq
Cuffdiff
BaySeq
SAMseq
NOIseq
Limma + voom, limma + vst
...
Comments from comparisons
“Methods based on negative binomial modeling have improved specificity and sensitivities as well as good control of false positive errors”
“Cuffdiff performance has reduced sensitivity and specificity. We postulate that the source of this is related to the normalization procedure that attempts to account for both alternative isoform expression and length of transcripts”
Differential expression analysis: Normalization
Normalization
For comparing gene expression within sample, normalize for
• Gene length
• Gene GC content
For comparing gene expression between samples, normalize for
• Library size (number of reads obtained)
• RNA composition effect
“FPKM and TC are ineffective and should be definitely abandoned in the context of differential analysis”
“In the presence of high count genes, only DESeq and TMM (edgeR) are able to maintain a reasonable false positive rate without any loss of power”
RPKM and FPKM
Reads/fragments per kilobase per million mapped reads
• Example: 2000 bp transcript has 600 counts, library size is 30 million reads -> RPKM = (600/2) / 30 = 10
Normalizes for gene length and library size
Can be used only for reporting expression values, not for testing differential expression
• Raw counts are needed to assess the measurement precision correctly
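The worked example above corresponds to the following small calculation; the function name is illustrative.

```python
def rpkm(read_count, transcript_length_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = library_size / 1_000_000
    return read_count / kilobases / millions

# 2000 bp transcript, 600 counts, 30 million reads in the library
print(rpkm(600, 2000, 30_000_000))
```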
Normalization by edgeR and DESeq
Aim to make normalized counts for non-differentially expressed genes similar between samples
• Do not aim to adjust count distributions between samples
Assume that
• Most genes are not differentially expressed
• Differentially expressed genes are divided equally between up- and down-regulation
Do not transform data, but use normalization factors within statistical testing
Normalization by edgeR and DESeq – how?
DESeq
• Take geometric mean of gene’s counts across all samples
• Divide gene’s counts in a sample by the geometric mean
• Take median of these ratios -> sample’s normalization factor (applied to read counts)
edgeR
• Select as reference the sample whose upper quartile is closest to the mean upper quartile
• Log ratio of gene’s counts in sample vs reference -> M value
• Take weighted trimmed mean of M-values (TMM) -> normalization factor (applied to library sizes)
  • Trim: Exclude genes with high counts or large differences in expression
  • Weights are from the delta method on binomial data
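The three DESeq steps can be sketched in plain Python as below. This is an illustration of the median-of-ratios idea, not the DESeq implementation; it assumes that genes with a zero count in any sample are left out because their geometric mean is zero. The count matrix is hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of genes, each a list of raw counts per sample."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # geometric mean would be zero, skip this gene
        log_geo_mean = sum(math.log(c) for c in row) / len(row)
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Median ratio per sample, converted back from log scale.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # excluded: zero count in sample 1
]
factors = size_factors(counts)
print(factors)
```

Dividing each sample's counts by its factor puts the non-differentially expressed genes on a common scale.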
Filtering
Filter out genes which have little chance of showing significant evidence for differential expression
• E.g. genes which are not expressed in either experimental group
Reduces the severity of multiple testing adjustment
Should be independent of experimental setup
• do not use information on what group the sample belongs to
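A group-independent filter of this kind can be sketched as follows. It keeps genes that are expressed in at least a given number of samples, without looking at group labels; the thresholds, function name and counts are assumptions for illustration.

```python
def filter_genes(counts, min_count=1, min_samples=3):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    Group membership is deliberately not used, so the filter is independent
    of the experimental setup.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(1 for c in row if c >= min_count) >= min_samples
    }

# Hypothetical counts for 7 samples
counts = {
    "gene1": [12, 8, 15, 0, 3, 7, 9],
    "gene2": [0, 0, 1, 0, 0, 2, 0],
    "gene3": [0, 0, 0, 0, 0, 0, 0],
}
kept = filter_genes(counts)
print(sorted(kept))
```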
Differential expression analysis: Dispersion estimation
Dispersion
Dispersion = (BCV)^2
• BCV = gene’s biological coefficient of variation
• E.g. if gene’s expression typically differs from replicate to replicate by 20%, this gene’s dispersion is 0.2^2 = 0.04
Note that the variance seen in counts is a sum of 2 things:
• Sample-to-sample variation (dispersion)
• Uncertainty in measuring expression by counting reads
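The two variance components can be written out for the negative binomial model, assuming the usual parameterization variance = mean + dispersion * mean^2, where the first term is the counting (Poisson) uncertainty and the second is the biological variation. The function names and example values are illustrative.

```python
import math

def nb_variance(mean, dispersion):
    """Negative binomial variance: counting noise plus biological variation."""
    return mean + dispersion * mean ** 2

def total_cv(mean, bcv):
    """Total coefficient of variation, combining technical (1/mean) and biological (BCV^2) parts."""
    return math.sqrt(1 / mean + bcv ** 2)

bcv = 0.2
dispersion = bcv ** 2  # 0.04, as in the example above

for mean in (10, 100, 10_000):
    print(mean, nb_variance(mean, dispersion), round(total_cv(mean, bcv), 3))
```

For highly expressed genes the counting noise becomes negligible and the total CV approaches the BCV, while for low counts the counting noise dominates.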
Dispersion estimation by edgeR and DESeq
DESeq
• Models the observed mean-variance relationship for the genes using either parametric or local regression
• User can choose to use the fitted values always, or only when they are higher than the genewise value
edgeR
• Estimates common dispersion for all genes using a conditional maximum likelihood approach
• Trended dispersion: takes binned common dispersion and abundance, and fits a curve through these binned values
• Tagwise dispersion: uses empirical Bayes strategy to shrink gene-wise dispersions towards the common one using a weighted likelihood approach -> genes that are consistent between replicates are ranked more highly
Data exploration using MDS plot
edgeR outputs multidimensional scaling (MDS) plot which shows the relative similarities between samples
Allows you to see if replicates are consistent and if you can expect to find differentially expressed genes
Distances correspond to the biological coefficient of variation between each pair of samples
• Calculated using 500 most heterogeneous genes (that have largest tagwise dispersion treating all libraries as one group)
MDS plot by edgeR
Differential expression analysis: Statistical testing
Statistical testing by DESeq and edgeR
Two group comparisons
• Exact test for negative binomial distribution
Multifactor experiments
• Generalized linear model (GLM) likelihood ratio test
• GLM = extension of linear models to non-normally distributed response data
Exercise 5: Analyze differential expression with edgeR classic
Select File / Open session and the file pasilla.zip. Inspect the phenodata.
• This is a two-group comparison with 7 samples. Some samples were sequenced single end, some paired end.
Select the file counts.tsv and run the tool RNA-seq / Differential expression using edgeR.
• do the samples separate according to group or something else in the MDS plot?
• how many differentially expressed genes do you get?
• is common dispersion a good approximation of genewise dispersions, judged by the dispersion plot?
• how big a fold change is required for DE genes (see MA plot)?
• what does the p-value distribution look like in the p-value plot?
Exercise 6: Include filtering in edgeR classic
Repeat the run so that you analyze only genes which are expressed in at least 3 samples.
• how does the number of DE genes change and why?
• did the p-value distribution change?
• did the MDS plot change?
Exercise 7: Include read type info in edgeR
Select the file counts.tsv and run the tool RNA-seq / Differential expression using edgeR for multivariate experiments so that you
• analyze only genes which are expressed in at least 3 samples and
• incorporate the readtype info into the analysis: Add readtype column as main effect 2.
Filter the file edger-glm.tsv using Filtering / Filter table by column value (column = PValue.group, cutoff = 0.05)
• how many genes were analyzed and how many are DE?