Differential Gene Expression Analysis using RNA-Seq Data

Post on 31-May-2020

Page 1: Differential Gene Expression Analysis using RNA-Seq Dataalumni.cs.ucr.edu/~elenah/courses/CSCI693/Lecture3.pdf · Workflow of RNA-Seq Differential Gene Expression Analysis Adapted

Differential Gene Expression Analysis using RNA-Seq Data

RNA-Seq Data

1. Biologists collect mRNA from many cells

2. Cells come from two or more biological samples (different tissues)

RNA-Seq Data Generation

3. Collected mRNA is fragmented, size-selected, and sequenced

4. Sequenced reads are mapped back to the reference genome

Mapping RNA-Seq Reads

• Mapping a read to a reference genome means finding the position in the genome from which the read originates

• Reads containing splice junctions cannot be mapped to a reference genome directly

• Ways to map reads with splice junctions:

– Use special algorithms/methods/techniques

– Map the reads to annotated transcriptome
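To make the splice-junction problem concrete, here is a minimal sketch. All sequences are invented for illustration: a toy gene with two exons and one intron, and a read that spans the exon-exon junction. The read has no exact match in the genome because the intron interrupts it, but it does match the transcript, where the intron has been spliced out.

```python
# Hypothetical toy sequences, chosen only for illustration.
exon1 = "ATGGCCTTAG"
intron = "GTAAGTCCCCCTTTTCAG"
exon2 = "CATTGACGGA"

genome = exon1 + intron + exon2   # genomic DNA still contains the intron
transcript = exon1 + exon2        # mature mRNA: intron removed

# A read covering the last 4 bases of exon 1 and the first 4 bases of exon 2.
junction_read = exon1[-4:] + exon2[:4]

def maps_exactly(reference: str, read: str) -> bool:
    """Return True if the read occurs as an exact substring of the reference."""
    return read in reference

print(junction_read)                            # TTAGCATT
print(maps_exactly(genome, junction_read))      # False
print(maps_exactly(transcript, junction_read))  # True
```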

Mapping RNA-Seq Reads

• Example: the read TCAAG occurs at position 10 in the given reference genome (if position count starts with 1)
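The slide's reference sequence is not preserved in this transcript, so the sketch below uses a made-up reference in which TCAAG happens to start at position 10. It shows exact-match mapping with 1-based positions; real read mappers use indexed data structures and tolerate mismatches, which this sketch does not attempt.

```python
def find_read_positions(reference: str, read: str) -> list[int]:
    """Return all 1-based start positions where the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start + 1)               # convert 0-based index to 1-based
        start = reference.find(read, start + 1)   # continue search, allowing overlaps
    return positions

# Hypothetical reference, constructed so that TCAAG occurs once, at position 10.
reference = "GCTAGGCTATCAAGCCGT"
print(find_read_positions(reference, "TCAAG"))    # [10]
```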

Mapping RNA-Seq Reads

• Since mRNA is collected from many cells:

– reads cover the entire lengths of exons

– overlapping reads come from different mRNA molecules

Mapping concerns:

1. A read that maps to two or more locations in the reference genome is called ambiguous and is discarded from the analysis

2. Two reads are called copy-duplicates if they are mapped to the same start position in the genome (these might be the product of the polymerase chain reaction, PCR, which is used to make copies of mRNA segments so that sequencing is possible). Only one read from each set of copy-duplicates is used in the analysis

• The number of reads mapped to a single gene/transcript/exon, called the read count, is used to estimate differential gene expression

• Given two (or more) samples, find the read count for one sample and for the other sample, and use statistics to infer whether these counts are significantly different

Mapping RNA-Seq Reads

• Estimating a read count for a transcript of a gene is not trivial:

– Alternative splicing (if a read is mapped to an exon shared by two or more transcripts, we cannot be certain which transcript the read comes from)

– Overlapping genes (uncertainty in counting a read that maps to a region belonging to two or more overlapping genes)

Mapping RNA-Seq Reads

• Estimating a read count for a transcript of a gene is not trivial

• To remedy:

– Estimate the read count for each gene or exon instead

– Use reads containing splice junctions

– In some cases, discard the read from the analysis
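A minimal sketch of two of these remedies, counting reads per gene and discarding reads that fall in a region shared by overlapping genes. The gene names, coordinates, and the function `count_reads_per_gene` are hypothetical. Real counting tools read annotation files and handle strands and exon structure, which this sketch ignores.

```python
def count_reads_per_gene(genes, reads):
    """Count reads per gene; reads overlapping more than one gene are discarded.

    genes: dict mapping gene name -> (start, end), inclusive coordinates
    reads: list of (start, end) intervals of mapped reads, inclusive
    Returns (counts, discarded).
    """
    counts = {name: 0 for name in genes}
    discarded = 0
    for read_start, read_end in reads:
        overlapping = [
            name
            for name, (gene_start, gene_end) in genes.items()
            if read_start <= gene_end and read_end >= gene_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif len(overlapping) > 1:
            discarded += 1        # read lies in a region shared by several genes
        # reads overlapping no gene are simply not counted
    return counts, discarded

# Hypothetical annotation: geneA and geneB overlap on positions 450..500.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 170), (460, 510), (600, 650), (1000, 1050)]
print(count_reads_per_gene(genes, reads))
# ({'geneA': 1, 'geneB': 1}, 1)
```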

Workflow of RNA-Seq Differential Gene Expression Analysis

Adapted from “RNA-seq Data Analysis: A Practical Approach” by Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong (Chapman & Hall/CRC Mathematical and Computational Biology)

Preprocessing

• Adapter trimming

• Trimming of low-quality read ends (3’ end)
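The two preprocessing steps can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the adapter is removed only on an exact match, and quality trimming drops bases from the 3' end until one meets a threshold. The adapter string, the threshold of 20, and the function names are assumptions made for the example.

```python
def trim_adapter(read, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx], quals[:idx]
    return read, quals

def trim_low_quality_3prime(read, quals, min_quality=20):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(read)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return read[:end], quals[:end]

# Hypothetical read: 8 bases of insert followed by 10 bases of adapter.
adapter = "AGATCGGAAG"
read = "ACGTACGT" + adapter
quals = [30, 32, 35, 31, 28, 15, 12, 8] + [30] * 10   # Phred scores per base

read, quals = trim_adapter(read, quals, adapter)
read, quals = trim_low_quality_3prime(read, quals)
print(read, quals)   # ACGTA [30, 32, 35, 31, 28]
```

Adapter trimming runs first here so that high-quality adapter bases at the 3' end do not shield the low-quality insert bases from the second step.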

Adapter Trimming

More on adapter trimming:
http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data
http://training.bioinformatics.ucdavis.edu/docs/2013/02/bootcamp/galaxy/_downloads/qa-and-i.pdf

TOOLS: FastQC, Cutadapt, qrqc, Scythe

Adapter Trimming

Quality Control: Nucleotide Profile

TOOLS: FastQC, qrqc

Quality Control: Base Quality Profile

Quality Control: k-mer Enrichment

Quality Control: Read Length Distribution after Trimming

Quality Control: Statistics

Mapping RNA-Seq Reads

• Mapping to a reference genome

• Mapping to transcriptome

• Gene annotation information (start/end of exons in known genes)

Mapping RNA-Seq Reads

1. Genes are located on both strands of DNA

2. Reads are always sequenced from 5’ to 3’

3. Mapping is performed against the (+) strand of DNA only

4. To map a read that originates from the (–) strand, map its reverse complement: read ATTGC, reverse complement (rc) GCAAT

Figure sequence: GCAATCTGGC
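The reverse-complement step can be sketched as below. The sketch assumes that the ten bases shown on the slide (GCAATCTGGC) stand for the (+) strand of the reference. The helper names are hypothetical and matching is exact only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def map_to_plus_strand(reference: str, read: str):
    """Try the read as-is, then its reverse complement, against the (+) strand.

    Returns (strand, 1-based position) for the first exact match, or None.
    """
    pos = reference.find(read)
    if pos != -1:
        return "+", pos + 1
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return "-", pos + 1
    return None

reference = "GCAATCTGGC"   # assumed (+) strand, taken from the slide's figure text
print(reverse_complement("ATTGC"))               # GCAAT
print(map_to_plus_strand(reference, "ATTGC"))    # ('-', 1)
```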

Mapping RNA-Seq Reads

Ambiguous Reads (identify and discard)

A read that is mapped with the same (smallest) number of mismatches to two or more locations in the genome

A read that is mapped to both + (positive) and – (negative) strands with the same smallest number of mismatches
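A minimal sketch of this rule, assuming each candidate alignment of a read is given as a hypothetical tuple (position, strand, mismatches). A read is ambiguous when two or more alignments tie for the smallest number of mismatches, whether the tie is between locations or between strands.

```python
def classify_read(hits):
    """Classify a read as 'unique', 'ambiguous', or 'unmapped'.

    hits: list of (position, strand, mismatches) tuples for one read.
    """
    if not hits:
        return "unmapped"
    best = min(mismatches for _, _, mismatches in hits)
    best_hits = [hit for hit in hits if hit[2] == best]
    return "unique" if len(best_hits) == 1 else "ambiguous"

# One clearly best hit: unique, even though a worse hit exists elsewhere.
print(classify_read([(100, "+", 0), (5000, "+", 2)]))   # unique
# Two hits tie at one mismatch, on different strands: ambiguous.
print(classify_read([(100, "+", 1), (5000, "-", 1)]))   # ambiguous
```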

Mapping RNA-Seq Reads

[Figure: examples of an ambiguous read mapping and a unique read mapping]

Mapping RNA-Seq Reads

• Sequencing instruments require a certain quantity of mRNA

• Polymerase chain reaction (PCR) produces multiple copies of mRNA segments

• Copies of the same segment are sequenced, producing copy duplicates (a product of PCR, unrelated to the mRNA abundance in the biological sample)

Mapping RNA-Seq Reads

• Two reads are called copy-duplicates if they are mapped to the same start position in the genome (identify and count only one read)

• Copy duplicates can be generated only from the same sample
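Removing copy duplicates as defined above can be sketched like this. The record layout (read id, chromosome, start) is an assumption, and including the chromosome in the key is an added detail so that equal start coordinates on different chromosomes are not confused. Since copy duplicates arise only within a sample, the function is meant to be applied to each sample separately.

```python
def remove_copy_duplicates(alignments):
    """Keep the first read seen at each (chromosome, start) and drop the rest.

    alignments: list of (read_id, chromosome, start) for ONE sample.
    """
    seen = set()
    kept = []
    for read_id, chromosome, start in alignments:
        key = (chromosome, start)
        if key in seen:
            continue              # copy duplicate: same start position
        seen.add(key)
        kept.append((read_id, chromosome, start))
    return kept

# Hypothetical alignments from a single sample.
sample = [("r1", "chr1", 100), ("r2", "chr1", 100), ("r3", "chr1", 250), ("r4", "chr2", 100)]
print(remove_copy_duplicates(sample))
# [('r1', 'chr1', 100), ('r3', 'chr1', 250), ('r4', 'chr2', 100)]
```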

Mapping RNA-Seq Reads

• Collect mapping statistics:

– Total reads that were attempted for mapping

– Total unique reads mapped

– Total ambiguous reads mapped

– Total copy duplicates

– Distribution of reads by mismatches/indels

– Total reads mapped to splice junctions

– GC bias in mapped reads

– Depth of coverage

– 3’ end gene bias (more reads mapped to the 3’ end)
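Two of these statistics are simple enough to sketch directly: the GC fraction of a set of reads and the per-position depth of coverage over a region. The input formats and function names are hypothetical, and read starts are 0-based within the region. A mapping pipeline would normally derive these from its alignment files.

```python
def gc_fraction(reads):
    """Fraction of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    gc = sum(read.count("G") + read.count("C") for read in reads)
    return gc / total if total else 0.0

def depth_of_coverage(region_length, alignments):
    """Number of reads covering each position of a region.

    alignments: list of (start, read_length), with 0-based start in the region.
    """
    depth = [0] * region_length
    for start, read_length in alignments:
        for pos in range(start, min(start + read_length, region_length)):
            depth[pos] += 1
    return depth

print(gc_fraction(["GGCC", "AATT"]))             # 0.5
print(depth_of_coverage(6, [(0, 3), (2, 3)]))    # [1, 1, 2, 1, 1, 0]
```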

Counting Reads: HTSeq

Normalization

• Raw read count has to be normalized to enable comparison between samples

• RPKM: Reads Per Kilobase per Million mapped reads

• RPKM is the total number of raw reads mapped to a gene, divided by the length of the gene in kilobases and by the total number of mapped reads in millions

• Sometimes mappable length is used (since ambiguous reads are discarded, repeated regions within genes are not covered by reads)
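The RPKM definition above translates directly into a short function. The numbers in the example are made up. The second call shows why the normalization matters: a gene twice as long with twice as many reads gets the same RPKM. To use mappable length instead, pass it in place of the full gene length.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    read_count: raw reads mapped to the gene
    gene_length_bp: gene length (or mappable length) in base pairs
    total_mapped_reads: total number of mapped reads in the sample
    """
    length_kb = gene_length_bp / 1_000
    total_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * total_millions)

# Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads in the sample.
print(rpkm(500, 2_000, 10_000_000))     # 25.0
# Twice the length and twice the reads give the same normalized value.
print(rpkm(1_000, 4_000, 10_000_000))   # 25.0
```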