BIOINFORMATICS LAB
Episode VII – Differential Expression Analysis
Federico M. Giorgi, PhD
Chiara Cabrelle, TA
Department of Pharmacy and Biotechnology
First Cycle Degree in Genomics
-
2/60
Differential Analysis
Basic question: is A different from B?
Specific question: is A taller than B?
-
3/60
Differential Analysis
Often, you can compare the distributions of values of the same
measurement in two different groups:
[Figure: distributions of the measurement per group, one labeled "males"]
-
4/60
Differential Gene Expression Analysis
In Genomics, a common question is:
is gene A more/less expressed in condition 1 vs. condition 2?
-
5/60
Differential Gene Expression Analysis
In Science, every claim requires a statement on robustness. Here: how strongly can you trust the
difference between Ctrl and Treated cells? You need to attach a p-value to your claim.
[Figure: four Ctrl vs. Treated comparisons, labeled p = super low (0), p = very high (1), p = low (0.0031), p = very low (0.0000016)]
The p-value, i.e. the probability of observing a difference at least as large by chance alone (when there is no real difference), depends on:
• The size of the difference (e.g. the difference of the medians)
• The overlap of the distributions
• The number of samples (for the same difference and spread, more samples give a lower p-value)
• The test used to calculate it
-
6/60
Common Differential Tests
• T-test
– Fast, well known
– It assumes approximately normal distributions (you can test normality with the Shapiro test, in R: shapiro.test())
– in R: t.test()
• Wilcoxon Test
– Rank-transforms the data (like Spearman Correlation)
– It does not require normality
– Less powerful than the T-test. E.g. it cannot reach significance (p < 0.05) when the gene is
measured in 3 vs. 3 samples or fewer: the lowest possible two-sided p-value for 3 vs. 3 is 0.1
– in R: wilcox.test()
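A minimal sketch of the two tests on one gene, assuming simulated expression values (the vectors normal and tumor below are made up for illustration and are not the course dataset):

```r
# Simulated log-expression of one gene in two groups (hypothetical values)
set.seed(1)
normal <- rnorm(20, mean = 10, sd = 1)
tumor  <- rnorm(20, mean = 12, sd = 1)

# Normality check: a high Shapiro p-value gives no evidence against normality
shapiro_normal <- shapiro.test(normal)$p.value
shapiro_tumor  <- shapiro.test(tumor)$p.value

# Parametric and rank-based comparison of the two groups
p_t <- t.test(tumor, normal)$p.value
p_w <- wilcox.test(tumor, normal)$p.value

# With 3 vs. 3 samples, even perfectly separated groups cannot go below p = 0.1
p_w_small <- wilcox.test(c(1, 2, 3), c(4, 5, 6))$p.value

print(c(t_test = p_t, wilcoxon = p_w, wilcoxon_3vs3 = p_w_small))
```

The last call illustrates why the Wilcoxon test is of little use with very few replicates: the ranks can only be arranged in 20 ways, so the two-sided p-value has a floor of 2/20.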
-
7/60
Real World Examples
# Get the Breast Cancer dataset
load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))
Let's test the difference between normal and tumor samples for a gene
# Get the annotation:
load(url("https://www.dropbox.com/s/eb24au609ejh8v1/tumnorm.rda?dl=1"))
# Now you have two vectors of sample names: tumors and proximal normal
Create two groups: Tumor and Normal
tumormat
-
8/60
The Fold Change and the Direction of the Change
[Figure: Gene Expression in Treatment and Control, with medians 11.5 and 12.3]
Fold Change = Treatment (Median) / Control (Median) = 12.3 / 11.5 = 1.07
Log2FC = Log2(12.3 / 11.5) = 0.097
The Log2FC (“Log2 Fold Change”) describes the Direction and the Strength of the change:
if positive, the gene expression is higher in the treatment
if negative, the gene expression is higher in the control
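The arithmetic of this slide can be checked in a few lines of R. This is only a sketch using the two medians shown on the slide; the variable names are my own:

```r
# Median expression in each group (values from the slide)
treatment_median <- 12.3
control_median   <- 11.5

# Fold change and its log2 transform
fold_change <- treatment_median / control_median
log2fc      <- log2(fold_change)
print(round(c(fold_change = fold_change, log2fc = log2fc), 3))

# The sign gives the direction: swapping the groups flips the sign
log2fc_reversed <- log2(control_median / treatment_median)
```

If the expression values are already on a log2 scale, the usual Log2FC is simply the difference of the two medians; the sketch follows the ratio-based formula of the slide.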
-
9/60
Dataset Reminder
A dataset: the TCGA Breast Cancer RNA-Seq dataset
~1200 patients
~20k genes
[Figure: matrix of Gene Expression Values, genes × patients]
R object:
• An Expression Matrix
• expmat
Fields:
• rownames(expmat): Gene Symbols
• colnames(expmat): TCGA Sample IDs
-
10/60
Exercises!
• Estimate the differential expression Tumor vs. Normal for these genes:
FOXM1, PTEN, PFAS
Using the following operations in R:
– Box Plot
– Log2FC
– T Test
– Wilcoxon Test
• (Hard) Find a way to get a T Test p-value Tumor vs. Normal for all genes
Hint: apply() or for loop
Hint 2: t.test(x,y)$p.value will return only the p-value
Extra: remember that if you run a lot of tests, p-values need to be FDR-corrected with p.adjust()
Final question: find the most differentially expressed gene in Tumor vs. Normal
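One possible way to follow the hints, sketched on a small simulated matrix rather than on expmat (the object names sim, tumor_ids and normal_ids are hypothetical, and the Benjamini-Hochberg method is my choice for the FDR correction):

```r
# Simulated expression matrix: 100 genes x 20 samples (hypothetical data)
set.seed(1)
sim <- matrix(rnorm(100 * 20, mean = 10), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("s", 1:20)))
tumor_ids  <- colnames(sim)[1:10]
normal_ids <- colnames(sim)[11:20]

# Make the first five genes truly higher in the tumor samples
sim[1:5, tumor_ids] <- sim[1:5, tumor_ids] + 4

# One t-test per gene (row), keeping only the p-value
pvals <- apply(sim, 1, function(x) t.test(x[tumor_ids], x[normal_ids])$p.value)

# Benjamini-Hochberg FDR correction for the 100 tests
fdr <- p.adjust(pvals, method = "BH")

# Most differentially expressed gene = lowest p-value
top_gene <- names(which.min(pvals))
print(head(sort(fdr), 5))
```

apply() with margin 1 passes each row as a named vector, so the sample names can be used directly to pick the two groups.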
-
11/60
The MA plot
meanexp
-
12/60
The MA plot (Part 2)
# Color significant points
significant
-
13/60
The MA plot (Part 3)
# Add Gene Names
mygenes
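A sketch of an MA plot on simulated data, since the slide code is cut off in this transcript. The matrices, the 0.05 threshold and the way significant and mygenes are defined here are my assumptions, not the original code:

```r
# Simulated log2 expression: 200 genes, 5 tumor + 5 normal samples (hypothetical)
set.seed(1)
tumor  <- matrix(rnorm(200 * 5, mean = 10), nrow = 200)
normal <- matrix(rnorm(200 * 5, mean = 10), nrow = 200)
rownames(tumor) <- rownames(normal) <- paste0("gene", 1:200)
tumor[1:10, ] <- tumor[1:10, ] + 3   # ten truly upregulated genes

# A = mean expression, M = log2FC (a difference, because data are already log2)
A <- rowMeans(cbind(tumor, normal))
M <- rowMeans(tumor) - rowMeans(normal)
pvals <- sapply(1:200, function(i) t.test(tumor[i, ], normal[i, ])$p.value)
significant <- p.adjust(pvals, method = "BH") < 0.05

# MA plot: grey for all genes, red for significant ones, labels for the top hits
plot(A, M, pch = 20, col = ifelse(significant, "red", "grey"),
     xlab = "Mean expression (A)", ylab = "Log2FC (M)", main = "MA plot")
abline(h = 0, lty = 2)
mygenes <- names(sort(abs(M), decreasing = TRUE))[1:3]
text(A[mygenes], M[mygenes], labels = mygenes, pos = 3, cex = 0.7)
```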
-
14/60
The Volcano Plot
log10p
-
15/60
The Volcano Plot (Part 2)
# Add Gene Names
mygenes
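A volcano plot puts the Log2FC on the x axis and the significance on the y axis. The sketch below uses simulated data and assumes that log10p stands for -log10(p-value), which the truncated slide code does not confirm:

```r
# Simulated log2 expression: 300 genes, 6 treated + 6 untreated samples (hypothetical)
set.seed(2)
trt   <- matrix(rnorm(300 * 6, mean = 8), nrow = 300)
untrt <- matrix(rnorm(300 * 6, mean = 8), nrow = 300)
rownames(trt) <- rownames(untrt) <- paste0("gene", 1:300)
trt[1:10, ]  <- trt[1:10, ] + 3    # upregulated genes
trt[11:20, ] <- trt[11:20, ] - 3   # downregulated genes

# Effect size and significance per gene
log2fc <- rowMeans(trt) - rowMeans(untrt)
pvals  <- sapply(1:300, function(i) t.test(trt[i, ], untrt[i, ])$p.value)
names(pvals) <- rownames(trt)
log10p <- -log10(pvals)

# Volcano plot with a dashed line at p = 0.05 and labels for the five top genes
plot(log2fc, log10p, pch = 20, col = "grey",
     xlab = "Log2FC", ylab = "-log10(p-value)", main = "Volcano plot")
abline(h = -log10(0.05), lty = 2)
mygenes <- names(sort(pvals))[1:5]
text(log2fc[mygenes], log10p[mygenes], labels = mygenes, pos = 3, cex = 0.7)
```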
-
16/60
A small dataset
• Big datasets (>100 samples) are rare in Biology
• Often, Differential Expression is calculated with only a few samples
Example: RNA-Seq experiment on four human airway smooth muscle cell
lines treated with dexamethasone. PMID: 24926665. GEO: GSE52778.
4 replicates treated
4 replicates untreated (control, reference)
load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))
-
17/60
A small dataset
load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))
# Dataset size
dim(rawcounts)
# Annotation
annot
• This dataset comes as RAW COUNTS, a standard raw format: simply, the
number of NGS reads aligned to each gene
-
19/60
Raw Counts
• The raw counts format is easy to spot: expression values are integers
• The easiest normalization method is RPM (Reads per Million mapped
reads), a normalization that makes expression values comparable across
samples
rpms
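A minimal sketch of RPM normalization on a toy count matrix (rawcounts_toy and its values are invented; the library size is approximated here by the column sum of the count matrix):

```r
# Toy raw count matrix: 4 genes x 3 samples (hypothetical values)
rawcounts_toy <- matrix(c(10, 0,  90,  900,
                          20, 5, 175, 1800,
                           5, 1,  44,  450),
                        nrow = 4,
                        dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

# Library size = total reads counted in each sample
libsize <- colSums(rawcounts_toy)

# RPM: divide each column by its library size, then scale to one million
rpms_toy <- sweep(rawcounts_toy, 2, libsize, "/") * 1e6
print(round(rpms_toy))
```

After the normalization every column sums to one million, so a gene with the same share of reads gets the same value regardless of sequencing depth.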
-
20/60
Exercises!
• Calculate the Effects on Gene Expression of dexamethasone, by comparing
treated samples (trt) with untreated (untrt)
– Calculate T-test on RPM-normalized data
• Plot a Volcano Plot
– Significant (p
-
21/60
• T-test and Wilcoxon test are not designed for Genomics datasets
– Often few replicates (in our example, 4 vs. 4)
– Many genes tested (brutal FDR correction)
• So the Gurus of Bioinformatics have invented specific tests for Differential
Gene Expression. There are two main alternatives:
– DESeq2
– edgeR
• Long story short (trust me on this)
– The tests in these tools are more reliable when replicates are few
– Because they estimate each gene's variance (one of the parameters of the test)
by borrowing information from other genes of similar expression
– Assumption: that only a fraction (
-
24/60
DESeq2 Example Run
# Load Dexamethasone dataset on smooth muscle cells
load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))
# Load DESeq2 package (install it from Bioconductor if necessary: BiocManager::install("DESeq2"))
library(DESeq2)
# Read Vignette
browseVignettes("DESeq2")
-
27/60
DESeq2: Annotation Object
# Define Dexamethasone treatment annotation
dex
-
28/60
DESeq2: Run the Analysis
# Generate a DESeq2 object from a Count matrix input
dds
-
31/60
DESeq2 Output Table
View(res)
Columns of the table:
• Gene Symbols (row names)
• Mean Expression (A): baseMean
• Log2FC (M): log2FoldChange
• Log2FC Standard Error: lfcSE
• DESeq2 Statistics: stat
• p-value: pvalue
• FDR-corrected p-value: padj
Check SPARCL1 function: an inflammation protein-coding gene!
-
36/60
edgeR Example Run
# Load Dexamethasone dataset on smooth muscle cells
load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))
# Load edgeR package (install it from Bioconductor if necessary: BiocManager::install("edgeR"))
library(edgeR)
# Read Vignette
browseVignettes("edgeR")
• edgeR is usually FASTER than DESeq2
• edgeR allows for more complex designs (e.g.
treatment at different time points on different
genotypes)
• edgeR is more stringent (higher p-values)
-
37/60
edgeR: Annotation Object (same as DESeq2)
# Define Dexamethasone treatment annotation
dex
-
39/60
edgeR: set up the Experimental Design and Normalize
# Design Structure
design
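The design code is cut off in this transcript. As a sketch of what a two-group design looks like, the snippet below builds one with base R only; dex_toy and design_toy are hypothetical names and the sample order is invented:

```r
# Treatment annotation for 8 samples: 4 untreated, 4 treated (hypothetical order)
dex_toy <- factor(rep(c("untrt", "trt"), each = 4))

# Make the untreated group the reference level, so that positive
# coefficients mean "higher in treated"
dex_toy <- relevel(dex_toy, ref = "untrt")

# Design matrix: an intercept column plus one 0/1 column for treated samples
design_toy <- model.matrix(~ dex_toy)
print(design_toy)
```

Without relevel(), R orders factor levels alphabetically, so "trt" would become the reference and the sign of every logFC would flip.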
-
40/60
edgeR: Run the Analysis
# Run edgeR test
fit
-
41/60
edgeR Example Output
View(res2)
Columns of the table:
• Gene Symbols
• Log Mean Expression (A)
• LogFC (M)
• edgeR Likelihood Ratio Statistics
• p-value
• FDR-corrected p-value
-
42/60
Exercises Part 1! edgeR vs. DESeq2
• Compare the Results of DESeq2 and edgeR
– Scatterplot and Pearson Correlation of logFC (cor.test)
## Get vectors of LogFC
de_fc
-
43/60
Exercises Part 2! edgeR vs. DESeq2
• Compare the Results of DESeq2 and edgeR
– Intersection (p adjusted
-
44/60
Exercises Part 3! edgeR vs. DESeq2
• Compare the Results of DESeq2 and edgeR
– Venn Diagram of similarity (see below)
– Create two Venn Diagrams of significant genes (adjusted p below the chosen threshold): upregulated (logFC > 0) and downregulated (logFC < 0)
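The comparisons asked for in these three exercise slides can be sketched with base R on two made-up result tables. de_toy and ed_toy stand in for the outputs of the two tools, and the 0.05 threshold is my assumption, since the threshold is cut off in the transcript:

```r
# Hypothetical result tables from two tools for the same 200 genes
set.seed(3)
genes  <- paste0("gene", 1:200)
truefc <- rnorm(200)
de_toy <- data.frame(logFC = truefc + rnorm(200, sd = 0.1),
                     padj  = runif(200), row.names = genes)
ed_toy <- data.frame(logFC = truefc + rnorm(200, sd = 0.1),
                     padj  = runif(200), row.names = genes)

# Pearson correlation and scatterplot of the logFC, on genes present in both tables
common <- intersect(rownames(de_toy), rownames(ed_toy))
fc_cor <- cor.test(de_toy[common, "logFC"], ed_toy[common, "logFC"])
plot(de_toy[common, "logFC"], ed_toy[common, "logFC"], pch = 20,
     xlab = "logFC (tool 1)", ylab = "logFC (tool 2)")

# Significant genes in each table, and the three regions of a Venn diagram
sig_de <- rownames(de_toy)[de_toy$padj < 0.05]
sig_ed <- rownames(ed_toy)[ed_toy$padj < 0.05]
venn <- c(only_tool1 = length(setdiff(sig_de, sig_ed)),
          both       = length(intersect(sig_de, sig_ed)),
          only_tool2 = length(setdiff(sig_ed, sig_de)))
print(venn)
```

The three counts are exactly the numbers a two-set Venn diagram displays; adding a logFC > 0 or logFC < 0 filter to sig_de and sig_ed gives the upregulated and downregulated versions.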
-
45/60
Basic Machine Learning
DNA (Genomic Alteration) causes RNA (Δ Gene Expression)
Expression-Based predictors of Genomic Alterations in Cancer
-
46/60
Basic Machine Learning
RNA (Δ Gene Expression) can predict the presence of DNA (Genomic Alteration)
Expression-Based predictors of Genomic Alterations in Cancer
-
47/60
Performance of Several Machine Learning Algorithms
-
48/60
The R caret package
The caret package (short for Classification And REgression Training) is a set
of functions that attempt to streamline the process for creating predictive
models. The package contains tools for:
• data splitting
• pre-processing
• feature selection
• model tuning using resampling
• variable importance estimation
-
49/60
The Machine Learning problem of today
TP53 Mutations (DNA): can they be predicted by Gene Expression Profiles (RNA)?
In this case:
• DNA mutations are considered categorical (yes/no, 1/0)
• RNA Gene Expression Profiles are continuous (and already normalized)
Context: Breast Cancer Human Samples
-
50/60
The Machine Learning Setup - Step 1
• Get Predictive Features (Expression Matrix)
• Get Outcome to Predict (TP53 Mutation Track)
• Make them comparable (use only samples where both were measured,
and keep them in the same order)
### Expression data (the features)
load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))
### Mutation data (what to predict)
load(url("https://www.dropbox.com/s/lju1cy3633rmw1q/tp53mut.rda?dl=1"))
### Colnames are not harmonious
# Consider only the first 15 characters
colnames(expmat)
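A sketch of the harmonization step on toy objects. The sample IDs below are invented TCGA-style barcodes, and expmat_toy and mut_toy are hypothetical stand-ins for the expression matrix and the mutation track:

```r
# Hypothetical TCGA-style IDs: the expression columns carry a longer barcode
# than the names of the mutation vector
expr_ids <- c("TCGA-AA-0001-01A-11R", "TCGA-AA-0002-01A-11R", "TCGA-AA-0003-01A-11R")
expmat_toy <- matrix(1:6, nrow = 2, dimnames = list(c("g1", "g2"), expr_ids))
mut_toy <- c("TCGA-AA-0003-01" = 1, "TCGA-AA-0001-01" = 0, "TCGA-XX-9999-01" = 1)

# Keep only the first 15 characters of the expression sample IDs
colnames(expmat_toy) <- substr(colnames(expmat_toy), 1, 15)

# Samples measured in both objects, used to subset both in the same order
common     <- intersect(colnames(expmat_toy), names(mut_toy))
expmat_toy <- expmat_toy[, common, drop = FALSE]
mut_toy    <- mut_toy[common]
print(common)
```

Indexing both objects with the same vector of names is what guarantees that column i of the matrix and element i of the outcome refer to the same sample.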
-
51/60
The Machine Learning Setup - Step 2
• Brutal Feature Selection to speed up computation: keep only the 1000
genes with the highest variance
• Build data object (data frame with both GEPs and mutations)
• Define what you want to predict (here, the presence of TP53 mutation)
### Keep only the 1000 most variable genes
vars
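The variance filter and the data frame can be sketched as follows, on a small simulated matrix and keeping 5 genes instead of 1000 (all object names and values are hypothetical):

```r
# Hypothetical expression matrix: 50 genes x 12 samples
set.seed(4)
expmat_toy <- matrix(rnorm(50 * 12), nrow = 50,
                     dimnames = list(paste0("gene", 1:50), paste0("s", 1:12)))
expmat_toy[1:5, ] <- expmat_toy[1:5, ] * 5   # five highly variable genes

# Variance of each gene (row), then keep the 5 most variable ones
vars_toy   <- apply(expmat_toy, 1, var)
top        <- names(sort(vars_toy, decreasing = TRUE))[1:5]
expmat_top <- expmat_toy[top, ]

# Data frame for modelling: samples as rows, genes as columns, plus the outcome
tp53_toy <- factor(rep(c("wt", "mut"), times = 6))
df_toy   <- data.frame(t(expmat_top), tp53mut = tp53_toy)
print(dim(df_toy))
```

The matrix is transposed because modelling functions expect one row per sample and one column per feature.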
-
52/60
The Machine Learning Setup - Step 3
• Load the caret package
• Specify which feature to predict and which features are the predictors
### Caret package
library(caret)
# We want to predict the outcome (the mutations), from a set of
# predictors (gene expression data), which we code here
prop.table(table(df[,"tp53mut"]))
outcomeName
-
53/60
The Machine Learning Setup - Step 4
• Select a Machine Learning Algorithm
• Full list at https://rdrr.io/cran/caret/man/models.html
method
-
54/60
The Machine Learning Setup - Step 5
• Set a seed (for reproducibility)
• Divide the dataset into two independent pieces
– 75% Training Set (to tune the parameters)
– 25% Test Set (to test the model)
• The two sets should have the same mutation/wt ratio
# Set a seed to use, to make the CV steps reproducible
set.seed(1)
# Split the data into training and test sets (0.6, 0.8, try different ones?)
splitIndex
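The split code is truncated in this transcript. caret provides createDataPartition() for stratified splits; the sketch below is not that function but a base R version of the same idea, on an invented outcome vector:

```r
# Hypothetical outcome for 100 samples: 30 mutated, 70 wild-type
set.seed(1)
outcome <- factor(c(rep("mut", 30), rep("wt", 70)))

# Stratified 75/25 split: sample 75% of the indices within each class
train_idx <- unlist(lapply(split(seq_along(outcome), outcome),
                           function(idx) sample(idx, size = floor(0.75 * length(idx)))))
test_idx <- setdiff(seq_along(outcome), train_idx)

# Both sets keep (nearly) the same mutation/wt ratio
print(prop.table(table(outcome[train_idx])))
print(prop.table(table(outcome[test_idx])))
```

Sampling inside each class, rather than over all samples at once, is what keeps the mutation/wt ratio stable between the training and test sets.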
-
55/60
The Machine Learning Setup - Step 6
• Train Model Parameters inside the Training Set
– 10-fold Cross-Validation (split the Training Set into 10 pieces: train using 9
pieces, leaving out the 10th; repeat 10 times)
– Alternative (slower): Leave-One-Out (LOO). Keep one sample out and use
the others to predict
# Train Model Parameters (10-fold CV is acceptable)
objControl
-
56/60
The Machine Learning Setup - Step 7
• Model Validation
– We have a model trained on the Training Set
– We test it on the Test Set
– This will yield a vector of predictions for the new samples (mut or wt) or a
vector of probabilities (probability of being mutated)
# Validating the Model
# It will predict using the best model: objModel$finalModel
predictions
-
58/60
The Machine Learning Setup - Step 8
• Performance Evaluation
– Confusion Matrix (How many True Positives, False Positives, False Negatives,
etc.)
– Sensitivity, Specificity
# Confusion matrix
confusionMatrix(predictions,testDF$tp53mut)
[Figure: confusionMatrix() output, with the True Positives, False Negatives, True Negatives and False Positives cells highlighted]
Random models have Specificity, Sensitivity and Accuracy ≈ 0.5
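To see where these numbers come from, the confusion matrix can also be built by hand with table(). This is a sketch on ten invented test samples, with "mut" treated as the positive class (my assumption):

```r
# Hypothetical true labels and predictions for 10 test samples
truth <- factor(c("mut", "mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt", "wt"),
                levels = c("mut", "wt"))
pred  <- factor(c("mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt", "mut", "mut"),
                levels = c("mut", "wt"))

# Confusion matrix: predictions in rows, reference labels in columns
cm <- table(Prediction = pred, Reference = truth)
TP <- cm["mut", "mut"]; FN <- cm["wt", "mut"]
TN <- cm["wt", "wt"];   FP <- cm["mut", "wt"]

# Sensitivity: fraction of mutated samples found; specificity: fraction of wt kept
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / sum(cm)
print(cm)
print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```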
-
59/60
The Machine Learning Setup - Step 9
• Receiver Operating Characteristic: ROC Curve
– Shows Specificity and Sensitivity profiles while varying the decision threshold of the model
– It allows you to choose a different tradeoff (more FN and fewer FP, or vice versa)
# Calculate a ROC curve
library(pROC)
rocCurve
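The pROC call is cut off in this transcript. To show what a ROC curve computes, here is a base R sketch that does not use pROC, on invented probabilities and labels (1 = mutated, 0 = wild-type):

```r
# Hypothetical predicted probabilities of mutation and true labels
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(1,   1,   0,   1,   0,    1,   0,   0)

# Sensitivity and specificity at each threshold, from strictest to loosest
thresholds <- sort(unique(prob), decreasing = TRUE)
sens <- sapply(thresholds, function(th) mean(prob[truth == 1] >= th))
spec <- sapply(thresholds, function(th) mean(prob[truth == 0] <  th))

# AUC as the fraction of (mutated, wild-type) pairs that are ranked correctly
auc <- mean(outer(prob[truth == 1], prob[truth == 0], ">"))

# ROC curve, starting from the corner where nothing is called mutated
plot(c(0, 1 - spec), c(0, sens), type = "s",
     xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC curve")
abline(0, 1, lty = 2)   # diagonal = random model
print(auc)
```

Each threshold gives one point of the curve, which is why the ROC shows the whole range of FN/FP tradeoffs instead of a single confusion matrix.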
-
62/60
Exercises! Machine Learning
• Repeat the Machine Learning TP53 prediction of mutation
– Adding noise to the expression data (with rnorm)
– With a different method (here I used gradient boosting)
• full list at https://rdrr.io/cran/caret/man/models.html
-
63/60
Further Reading...
www.coursera.org
-
www.giorgilab.org
Federico M. Giorgi, PhD
Department of Pharmacy and Biotechnology
to EF, bringer of salted breakfasts
-
65/60
Solutions (Big Dataset T-test)
### Exercise
load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))
load(url("https://www.dropbox.com/s/eb24au609ejh8v1/tumnorm.rda?dl=1"))
tumormat
-
66/60
Solutions (Small Dataset T-test)
## Small Dataset T-test
treated
-
67/60
Solutions (edgeR vs. DESeq2)
## Get vectors of LogFC
de_fc