
BIOINFORMATICS LAB

Episode VII – Differential Expression Analysis

Federico M. Giorgi, PhD

Chiara Cabrelle, TA

Department of Pharmacy and Biotechnology

First Cycle Degree in Genomics

2/60

Differential Analysis

Basic question: is A different from B?

Specific questions: is A taller than B?


3/60

Differential Analysis

Often, you can compare the distributions of values of the same measurement in two different groups.

[Figure: distributions of the measurement in the two groups; one group is labeled "males"]

4/60

Differential Gene Expression Analysis

In Genomics, a common question is:

is gene A more/less expressed in condition 1 vs. condition 2?

5/60

Differential Gene Expression Analysis

In Science, every claim requires a statement on its robustness. Here: how much you can trust the difference between Ctrl and Treated cells. You need to attach a p-value to your claim.

[Figure: four Ctrl vs. Treated comparisons, labeled p=very high (1), p=low (0.0031), p=very low (0.0000016) and p=super low (0)]

The p-value, i.e. the probability of obtaining a difference at least as large by chance alone, depends on:

• The size of the difference (e.g. the difference of the medians)

• The overlap of the distributions

• The number of samples (the more samples, the smaller the p-value for the same difference)

• The test used to calculate it
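To make the list above concrete, here is a minimal sketch on simulated data. The helper `sim_pvalue()` and all numbers are made up for illustration. It shows how the same shift becomes more significant with more samples, and how a larger shift lowers the p-value further.

```r
set.seed(42)  # simulated values only, for reproducibility

# Hypothetical helper: t-test p-value for two simulated groups of size n
# whose means differ by `shift` (standard deviation fixed at 1)
sim_pvalue <- function(n, shift) {
  ctrl    <- rnorm(n, mean = 10, sd = 1)
  treated <- rnorm(n, mean = 10 + shift, sd = 1)
  t.test(treated, ctrl)$p.value
}

p_small_n <- sim_pvalue(n = 5,   shift = 1)   # few samples
p_large_n <- sim_pvalue(n = 200, shift = 1)   # same shift, many samples
p_big_gap <- sim_pvalue(n = 200, shift = 3)   # many samples, larger shift

print(c(small_n = p_small_n, large_n = p_large_n, big_gap = p_big_gap))
```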

6/60

Common Differential Tests

• T-test

– Fast, well known

– It assumes normally distributed data (you can test normality with the Shapiro test, in R: shapiro.test())

– in R: t.test()

• Wilcoxon Test

– Rank-transforms the data (like the Spearman correlation)

– It does not require normality

– Less powerful than the T-test. E.g. it cannot reach p < 0.05 when the gene is measured in 3 vs. 3 samples or fewer (the smallest possible two-sided p-value with 3 vs. 3 is 0.1)

– in R: wilcox.test()
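A small sketch of the two tests on one gene, using made-up values for 3 control and 3 treated samples. It illustrates why the Wilcoxon test cannot reach significance at this sample size even when the groups are fully separated.

```r
# Made-up expression values for one gene, 3 control vs. 3 treated samples
ctrl    <- c(10.1, 10.4, 10.2)
treated <- c(12.0, 12.3, 11.9)

# Normality check (the Shapiro test needs at least 3 values)
shapiro.test(ctrl)$p.value

# T-test: uses the actual values, so a clear shift can be significant
p_t <- t.test(treated, ctrl)$p.value

# Wilcoxon test: uses ranks only; with 3 vs. 3 the most extreme ranking
# (all treated above all controls) still gives p = 2 / choose(6, 3) = 0.1
p_w <- wilcox.test(treated, ctrl)$p.value

print(c(t_test = p_t, wilcoxon = p_w))
```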

7/60

Real World Examples

Let's test the difference between normal and tumor samples for a gene

# Get the Breast Cancer dataset

load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))

# Get the annotation:

load(url("https://www.dropbox.com/s/eb24au609ejh8v1/tumnorm.rda?dl=1"))

# Now you have two vectors of sample names: tumors and proximal normal

Create two groups: Tumor and Normal

tumormat
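The grouping step can be sketched as follows on a small simulated matrix. The names `tumor`, `normal` and `normalmat` are assumptions for illustration; the objects in the downloaded annotation may be named differently.

```r
set.seed(1)

# Simulated stand-in for expmat: 3 genes x 10 samples (values are made up)
expmat <- matrix(rnorm(30, mean = 10), nrow = 3,
                 dimnames = list(c("FOXM1", "PTEN", "PFAS"),
                                 paste0("sample", 1:10)))

# Hypothetical sample-name vectors, standing in for the loaded annotation
tumor  <- paste0("sample", 1:6)
normal <- paste0("sample", 7:10)

# Subset the columns by sample name to get one matrix per group
tumormat  <- expmat[, tumor]
normalmat <- expmat[, normal]

# Compare one gene between the two groups
boxplot(list(Tumor = tumormat["FOXM1", ], Normal = normalmat["FOXM1", ]),
        ylab = "FOXM1 expression")
t.test(tumormat["FOXM1", ], normalmat["FOXM1", ])$p.value
```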

8/60

The Fold Change and the Direction of the Change

[Figure: Gene Expression in Treatment and Control, with medians 11.5 and 12.3]

$$\text{Fold Change} = \frac{\text{Treatment (Median)}}{\text{Control (Median)}} = \frac{12.3}{11.5} = 1.07$$

$$\text{Log2FC} = \log_2\left(\frac{12.3}{11.5}\right) \approx 0.097$$

The Log2FC ("Log2 Fold Change") describes the Direction and the Strength of the change:

• if positive, the gene expression is higher in the treatment

• if negative, the gene expression is higher in the control
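The same arithmetic in R, as a quick check of the numbers on this slide. The variable names are chosen here for illustration.

```r
# Medians from the slide: treatment 12.3, control 11.5
treatment_median <- 12.3
control_median   <- 11.5

fold_change <- treatment_median / control_median
log2fc      <- log2(fold_change)

round(fold_change, 2)  # 1.07
round(log2fc, 3)       # 0.097

# Swapping the two groups flips the sign but keeps the magnitude
log2(control_median / treatment_median)
```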

9/60

Dataset Reminder

A dataset: the TCGA Breast Cancer RNA-Seq dataset

~1200 patients

~20k genes

[Figure: matrix of Gene Expression Values]

R object:

• An Expression Matrix: expmat

Fields:

• rownames(expmat): Gene Symbols

• colnames(expmat): TCGA Sample IDs

10/60

Exercises!

• Estimate the differential expression Tumor vs. Normal for these genes:

FOXM1, PTEN, PFAS

Using the following operations in R:

– Box Plot

– Log2FC

– T Test

– Wilcoxon Test

• (Hard) Find a way to get a T Test p-value Tumor vs. Normal for all genes

Hint: apply() or a for loop

Hint 2: t.test(x,y)$p.value will return only the p-value

Extra: remember that if you run a lot of tests, the p-values need to be FDR-corrected with p.adjust()

Final question: find the most differentially expressed gene in Tumor vs. Normal
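One possible way to approach the hard exercise, sketched on simulated matrices rather than on the real dataset. The planted gene `gene42` and the matrix sizes are made up, and `sapply()` is used here in place of the hinted `apply()` or for loop.

```r
set.seed(7)

# Simulated stand-ins: 200 genes, 20 tumor and 20 normal samples
genes     <- paste0("gene", 1:200)
tumormat  <- matrix(rnorm(200 * 20, mean = 10), nrow = 200,
                    dimnames = list(genes, paste0("T", 1:20)))
normalmat <- matrix(rnorm(200 * 20, mean = 10), nrow = 200,
                    dimnames = list(genes, paste0("N", 1:20)))
# Plant one truly different gene so there is something to find
tumormat["gene42", ] <- tumormat["gene42", ] + 5

# One t-test per gene (row), keeping only the p-value
pvals <- sapply(genes, function(g) {
  t.test(tumormat[g, ], normalmat[g, ])$p.value
})

# Correct for the number of tests (Benjamini-Hochberg FDR)
padj <- p.adjust(pvals, method = "fdr")

# Most differentially expressed gene = smallest adjusted p-value
top_gene <- names(which.min(padj))
top_gene
```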

11/60

The MA plot

meanexp

12/60

The MA plot (Part 2)

# Color significant points

significant

13/60

The MA plot (Part 3)

# Add Gene Names

mygenes
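A minimal sketch of the three MA plot steps (plot, color significant points, add gene names) on simulated per-gene summaries. The variable names `meanexp`, `significant` and `mygenes` follow the slides; the data, the 0.05 threshold and the planted genes are assumptions.

```r
set.seed(3)

# Simulated per-gene summaries (hypothetical stand-ins for real results)
genes   <- paste0("gene", 1:500)
meanexp <- setNames(runif(500, min = 2, max = 14), genes)   # A: mean expression
log2fc  <- setNames(rnorm(500, mean = 0, sd = 0.5), genes)  # M: log2 fold change
padj    <- setNames(runif(500, min = 0.06, max = 1), genes) # adjusted p-values

# Plant three clearly changed genes
hits <- c("gene1", "gene2", "gene3")
log2fc[hits] <- c(3, -2.5, 2)
padj[hits]   <- c(1e-8, 1e-6, 1e-5)

# MA plot: A (mean expression) on x, M (log2FC) on y
plot(meanexp, log2fc, pch = 20, col = "grey",
     xlab = "Mean expression (A)", ylab = "Log2FC (M)", main = "MA plot")
abline(h = 0, lty = 2)

# Color significant points
significant <- names(padj)[padj < 0.05]
points(meanexp[significant], log2fc[significant], pch = 20, col = "red")

# Add gene names for a few genes of interest
mygenes <- hits
text(meanexp[mygenes], log2fc[mygenes], labels = mygenes, pos = 3, cex = 0.8)
```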

14/60

The Volcano Plot

log10p

15/60

The Volcano Plot (Part 2)

# Add Gene Names

mygenes
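A comparable sketch for the volcano plot, again on simulated values. It assumes that `log10p` holds -log10 of the p-value (so that small p-values end up at the top), and it labels the three genes with the smallest p-values.

```r
set.seed(4)

# Simulated per-gene results (made up for illustration)
genes  <- paste0("gene", 1:500)
log2fc <- setNames(rnorm(500, sd = 0.5), genes)
pvals  <- setNames(runif(500, min = 0.01, max = 1), genes)

# Plant three strongly changed genes
hits <- c("gene10", "gene20", "gene30")
log2fc[hits] <- c(3, -3, 2.5)
pvals[hits]  <- c(1e-10, 1e-8, 1e-6)

# Volcano plot: log2FC on x, -log10(p-value) on y
log10p <- -log10(pvals)
plot(log2fc, log10p, pch = 20, col = "grey",
     xlab = "Log2FC", ylab = "-log10(p-value)", main = "Volcano plot")

# Color significant points
significant <- names(pvals)[pvals < 0.05]
points(log2fc[significant], log10p[significant], pch = 20, col = "red")

# Add gene names for the three smallest p-values
mygenes <- names(sort(pvals))[1:3]
text(log2fc[mygenes], log10p[mygenes], labels = mygenes, pos = 3, cex = 0.8)
```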

16/60

A small dataset

• Big datasets (>100 samples) are rare in Biology

• Often, Differential Expression is calculated with a few samples

Example: RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. PMID: 24926665. GEO: GSE52778.

4 replicates treated

4 replicates untreated (control, reference)

load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))

17/60

A small dataset


# Dataset size

dim(rawcounts)

# Annotation

annot

• This dataset comes as RAW COUNTS, a standard raw format: simply, the number of NGS reads aligned to each gene


19/60

Raw Counts

• Raw counts format is easy to spot: expression values are integers

• The easiest normalization method is RPM (Reads per Million mapped reads), a normalization that makes expression values comparable across samples

rpms
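RPM normalization can be sketched like this on a tiny made-up count matrix. The variable `libsize` and the counts are illustrative only; the same two lines should apply to any genes-by-samples matrix of raw counts.

```r
# Made-up raw counts: 4 genes (rows) x 3 samples (columns)
rawcounts <- matrix(c(10L, 200L,  0L,  790L,
                      30L, 500L, 20L, 1450L,
                       5L,  90L,  5L,  400L),
                    nrow = 4,
                    dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Total mapped reads per sample (library size)
libsize <- colSums(rawcounts)

# RPM: divide each column by its library size, then multiply by one million
rpms <- sweep(rawcounts, 2, libsize, "/") * 1e6

colSums(rpms)  # every sample now sums to one million
```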

20/60

Exercises!

• Calculate the Effects on Gene Expression of dexamethasone, by comparing treated samples (trt) with untreated (untrt)

– Calculate T-test on RPM-normalized data

• Plot a Volcano Plot

– Significant (p

21/60

• T-test and Wilcoxon test are not designed for Genomics datasets

– Often few replicates (in our example, 4 vs. 4)

– Many genes tested (brutal FDR correction)

• So the Gurus of Bioinformatics have invented specific tests for Differential

Gene Expression. There are two main alternatives:

– DESeq2

– edgeR

• Long story short (trust me on this)

– The tests of these tools have more statistical power with few replicates

– This is because they estimate each gene's variance (one of the parameters of the test) by sharing information with other genes of similar expression

– Assumption: that only a fraction (

22/60

DESeq2 Example Run

# Load Dexamethasone dataset on smooth muscle cells

load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))

# Load DESeq2 package (install it from Bioconductor if necessary)

library(DESeq2)

# Read Vignette

browseVignettes("DESeq2")
27/60

DESeq2: Annotation Object

# Define Dexamethasone treatment annotation

dex

28/60

DESeq2: Run the Analysis

# Generate a DESeq2 object from a Count matrix input

dds
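A hedged sketch of a typical DESeq2 run, on simulated counts instead of the dexamethasone dataset. The count matrix, the planted `gene1` and the annotation are made up; the DESeq2 calls follow the standard workflow from the package vignette and may differ from the code on the original slide. The `requireNamespace()` guard only keeps the sketch runnable where DESeq2 is not installed.

```r
set.seed(1)

# Simulated raw counts: 300 genes x 8 samples (stand-in for rawcounts)
n_genes    <- 300
gene_means <- 2^runif(n_genes, min = 4, max = 10)
rawcounts  <- matrix(rnbinom(n_genes * 8, mu = gene_means, size = 10),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", 1:n_genes),
                                     paste0("sample", 1:8)))
# Plant one gene that is strongly induced in the treated samples
rawcounts["gene1", 5:8] <- rawcounts["gene1", 5:8] * 10L + 100L

# Annotation: treatment factor, with untreated as the reference level
annot <- data.frame(dex = factor(rep(c("untrt", "trt"), each = 4),
                                 levels = c("untrt", "trt")),
                    row.names = colnames(rawcounts))

if (requireNamespace("DESeq2", quietly = TRUE)) {
  library(DESeq2)
  # Generate a DESeq2 object from the count matrix
  dds <- DESeqDataSetFromMatrix(countData = rawcounts,
                                colData = annot,
                                design = ~ dex)
  # Run the analysis (normalization, dispersion estimation, testing)
  dds <- DESeq(dds)
  # Treated vs. untreated results, sorted by adjusted p-value
  res <- as.data.frame(results(dds, contrast = c("dex", "trt", "untrt")))
  res <- res[order(res$padj), ]
  print(head(res))
}
```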

29/60

DESeq2 Output Table

View(res)

Columns of the output table:

• Gene Symbols

• Mean Expression (A)

• Log2FC (M)

• Log2FC Standard Error

• DESeq2 Statistics

• p-value

• FDR-corrected p-value

Check SPARCL1 function: an inflammation protein-coding gene!

32/60

edgeR Example Run

# Load edgeR package (install it from Bioconductor if necessary)

library(edgeR)

# Read Vignette

browseVignettes("edgeR")

• edgeR is usually FASTER than DESeq2

• edgeR allows for more complex designs (e.g. treatment at different time points on different genotypes)

• edgeR is more stringent (higher p-values)

37/60

edgeR: Annotation Object (same as DESeq2)

# Define Dexamethasone treatment annotation

dex

39/60

edgeR: set up the Experimental Design and Normalize

# Design Structure

design

40/60

edgeR: Run the Analysis

# Run edgeR test

fit
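A hedged sketch of a typical edgeR likelihood-ratio workflow, on simulated counts. The data and the planted `gene1` are made up; the sequence of calls (`DGEList`, `calcNormFactors`, `estimateDisp`, `glmFit`, `glmLRT`, `topTags`) is the standard one from the edgeR User's Guide and is only an assumption about what the original slides ran. The guard keeps the sketch runnable where edgeR is not installed.

```r
set.seed(2)

# Simulated raw counts: 300 genes x 8 samples (stand-in for rawcounts)
n_genes    <- 300
gene_means <- 2^runif(n_genes, min = 4, max = 10)
rawcounts  <- matrix(rnbinom(n_genes * 8, mu = gene_means, size = 10),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", 1:n_genes),
                                     paste0("sample", 1:8)))
rawcounts["gene1", 5:8] <- rawcounts["gene1", 5:8] * 10L + 100L

# Define the treatment annotation, untreated as the reference level
dex <- factor(rep(c("untrt", "trt"), each = 4), levels = c("untrt", "trt"))

# Design structure: intercept plus treatment effect
design <- model.matrix(~ dex)

if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  y   <- DGEList(counts = rawcounts, group = dex)
  y   <- calcNormFactors(y)        # TMM normalization factors
  y   <- estimateDisp(y, design)   # dispersion estimates
  fit <- glmFit(y, design)         # fit the model gene by gene
  lrt <- glmLRT(fit, coef = 2)     # likelihood ratio test on the treatment term
  res2 <- topTags(lrt, n = Inf)$table
  print(head(res2))
}
```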

41/60

edgeR Example Output

View(res2)

Columns of the output table:

• Gene Symbols

• Log Mean Expression (A)

• LogFC (M)

• edgeR Likelihood Ratio Statistics

• p-value

• FDR-corrected p-value

42/60

Exercises Part 1! edgeR vs. DESeq2

• Compare the Results of DESeq2 and edgeR

– Scatterplot and Pearson Correlation of logFC (cor.test)

## Get vectors of LogFC

de_fc

43/60



– Intersection (p adjusted

44/60



– Venn Diagram of similarity (see below)

– Create two Venn Diagrams of the significant genes (by adjusted p-value): upregulated (logFC > 0) and downregulated (logFC < 0)
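The comparison can be sketched on made-up results for six genes. Only `de_fc` is a name from the slides; `ed_fc`, `de_padj`, `ed_padj` and the 0.05 threshold are assumptions. The set sizes computed at the end are the three numbers a two-circle Venn diagram would display.

```r
# Made-up results from two tools for the same six genes
genes   <- paste0("gene", 1:6)
de_fc   <- setNames(c(2.1, -1.8, 0.2, 1.5, -0.1, -2.4), genes)  # tool 1 logFC
ed_fc   <- setNames(c(2.0, -1.9, 0.1, 1.2,  0.1, -2.5), genes)  # tool 2 logFC
de_padj <- setNames(c(0.001, 0.01, 0.80, 0.03, 0.90, 0.002), genes)
ed_padj <- setNames(c(0.002, 0.02, 0.70, 0.08, 0.95, 0.004), genes)

# Scatterplot and Pearson correlation of the logFC values
plot(de_fc, ed_fc, xlab = "logFC (tool 1)", ylab = "logFC (tool 2)")
ct <- cor.test(de_fc, ed_fc)
ct$estimate

# Significant genes in each tool, and their overlap
de_sig  <- names(de_padj)[de_padj < 0.05]
ed_sig  <- names(ed_padj)[ed_padj < 0.05]
both    <- intersect(de_sig, ed_sig)
only_de <- setdiff(de_sig, ed_sig)
only_ed <- setdiff(ed_sig, de_sig)
c(both = length(both), only_de = length(only_de), only_ed = length(only_ed))
```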

45/60

Basic Machine Learning

Expression-Based predictors of Genomic Alterations in Cancer

DNA (Genomic Alteration) causes RNA (Δ Gene Expression)

46/60

Basic Machine Learning

Expression-Based predictors of Genomic Alterations in Cancer

RNA (Δ Gene Expression) can predict the presence of DNA (Genomic Alteration)

47/60

Performance of Several Machine Learning Algorithms

48/60

The R caret package

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

• data splitting

• pre-processing

• feature selection

• model tuning using resampling

• variable importance estimation

49/60

The Machine Learning problem of today

TP53 Mutations (DNA): can they be predicted by Gene Expression Profiles (RNA)?

In this case:

• DNA mutations are considered categorical (yes/no, 1/0)

• RNA Gene Expression Profiles are continuous (and already normalized)

Context: Breast Cancer Human Samples

50/60

The Machine Learning Setup - Step 1

• Get Predictive Features (Expression Matrix)

• Get Outcome to Predict (TP53 Mutation Track)

• Make them comparable (use only samples where both were measured, and keep them in the same order)

### Expression data (the features)

### Mutation data (what to predict)

load(url("https://www.dropbox.com/s/lju1cy3633rmw1q/tp53mut.rda?dl=1"))

### Colnames are not harmonious

# Consider only the first 15 characters

colnames(expmat)
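The harmonization step can be sketched on made-up objects. It assumes the mutation track is a named vector keyed by 15-character sample IDs; the real `tp53mut` object may be structured differently, and the sample IDs below are invented.

```r
# Made-up stand-in: expression matrix with long TCGA-style sample IDs
expmat <- matrix(1:12, nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("TCGA-AA-0001-01A-11R", "TCGA-AA-0002-01A-21R",
                                   "TCGA-AA-0003-01B-11R", "TCGA-AA-0004-01A-12R")))

# Hypothetical mutation track (1 = mutated, 0 = wild type), named by short ID,
# in a different order and covering a partly different set of samples
tp53mut <- c("TCGA-AA-0003-01" = 1, "TCGA-AA-0001-01" = 0,
             "TCGA-AA-0009-01" = 1, "TCGA-AA-0002-01" = 0)

# Consider only the first 15 characters of the sample names
colnames(expmat) <- substr(colnames(expmat), 1, 15)

# Keep only samples present in both objects, in the same order
common  <- intersect(colnames(expmat), names(tp53mut))
expmat  <- expmat[, common]
tp53mut <- tp53mut[common]

identical(colnames(expmat), names(tp53mut))
```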

51/60

• Brutal Feature Selection to speed up computation: keep only the 1000

genes with the highest variance

• Build data object (data frame with both GEPs and mutations)

• Define what you want to predict (here, the presence of TP53 mutation)


### Keep only the 1000 most variable genes

vars
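A minimal sketch of the variance filter and of the data frame construction, on simulated data. It keeps 10 genes instead of 1000 so the example stays small; `topgenes` and `expmat_top` are illustrative names, while `vars`, `df` and the `tp53mut` column follow the slides.

```r
set.seed(5)

# Simulated stand-ins: 50 genes x 30 samples; the first 10 genes are noisier
expmat <- matrix(rnorm(50 * 30), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("s", 1:30)))
expmat[1:10, ] <- expmat[1:10, ] * 5
tp53mut <- setNames(rep(c(0, 1), each = 15), colnames(expmat))

# Variance of each gene across samples
vars <- apply(expmat, 1, var)

# Keep the most variable genes (10 here; the slides keep 1000)
topgenes   <- names(sort(vars, decreasing = TRUE))[1:10]
expmat_top <- expmat[topgenes, ]

# Build a data frame: one row per sample, genes as columns, plus the outcome
df <- data.frame(t(expmat_top), tp53mut = factor(tp53mut))
dim(df)
```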

52/60

• Load the caret package

• Specify which feature to predict and which features are the predictors


### Caret package

library(caret)

# We want to predict the outcome (the mutations), from a set of

# predictors (gene expression data), which we code here

prop.table(table(df[,"tp53mut"]))

outcomeName

53/60

• Select a Machine Learning Algorithm

• Full list at https://rdrr.io/cran/caret/man/models.html


method

54/60

• Set a seed (for reproducibility)

• Divide the dataset into two independent pieces

– 75% Training Set (to tune the parameters)

– 25% Test Set (to test the model)

• The two sets should have the same mutation/wt ratio


# Set a seed to use, to make the CV steps reproducible

set.seed(1)

# Split the data into training and test sets (0.6, 0.8, try different ones?)

splitIndex
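The idea of a split that preserves the mutation/wt ratio can be sketched in base R as below. This is a stand-in written for illustration, not the code from the slide; caret provides `createDataPartition()` for the same purpose. The outcome vector is made up.

```r
set.seed(1)

# Made-up outcome: 120 samples, 40 mutated and 80 wild type
outcome <- factor(rep(c("mut", "wt"), times = c(40, 80)))

# Stratified split: sample 75% of the indices within each class,
# so training and test sets keep the same mut/wt ratio
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))
test_idx <- setdiff(seq_along(outcome), train_idx)

prop.table(table(outcome[train_idx]))
prop.table(table(outcome[test_idx]))
```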

55/60

• Train Model Parameters inside the Training Set

– 10-fold Cross-Validation (split the Training Set in 10 pieces: train using 9

pieces, leaving out the 10th, repeat it 10 times)

– Alternative (slower): Leave-One-Out (LOO). Keep one sample out and use

the others to predict


# Train Model Parameters (10-fold CV is acceptable)

objControl
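To show what 10-fold cross-validation does under the hood, here is a base R sketch on simulated data. It deliberately swaps in a plain logistic regression (`glm`) for the gradient boosting model used in the slides, and the data are made up. In caret, the equivalent setup is normally requested with `trainControl(method = "cv", number = 10)`.

```r
set.seed(1)

# Simulated training set: 100 samples, one informative feature
n <- 100
y <- rep(c(0, 1), each = n / 2)
x <- rnorm(n, mean = ifelse(y == 1, 1, 0))   # mutated samples shifted by +1
train_df <- data.frame(x = x, y = y)

# Assign every sample to one of 10 folds at random
k    <- 10
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other 9 folds, predict the held-out fold
acc <- sapply(1:k, function(i) {
  fit  <- glm(y ~ x, data = train_df[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = train_df[fold == i, ], type = "response")
  mean((prob > 0.5) == (train_df$y[fold == i] == 1))
})

mean(acc)  # cross-validated accuracy
```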

56/60

• Model Validation

– We have a model trained on the Training Set

– We test it on the Test Set

– This will yield a vector of predictions for the new samples (mut or wt) or a

vector of probabilities (chance to be mutated)


# Validating the Model

# It will predict using the best model: objModel$finalModel

predictions

58/60

• Performance Evaluation

– Confusion Matrix (How many True Positives, False Positives, False Negatives,

etc.)

– Sensitivity, Specificity


# Confusion matrix

confusionMatrix(predictions,testDF$tp53mut)

[Figure: confusion matrix with its four cells labeled True Positives, False Negatives, True Negatives and False Positives]

Random models have Spec, Sens and Acc = 0.5
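The quantities in the confusion matrix can be computed by hand in base R, which helps when reading the output of `confusionMatrix()`. The labels below are made up, with "mut" treated as the positive class.

```r
# Made-up true labels and predictions for 10 test samples
truth       <- factor(c("mut", "mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt", "wt"),
                      levels = c("mut", "wt"))
predictions <- factor(c("mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt", "mut", "mut"),
                      levels = c("mut", "wt"))

# Confusion matrix: rows are predictions, columns are the truth
cm <- table(Predicted = predictions, Actual = truth)
cm

TP <- cm["mut", "mut"]; FN <- cm["wt", "mut"]
TN <- cm["wt", "wt"];   FP <- cm["mut", "wt"]

sensitivity <- TP / (TP + FN)        # fraction of mutated samples found
specificity <- TN / (TN + FP)        # fraction of wild-type samples found
accuracy    <- (TP + TN) / sum(cm)   # fraction of correct calls overall

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```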

59/60

• Receiver Operating Characteristic: ROC Curve

– Shows Specificity and Sensitivity profiles while varying the decision threshold of the model

– It lets you pick a different tradeoff (more FN and fewer FP, or vice versa)


# Calculate a ROC curve

library(pROC)

rocCurve
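The slides use the pROC package. As an illustration of what a ROC curve contains, the sketch below computes it by hand in base R from made-up predicted probabilities. The AUC is obtained as the fraction of (mutated, wild-type) pairs in which the mutated sample gets the higher score.

```r
# Made-up predicted probabilities of being mutated, with the true labels
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)   # 1 = mut, 0 = wt
prob  <- c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05)

# Sensitivity and specificity at each candidate threshold
thresholds <- sort(unique(c(0, prob, 1)), decreasing = TRUE)
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.numeric(prob >= th)
  c(threshold   = th,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}))

# ROC curve: sensitivity against 1 - specificity
plot(1 - roc_points[, "specificity"], roc_points[, "sensitivity"], type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC curve")
abline(0, 1, lty = 2)  # the diagonal corresponds to a random model

# AUC: probability that a mutated sample scores above a wild-type one
pos <- prob[truth == 1]
neg <- prob[truth == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
```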

62/60

Exercises! Machine Learning

• Repeat the Machine Learning TP53 prediction of mutation

– Adding noise to the expression data (with rnorm)

– With a different method (here I used gradient boost modelling)

• full list at https://rdrr.io/cran/caret/man/models.html


63/60

Further Reading...

www.coursera.org

www.giorgilab.org

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology


to EF, bringer of salted breakfasts

65/60

Solutions (Big Dataset T-test)

### Exercise



tumormat

66/60

Solutions (Small Dataset T-test)

## Small Dataset T-test

treated

67/60

Solutions (edgeR vs. DESeq2)

## Get vectors of LogFC

de_fc
