BIOINFORMATICS LAB Episode VII – Differential Expression Analysis Federico M. Giorgi, PhD Chiara Cabrelle, TA Department of Pharmacy and Biotechnology First Cycle Degree in Genomics


Post on 04-Feb-2021


  • BIOINFORMATICS LAB

    Episode VII – Differential

    Expression Analysis

    Federico M. Giorgi, PhD

    Chiara Cabrelle, TA

    Department of Pharmacy and Biotechnology

    First Cycle Degree in Genomics

  • 2/60

    Differential Analysis

    Basic question: is A different from B?

    Specific questions: is A taller than B?


  • 3/60

    Differential Analysis

    Often, you can compare the distributions of values of the same measurement in two different groups:

    males

  • 4/60

    Differential Gene Expression Analysis

    In Genomics, a common question is:

    is gene A more/less expressed in condition 1 vs. condition 2?

  • 5/60

    Differential Gene Expression Analysis

    In Science, every claim requires a statement on robustness. Here: how strongly can you trust the difference between Ctrl and Treated cells? You need to add a p-value to your claim.

    p = super low (0)

    p = very low (0.0000016)

    p = low (0.0031)

    p = very high (1)

    The p-value, i.e. the probability of obtaining a difference at least as large by chance alone, depends on:

    • The size of the difference (difference of the medians)

    • The overlap of the distributions

    • The number of samples (the more samples, the more significant the same difference becomes)

    • The test used to calculate it

  • 6/60

    Common Differential Tests

    • T-test

    – Fast, well known

    – It assumes normally distributed data (you can test normality with the Shapiro-Wilk test, shapiro.test() in R)

    – in R: t.test()

    • Wilcoxon Test

    – Rank-transforms the data (like Spearman Correlation)

    – It does not require normality

    – Less powerful than the T-test. E.g. it cannot reach significance at p < 0.05 when the gene is measured in 3 vs. 3 samples or fewer (the smallest possible two-sided p-value for 3 vs. 3 is 0.1)

    – in R: wilcox.test()
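
    To make the comparison between the two tests concrete, here is a minimal sketch on simulated values (the numbers and the variable names ctrl and treated are made up for illustration, not taken from the lecture data). It also shows the 3 vs. 3 limit of the Wilcoxon test.

```r
# Simulated expression of one gene in two groups (made-up values)
set.seed(1)
ctrl    <- rnorm(20, mean = 10, sd = 1)
treated <- rnorm(20, mean = 12, sd = 1)

# Normality check: a high p-value means no evidence against normality
shapiro_p <- shapiro.test(ctrl)$p.value

# Compare the two groups with both tests
t_p <- t.test(treated, ctrl)$p.value
w_p <- wilcox.test(treated, ctrl)$p.value

# With 3 vs. 3 samples, even perfectly separated groups give p = 0.1:
# this is the smallest two-sided p-value the exact Wilcoxon test can reach
small_p <- wilcox.test(c(10, 11, 12), c(1, 2, 3))$p.value
small_p
```

    With only three values per group there are just 20 possible rank arrangements, which is why the rank-based test cannot go below 0.1 however large the difference is.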

  • 7/60

    Real World Examples

    # Get the Breast Cancer dataset

    load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))

    Let's test the difference between normal and tumor samples for a gene

    # Get the annotation:

    load(url("https://www.dropbox.com/s/eb24au609ejh8v1/tumnorm.rda?dl=1"))

    # Now you have two vectors of sample names: tumors and proximal normal

    Create two groups: Tumor and Normal

    tumormat

  • 8/60

    The Fold Change and the Direction of the Change

    Gene Expression (Median): Control = 11.5, Treatment = 12.3

    $\text{Fold Change} = \dfrac{\text{Treatment (Median)}}{\text{Control (Median)}} = \dfrac{12.3}{11.5} \approx 1.07$

    $\text{Log2FC} = \log_2\left(\dfrac{12.3}{11.5}\right) \approx 0.097$

    The Log2FC (“Log2 Fold Change”) describes the Direction and the Strength of the change:

    if positive, the gene expression is higher in the treatment

    if negative, the gene expression is higher in the control
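
    The arithmetic on this slide can be reproduced in a few lines of R. This is only a sketch using the two median values shown above; the variable names are chosen here for illustration.

```r
# Median expression values from the slide
control   <- 11.5
treatment <- 12.3

# Fold change and its log2 transform
fc     <- treatment / control
log2fc <- log2(fc)

round(fc, 2)      # 1.07
round(log2fc, 3)  # 0.097

# Swapping the two groups keeps the strength and flips the sign
log2fc_reversed <- log2(control / treatment)
```

    The log2 scale is symmetric: a change of the same strength in the opposite direction gives the same number with the opposite sign.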

  • 9/60

    Dataset Reminder

    A dataset: the TCGA Breast Cancer RNA-Seq dataset

    ~1200 patients

    ~20k genes

    Gene Expression Values

    R object:

    • An Expression Matrix

    • expmat

    Fields:

    • rownames(expmat): Gene Symbols

    • colnames(expmat): TCGA Sample IDs

  • 10/60

    Exercises!

    • Estimate the differential expression Tumor vs. Normal for these genes:

    FOXM1, PTEN, PFAS

    Using the following operations in R:

    – Box Plot

    – Log2FC

    – T Test

    – Wilcoxon Test

    • (Hard) Find a way to get a T Test p-value Tumor vs. Normal for all genes

    Hint: apply() or for loop

    Hint 2: t.test(x,y)$p.value will return only the p-value

    Extra: remember that if you run a lot of tests, p-values need to be FDR-corrected, e.g. with p.adjust(p, method="fdr")

    Final question: find the most differentially expressed gene in Tumor vs. Normal
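
    One possible shape for the "hard" exercise, sketched on simulated matrices rather than on the TCGA data. The object normalmat, the matrix sizes and the spiked-in genes are assumptions made for this example.

```r
# Simulated stand-ins for the tumor and normal expression matrices
set.seed(1)
n_genes <- 200
genes   <- paste0("gene", seq_len(n_genes))
tumormat  <- matrix(rnorm(n_genes * 10, mean = 8), nrow = n_genes,
                    dimnames = list(genes, NULL))
normalmat <- matrix(rnorm(n_genes * 10, mean = 8), nrow = n_genes,
                    dimnames = list(genes, NULL))
tumormat[1:10, ] <- tumormat[1:10, ] + 4   # 10 genes with a real difference

# One t-test per gene, keeping only the p-value
pvals <- sapply(genes, function(g) {
  t.test(tumormat[g, ], normalmat[g, ])$p.value
})

# FDR correction (Benjamini-Hochberg); the default of p.adjust() is "holm"
padj <- p.adjust(pvals, method = "fdr")

# Most significant gene and number of genes passing FDR < 0.05
top_gene <- names(which.min(padj))
n_sig    <- sum(padj < 0.05)
```

    A for loop filling a pre-allocated vector would give the same result; sapply() is used here only because it returns a named vector directly.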

  • 11/60

    The MA plot

    meanexp

  • 12/60

    The MA plot (Part 2)

    # Color significant points

    significant

  • 13/60

    The MA plot (Part 3)

    # Add Gene Names

    mygenes
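
    The code on the MA plot slides is cut off in this transcript. As a rough sketch of the idea, assuming simulated values for the mean expression (A), the log2 fold change (M) and the adjusted p-values, an MA plot can be drawn like this:

```r
# Simulated per-gene statistics (made-up, not the lecture results)
set.seed(1)
n <- 500
meanexp <- runif(n, min = 2, max = 14)   # A: mean log2 expression
log2fc  <- rnorm(n, sd = 0.5)            # M: log2 fold change
padj    <- runif(n)                      # stand-in for FDR-corrected p-values

# Flag significant genes
significant <- padj < 0.05

# MA plot: A on the x axis, M on the y axis
plot(meanexp, log2fc, pch = 20, col = "grey",
     xlab = "Mean expression (A)", ylab = "Log2 fold change (M)",
     main = "MA plot")

# Color significant points and mark the no-change line
points(meanexp[significant], log2fc[significant], pch = 20, col = "red")
abline(h = 0, lty = 2)
```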

  • 14/60

    The Volcano Plot

    log10p

  • 15/60

    The Volcano Plot (Part 2)

    # Add Gene Names

    mygenes
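
    The volcano plot code is also truncated here. A minimal sketch, assuming simulated fold changes and p-values and an illustrative choice of thresholds, plots the log2 fold change against -log10 of the p-value:

```r
# Simulated per-gene statistics (made-up values)
set.seed(1)
n <- 500
log2fc <- rnorm(n)
pvals  <- 10^(-abs(log2fc) * runif(n, min = 0, max = 4))

# Transform p-values so that small p-values end up at the top
log10p <- -log10(pvals)

# Illustrative thresholds: p < 0.05 and at least a 2-fold change
significant <- pvals < 0.05 & abs(log2fc) > 1

plot(log2fc, log10p, pch = 20, col = "grey",
     xlab = "Log2 fold change", ylab = "-log10(p-value)",
     main = "Volcano plot")
points(log2fc[significant], log10p[significant], pch = 20, col = "red")
abline(h = -log10(0.05), v = c(-1, 1), lty = 2)
```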

  • 16/60

    A small dataset

    • Big datasets (>100 samples) are rare in Biology

    • Often, Differential Expression is calculated with a few samples

    Example: RNA-Seq experiment on four human airway smooth muscle cell

    lines treated with dexamethasone. PMID: 24926665. GEO: GSE52778.

    4 replicates treated

    4 replicates untreated (control, reference)

    load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))

  • 17/60

    A small dataset

    load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))

    # Dataset size

    dim(rawcounts)

    # Annotation

    annot

    • This dataset comes as RAW COUNTS, a standard raw format: simply the number of NGS reads aligned to each gene

  • 19/60

    Raw Counts

    • The raw counts format is easy to spot: expression values are integers

    • The easiest normalization method is RPM (Reads Per Million mapped reads), a normalization that makes expression values comparable across samples

    rpms
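
    The RPM code is truncated in this transcript. A minimal sketch of the normalization, assuming a small simulated count matrix (genes in rows, samples in columns), divides every column by its total and multiplies by one million:

```r
# Simulated raw counts: 5 genes x 4 samples (made-up values)
set.seed(1)
rawcounts <- matrix(rpois(5 * 4, lambda = 100), nrow = 5,
                    dimnames = list(paste0("gene", 1:5),
                                    paste0("sample", 1:4)))

# Reads per million: divide each column by its library size, times 1e6
libsize <- colSums(rawcounts)
rpms <- sweep(rawcounts, 2, libsize, "/") * 1e6

# After normalization every sample sums to one million
colSums(rpms)
```

    Because each sample is scaled by its own sequencing depth, a gene with the same share of reads in two samples gets the same RPM value even if one sample was sequenced more deeply.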

  • 20/60

    Exercises!

    • Calculate the Effects of dexamethasone on Gene Expression, by comparing treated samples (trt) with untreated samples (untrt)

    – Calculate T-test on RPM-normalized data

    • Plot a Volcano Plot

    – Significant (p

  • 21/60

    • T-test and Wilcoxon test are not designed for Genomics datasets

    – Often few replicates (in our example, 4 vs. 4)

    – Many genes tested (brutal FDR correction)

    • So the Gurus of Bioinformatics have invented specific tests for Differential

    Gene Expression. There are two main alternatives:

    – DESeq2

    – edgeR

    • Long story short (trust me on this)

    – The tests of these tools are more powerful when there are few replicates

    – Because they estimate each gene's variance (one of the parameters of the test) by sharing information with other genes of similar expression

    – Assumption: that only a fraction (

  • 24/60

    DESeq2 Example Run

    # Load Dexamethasone dataset on smooth muscle cells

    load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))

    # Load DESeq2 package (if necessary, install it from Bioconductor with BiocManager::install("DESeq2"))

    library(DESeq2)

    # Read Vignette

    browseVignettes("DESeq2")

  • 27/60

    DESeq2: Annotation Object

    # Define Dexamethasone treatment annotation

    dex

  • 28/60

    DESeq2: Run the Analysis

    # Generate a DESeq2 object from a Count matrix input

    dds
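
    The annotation and analysis code is cut off in this transcript. The sketch below shows a standard DESeq2 run on simulated counts; it is not the lecture's exact code. The simulated data come from DESeq2's own makeExampleDESeqDataSet(), and the column name dex and the levels untrt/trt are assumptions modelled on the slides.

```r
library(DESeq2)

# Simulated stand-in for the raw count matrix (not the lecture data)
set.seed(1)
sim <- makeExampleDESeqDataSet(n = 1000, m = 8)
rawcounts <- counts(sim)

# Annotation: 4 untreated and 4 treated samples, untreated as reference
dex <- data.frame(
  dex = factor(rep(c("untrt", "trt"), each = 4), levels = c("untrt", "trt")),
  row.names = colnames(rawcounts)
)

# Generate a DESeq2 object from a count matrix input
dds <- DESeqDataSetFromMatrix(countData = rawcounts, colData = dex,
                              design = ~ dex)

# Run the analysis and extract the results table
dds <- DESeq(dds)
res <- as.data.frame(results(dds))

# Genes sorted by FDR-corrected p-value
head(res[order(res$padj), ])
```

    Setting the factor levels explicitly matters: the first level is used as the reference, so a positive log2 fold change means higher expression in the treated samples.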

  • 29/60

    DESeq2 Output Table

    View(res)

  • 30/60

    DESeq2 Output Table

    View(res)

    Column annotations: Gene Symbols | Mean Expression (A) | Log2FC (M) | Log2FC Standard Error | DESeq2 Statistics | p-value | FDR-corrected p-value

  • 31/60

    DESeq2 Output Table

    Check the function of SPARCL1: a protein-coding gene involved in inflammation!

  • 34/60

    edgeR Example Run

    # Load Dexamethasone dataset on smooth muscle cells

    load(url("https://www.dropbox.com/s/hkbo6ih2je1hpoc/muscle.rda?dl=1"))

    # Load edgeR package (if necessary, install it from Bioconductor with BiocManager::install("edgeR"))

    library(edgeR)

    # Read Vignette

    browseVignettes("edgeR")

  • 36/60

    edgeR Example Run

    • edgeR is usually FASTER than DESeq2

    • edgeR allows for more complex designs (e.g. treatment at different time points on different genotypes)

    • edgeR is more stringent (higher p-values)

  • 37/60

    edgeR: Annotation Object (same as DESeq2)

    # Define Dexamethasone treatment annotation

    dex

  • 38/60

    edgeR: set up the Experimental Design

    # Design Structure

    design

  • 39/60

    edgeR: set up the Experimental Design and Normalize

    # Design Structure

    design

  • 40/60

    edgeR: Run the Analysis

    # Run edgeR test

    fit
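
    The edgeR code on these slides is truncated as well. The following sketch shows one common edgeR workflow (design matrix, normalization, dispersion estimation, likelihood ratio test) on simulated counts. The count matrix and the untrt/trt labels are assumptions for illustration, and the lecture's own code may differ in its details.

```r
library(edgeR)

# Simulated raw counts: 1000 genes x 8 samples (not the lecture data)
set.seed(1)
rawcounts <- matrix(rnbinom(1000 * 8, mu = 100, size = 5), nrow = 1000,
                    dimnames = list(paste0("gene", 1:1000),
                                    paste0("sample", 1:8)))

# Annotation: untreated samples are the reference level
dex <- factor(rep(c("untrt", "trt"), each = 4), levels = c("untrt", "trt"))

# Design structure
design <- model.matrix(~ dex)

# Build the edgeR object, normalize, and estimate dispersions
y <- DGEList(counts = rawcounts, group = dex)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)

# Run the edgeR likelihood ratio test on the treatment coefficient
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)

# Full results table, sorted by p-value
res2 <- topTags(lrt, n = Inf)$table
head(res2)
```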

  • 41/60

    edgeR Example Output

    View(res2)

    Column annotations: Gene Symbols | Log Mean Expression (A) | LogFC (M) | edgeR Likelihood Ratio Statistics | p-value | FDR-corrected p-value

  • 42/60

    Exercises Part 1! edgeR vs. DESeq2

    • Compare the Results of DESeq2 and edgeR

    – Scatterplot and Pearson Correlation of logFC (cor.test)

    ## Get vectors of LogFC

    de_fc
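
    The comparison code is cut off after de_fc. As a sketch of the first exercise, assuming two named vectors of log fold changes (simulated here; ed_fc is a hypothetical name for the edgeR vector), the scatterplot and Pearson correlation could look like this:

```r
# Simulated log fold changes from the two tools (made-up values)
set.seed(1)
genes <- paste0("gene", 1:300)
de_fc <- setNames(rnorm(300), genes)      # stand-in for DESeq2 log2FC
ed_fc <- de_fc + rnorm(300, sd = 0.2)     # stand-in for edgeR logFC

# Compare only genes present in both results, in the same order
common <- intersect(names(de_fc), names(ed_fc))

plot(de_fc[common], ed_fc[common], pch = 20,
     xlab = "DESeq2 log2FC", ylab = "edgeR logFC")

# Pearson correlation with its p-value
ct <- cor.test(de_fc[common], ed_fc[common], method = "pearson")
ct$estimate
```

    Matching by gene name before correlating matters because the two result tables are usually sorted differently.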

  • 43/60

    Exercises Part 2! edgeR vs. DESeq2

    • Compare the Results of DESeq2 and edgeR

    – Intersection (p adjusted

  • 44/60

    Exercises Part 3! edgeR vs. DESeq2

    • Compare the Results of DESeq2 and edgeR

    – Venn Diagram of similarity (see below)

    – Create two Venn Diagrams: significant genes (adjusted p0) and downregulated (logFC

  • 45/60

    Basic Machine Learning

    DNA (Genomic Alteration) causes RNA (Δ Gene Expression)

    Expression-Based predictors of Genomic Alterations in Cancer

  • 46/60

    Basic Machine Learning

    RNA (Δ Gene Expression) can predict the presence of DNA (Genomic Alteration)

    Expression-Based predictors of Genomic Alterations in Cancer

  • 47/60

    Performance of Several Machine Learning Algorithms

  • 48/60

    The R caret package

    The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

    • data splitting

    • pre-processing

    • feature selection

    • model tuning using resampling

    • variable importance estimation

  • 49/60

    The Machine Learning problem of today

    Can TP53 Mutations (DNA) be predicted by Gene Expression Profiles (RNA)?

    In this case:

    • DNA mutations are considered categorical (yes/no, 1/0)

    • RNA Gene Expression Profiles are continuous (and already normalized)

    Context: Breast Cancer Human Samples

  • 50/60

    The Machine Learning Setup - Step 1

    • Get Predictive Features (Expression Matrix)

    • Get Outcome to Predict (TP53 Mutation Track)

    • Make them comparable (use only samples where both were measured, and keep them in the same order)

    ### Expression data (the features)

    load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))

    ### Mutation data (what to predict)

    load(url("https://www.dropbox.com/s/lju1cy3633rmw1q/tp53mut.rda?dl=1"))

    ### Colnames are not harmonious

    # Consider only the first 15 characters

    colnames(expmat)
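
    The harmonization code is truncated here. A minimal sketch of the idea follows, using made-up TCGA-like sample IDs and assuming that tp53mut is a named 0/1 vector (the real object's structure is not shown in the transcript):

```r
# Made-up expression matrix with long TCGA-like barcodes as column names
set.seed(1)
expmat <- matrix(rnorm(3 * 4), nrow = 3,
                 dimnames = list(c("FOXM1", "PTEN", "PFAS"),
                                 c("TCGA-AA-0001-01A-11R",
                                   "TCGA-AA-0002-01A-11R",
                                   "TCGA-AA-0003-01A-11R",
                                   "TCGA-AA-0004-01A-11R")))

# Made-up mutation track, named with 15-character sample IDs
tp53mut <- c("TCGA-AA-0002-01" = 1, "TCGA-AA-0003-01" = 0,
             "TCGA-AA-0009-01" = 1)

# Consider only the first 15 characters of the expression sample names
colnames(expmat) <- substr(colnames(expmat), 1, 15)

# Keep only samples measured in both objects, in the same order
common  <- intersect(colnames(expmat), names(tp53mut))
expmat  <- expmat[, common]
tp53mut <- tp53mut[common]
```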

  • 51/60

    The Machine Learning Setup - Step 2

    • Brutal Feature Selection to speed up computation: keep only the 1000 genes with the highest variance

    • Build data object (data frame with both Gene Expression Profiles and mutations)

    • Define what you want to predict (here, the presence of TP53 mutation)

    ### Keep only the 1000 most variable genes

    vars
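
    The feature selection code is cut off after vars. The sketch below illustrates the same idea on a small simulated matrix, keeping the 5 most variable genes instead of 1000. The matrix, the random mutation labels and the name top are assumptions made for this example.

```r
# Simulated expression matrix: 50 genes x 20 samples (made-up values)
set.seed(1)
expmat <- matrix(rnorm(50 * 20), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("s", 1:20)))
expmat[1:5, ] <- expmat[1:5, ] * 5   # make 5 genes much more variable

# Variance of every gene across samples
vars <- apply(expmat, 1, var)

# Names of the most variable genes (the slides keep 1000; here only 5)
top <- names(sort(vars, decreasing = TRUE))[1:5]

# Data frame with samples in rows: selected genes plus the outcome column
df <- as.data.frame(t(expmat[top, ]))
df$tp53mut <- factor(sample(c("mut", "wt"), nrow(df), replace = TRUE))
dim(df)
```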

  • 52/60

    The Machine Learning Setup - Step 3

    • Load the caret package

    • Specify which feature to predict and which features are the predictors

    ### Caret package

    library(caret)

    # We want to predict the outcome (the mutations) from a set of predictors (gene expression data), which we code here

    prop.table(table(df[,"tp53mut"]))

    outcomeName

  • 53/60

    The Machine Learning Setup - Step 4

    • Select a Machine Learning Algorithm

    • Full list at https://rdrr.io/cran/caret/man/models.html

    method

  • 54/60

    The Machine Learning Setup - Step 5

    • Set a seed (for reproducibility)

    • Divide the dataset into two independent pieces

    – 75% Training Set (to tune the parameters)

    – 25% Test Set (to test the model)

    • The two sets should have the same mutation/wt ratio

    # Set a seed to use, to make the CV steps reproducible

    set.seed(1)

    # Split the data into training and test sets (0.6, 0.8, try different ones?)

    splitIndex
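
    The splitting code is truncated after splitIndex, so the caret call used in the lecture is not visible. As an illustration of the principle, here is a base-R sketch of a stratified 75/25 split on a simulated data frame. It samples within each class so that both sets keep the same mutation/wt ratio. The data frame and the name trainDF are assumptions.

```r
# Simulated data: 60 mutated and 140 wild-type samples (made-up)
set.seed(1)
df <- data.frame(x = rnorm(200),
                 tp53mut = factor(rep(c("mut", "wt"), times = c(60, 140))))

# Sample 75% of the row indices separately within each class
train_idx <- unlist(lapply(
  split(seq_len(nrow(df)), df$tp53mut),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

# Training and test sets do not share any sample
trainDF <- df[train_idx, ]
testDF  <- df[-train_idx, ]

# Both sets keep the original 30% / 70% class ratio
prop.table(table(trainDF$tp53mut))
prop.table(table(testDF$tp53mut))
```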

  • 55/60

    The Machine Learning Setup - Step 6

    • Train Model Parameters inside the Training Set

    – 10-fold Cross-Validation (split the Training Set into 10 pieces: train on 9 pieces, evaluate on the 10th, and repeat 10 times so that each piece is left out once)

    – Alternative (slower): Leave-One-Out (LOO). Keep one sample out and use the others to predict it

    # Train Model Parameters (10-fold CV is acceptable)

    objControl

  • 56/60

    The Machine Learning Setup - Step 7

    • Model Validation

    – We have a model trained on the Training Set

    – We test it on the Test Set

    – This will yield a vector of predictions for the new samples (mut or wt) or a vector of probabilities (chance to be mutated)

    # Validating the Model

    # It will predict using the best model: objModel$finalModel

    predictions

  • 57/60

    The Machine Learning Setup - Step 8

    • Performance Evaluation

    – Confusion Matrix (How many True Positives, False Positives, False Negatives, etc.)

    – Sensitivity, Specificity

    # Confusion matrix

    confusionMatrix(predictions,testDF$tp53mut)

  • 58/60

    The Machine Learning Setup - Step 8

    Confusion matrix cells: True Positives, False Negatives, True Negatives, False Positives

    Random models have Spec, Sens and Acc ≈ 0.5
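
    To show where sensitivity, specificity and accuracy come from, here is a base-R sketch that builds the confusion matrix by hand from two made-up vectors. It does not use caret's confusionMatrix(); "mut" is treated as the positive class.

```r
# Made-up true labels and predictions for 10 samples
truth <- factor(c("mut", "mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt", "wt"),
                levels = c("mut", "wt"))
predictions <- factor(c("mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt", "mut", "mut"),
                      levels = c("mut", "wt"))

# Confusion matrix: predictions in rows, true labels in columns
cm <- table(Predicted = predictions, Actual = truth)

# The four cells
TP <- cm["mut", "mut"]   # predicted mut, truly mut
FN <- cm["wt",  "mut"]   # predicted wt, truly mut
FP <- cm["mut", "wt"]    # predicted mut, truly wt
TN <- cm["wt",  "wt"]    # predicted wt, truly wt

sensitivity <- TP / (TP + FN)       # 3 / 4 = 0.75
specificity <- TN / (TN + FP)       # 4 / 6, about 0.67
accuracy    <- (TP + TN) / sum(cm)  # 7 / 10 = 0.7
```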

  • 59/60

    The Machine Learning Setup - Step 9

    • Receiver Operating Characteristic: ROC Curve

    – Shows Specificity and Sensitivity profiles while varying the classification threshold

    – It allows you to choose a different tradeoff (more FN and fewer FP, or vice versa)

    # Calculate a ROC curve

    library(pROC)

    rocCurve
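
    The pROC call is cut off in this transcript. To illustrate what a ROC curve computes, the sketch below builds one in base R from simulated predicted probabilities. It is a stand-in for the idea, not the pROC code used in the lecture.

```r
# Simulated truth (1 = mutated) and predicted probabilities (made-up)
set.seed(1)
truth <- rep(c(1, 0), each = 50)
prob  <- c(rnorm(50, mean = 0.7, sd = 0.2),   # mutated samples score higher
           rnorm(50, mean = 0.4, sd = 0.2))   # wild-type samples score lower

# Sensitivity and specificity at every possible threshold
thresholds <- sort(unique(c(-Inf, prob, Inf)), decreasing = TRUE)
sens <- sapply(thresholds, function(t) mean(prob[truth == 1] >= t))
spec <- sapply(thresholds, function(t) mean(prob[truth == 0] <  t))

# ROC curve; the diagonal is what a random model would give
plot(1 - spec, sens, type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC curve")
abline(0, 1, lty = 2)

# AUC: probability that a random mutated sample scores above a random wt one
auc <- mean(outer(prob[truth == 1], prob[truth == 0], ">"))
auc
```

    Lowering the threshold calls more samples mutated, which raises sensitivity and lowers specificity; the curve traces this tradeoff from one extreme to the other.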

  • 62/60

    Exercises! Machine Learning

    • Repeat the Machine Learning TP53 prediction of mutation

    – Adding noise to the expression data (with rnorm)

    – With a different method (here I used gradient boosting)

    • full list at https://rdrr.io/cran/caret/man/models.html

  • 63/60

    Further Reading...

    www.coursera.org

  • www.giorgilab.org

    Federico M. Giorgi, PhD

    Department of Pharmacy and Biotechnology


    to EF, bringer of salted breakfasts

  • 65/60

    Solutions (Big Dataset T-test)

    ### Exercise

    load(url("https://www.dropbox.com/s/rp6maq49d8t3via/expmat.rda?dl=1"))

    load(url("https://www.dropbox.com/s/eb24au609ejh8v1/tumnorm.rda?dl=1"))

    tumormat

  • 66/60

    Solutions (Small Dataset T-test)

    ## Small Dataset T-test

    treated

  • 67/60

    Solutions (edgeR vs. DESeq2)

    ## Get vectors of LogFC

    de_fc
