differential methylation analysis simon andrews simon.andrews@babraham.ac.uk @simon_andrews 1

Differential Methylation Analysis

Simon Andrews
simon.andrews@babraham.ac.uk

@simon_andrews

A basic question…

Factors to consider

• Number of observations
• Magnitude of effect
• Technical considerations
• Biological variability
• Biological common sense

The problem of power…

• Ideally want to cover every cytosine (CpG)
• Have to correct for the number of tests
• There’s no way you’ll collect enough data to analyse each C and still have p-values which survive multiple testing correction
• Stats have to find a way to work round this.
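To put rough numbers on the power problem, here is a minimal sketch. It assumes about 28 million CpGs to test (an approximate figure for the human genome, used only for illustration) and a Bonferroni correction; the function names are made up for this example. It compares the corrected threshold with the smallest p-value a Fisher's exact test could ever return at a given read depth, which is the probability of the single most extreme table (fully methylated in one sample, fully unmethylated in the other).

```python
from math import comb

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff after Bonferroni correction."""
    return alpha / n_tests

def smallest_fisher_p(depth_a, depth_b):
    """Probability of the most extreme 2x2 table (A all methylated,
    B all unmethylated) with the margins fixed. No Fisher's exact
    p-value at these depths can be smaller than this."""
    return 1 / comb(depth_a + depth_b, depth_a)

N_CPGS = 28_000_000  # assumption: approximate number of CpGs tested
threshold = bonferroni_threshold(0.05, N_CPGS)

for depth in (5, 10, 20, 30):
    p_min = smallest_fisher_p(depth, depth)
    verdict = "can survive" if p_min < threshold else "cannot survive"
    print(f"{depth}x per sample: best possible p = {p_min:.2e} ({verdict})")
```

Even a complete 100% to 0% switch cannot pass the corrected threshold at 10x coverage per sample, and real changes are far smaller than that.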

Maximising power

• Options
– Analyse in windows
– Pre-filter
– Hierarchical or adaptive filtering

Window sizes

Small windows
• Good resolution
• Specific biological effects
• High multiple testing correction (MTC) burden
• Few observations per window
• High p-values

Large windows
• Lots of data
• High statistical power
• Low MTC burden
• Low p-values
• Effect averaging
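The trade-off can be seen by pooling per-CpG calls into fixed windows of different sizes. This is a minimal sketch with made-up positions and counts; `pool_into_windows` is a hypothetical helper, not a function from any of the packages discussed later.

```python
from collections import defaultdict

def pool_into_windows(cpg_calls, window_size):
    """Sum (methylated, unmethylated) counts of CpGs that fall in the
    same fixed-size window. Keys are window start positions."""
    windows = defaultdict(lambda: [0, 0])
    for position, meth, unmeth in cpg_calls:
        start = (position // window_size) * window_size
        windows[start][0] += meth
        windows[start][1] += unmeth
    return {start: tuple(counts) for start, counts in sorted(windows.items())}

# Hypothetical calls: (position, methylated count, unmethylated count)
cpg_calls = [(105, 3, 1), (180, 2, 2), (460, 0, 4), (1210, 5, 0), (1890, 4, 1)]

small = pool_into_windows(cpg_calls, 200)    # more tests, fewer calls each
large = pool_into_windows(cpg_calls, 1000)   # fewer tests, more calls each
print(len(small), "small windows:", small)
print(len(large), "large windows:", large)
```

In the large windows the unmethylated CpG at position 460 is averaged together with its methylated neighbours, which is the effect-averaging cost of the extra power.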

Simple Statistical Approach

• Is the proportion of methylated calls different between two samples, given the number of observations?

Meth count A   Unmeth count A   Meth count B   Unmeth count B   % change   Significant?
2              0                0              2                100        No
200            2                198            5                1.5        No
100            50               75             60               11         Probably

Contingency tests

• Chi-square / G-test / Fisher’s exact test
– Differ only at low observations
– Significant changes require enough observations that any of these should give the same answer
• Operates on single replicates
• Technical measure of difference

Meth A Unmeth A

Meth B Unmeth B
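As an illustration of how these tests behave on a 2x2 table of methylated and unmethylated counts, here is a minimal standard-library sketch of a Pearson chi-square test (without continuity correction) and a two-sided Fisher's exact test. The function names are illustrative; in practice a statistics library would normally be used.

```python
from math import comb, erfc, sqrt

def chi_square_2x2(meth_a, unmeth_a, meth_b, unmeth_b):
    """Pearson chi-square test on a 2x2 table, no continuity correction.
    Returns (statistic, p-value)."""
    row_a, row_b = meth_a + unmeth_a, meth_b + unmeth_b
    col_m, col_u = meth_a + meth_b, unmeth_a + unmeth_b
    total = row_a + row_b
    if 0 in (row_a, row_b, col_m, col_u):
        return 0.0, 1.0  # no information about a difference
    stat = total * (meth_a * unmeth_b - unmeth_a * meth_b) ** 2 / (
        row_a * row_b * col_m * col_u)
    # Survival function of a chi-square distribution with 1 degree of freedom
    return stat, erfc(sqrt(stat / 2))

def fisher_exact_2x2(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact test: sum the probabilities of all tables
    (margins fixed) that are no more likely than the observed one."""
    row_a = meth_a + unmeth_a
    col_m = meth_a + meth_b
    total = row_a + meth_b + unmeth_b

    def table_probability(k):
        return comb(col_m, k) * comb(total - col_m, row_a - k) / comb(total, row_a)

    observed = table_probability(meth_a)
    lowest = max(0, row_a - (total - col_m))
    highest = min(row_a, col_m)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        prob = table_probability(k)
        if prob <= observed * (1 + 1e-7):  # tolerance for float rounding
            p_value += prob
    return min(1.0, p_value)

# The three example rows from the table of counts earlier in the talk
for table in [(2, 0, 0, 2), (200, 2, 198, 5), (100, 50, 75, 60)]:
    _, p_chi = chi_square_2x2(*table)
    print(table, f"chi-square p = {p_chi:.3g}", f"Fisher p = {fisher_exact_2x2(*table):.3g}")
```

With only two calls per sample the two tests disagree noticeably, because the chi-square approximation is unreliable at such low counts; with hundreds of calls they give similar answers.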

Chi-Square results

Biological considerations

• Minimum relevant effect size?
– Balance power vs change
– What makes biological sense (what would you follow up?)

• Minimum coverage worth testing
– No point testing poorly covered regions
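A pre-filter along these lines might look like the following sketch. The thresholds (at least 10 calls per sample, at least 10 percentage points of difference) are arbitrary values chosen for illustration, and `passes_prefilter` is a hypothetical helper name.

```python
def passes_prefilter(meth_a, unmeth_a, meth_b, unmeth_b,
                     min_coverage=10, min_difference=10.0):
    """Keep a region only if both samples have enough calls and the
    absolute difference in % methylation is large enough to matter.
    The default thresholds are illustrative assumptions."""
    cov_a = meth_a + unmeth_a
    cov_b = meth_b + unmeth_b
    if cov_a < min_coverage or cov_b < min_coverage:
        return False  # poorly covered: not worth testing
    difference = abs(100 * meth_a / cov_a - 100 * meth_b / cov_b)
    return difference >= min_difference

# The three example rows from the table of counts earlier in the talk
examples = [(2, 0, 0, 2), (200, 2, 198, 5), (100, 50, 75, 60)]
to_test = [row for row in examples if passes_prefilter(*row)]
print(f"{len(to_test)} of {len(examples)} regions left to test:", to_test)
```

Only the regions that survive the filter count towards the multiple testing correction, which is where the gain in power comes from.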

Effect of pre-filtering

Distribution of methylation

Chi-square relies on a normal approximation, and methylation data isn’t normally distributed: values tend to pile up near 0% and 100%.

Beta binomial distribution

A beta-binomial model is more relevant than chi-square because it allows for variation between samples as well as sampling noise. The model has to be fitted to the actual data.
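To show what the beta-binomial changes, the sketch below compares the chance of seeing 5 or fewer methylated calls out of 20 under a plain binomial and under a beta-binomial with the same mean of 50%. The shape parameters `alpha = beta = 2` are arbitrary illustrative values, not fitted to any data.

```python
from math import comb, exp, lgamma, log

def log_beta(x, y):
    """Natural log of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated calls out of n) when the true methylation level
    itself varies between samples as Beta(alpha, beta)."""
    return exp(log(comb(n, k))
               + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))

def binomial_pmf(k, n, p):
    """P(k methylated calls out of n) for one fixed methylation level p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 20
alpha, beta = 2.0, 2.0        # assumption: illustrative shape parameters
p = alpha / (alpha + beta)    # both models share this mean (0.5)

tail_binomial = sum(binomial_pmf(k, n, p) for k in range(6))
tail_beta_binomial = sum(beta_binomial_pmf(k, n, alpha, beta) for k in range(6))
print(f"P(<=5 of 20) binomial:      {tail_binomial:.4f}")
print(f"P(<=5 of 20) beta-binomial: {tail_beta_binomial:.4f}")
```

An observation that looks extreme under the binomial is quite ordinary once between-sample variation is allowed for, so tests that ignore that variation will overstate significance.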

Implications of a beta distribution

• Many summaries assume normality
– Mean
– Standard Deviation
– Boxplots

• None of these is strictly appropriate when looking at methylation data

Dealing with replicates

• Simple approach
– Merge data from replicates together
– Single test, high power
– Post-hoc test for consistency

• Explicitly account for batch effects
– Logistic regression
– Measures batch effects and excludes them from final significance calculation

• Work with methylation values
– Normalise percentage methylation values
– Use conventional statistics (t-tests etc.) for comparing groups
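The simple approach (merge, then check consistency afterwards) can be sketched as below. The replicate counts are invented, the helper names are hypothetical, and the consistency rule used here (every replicate of one group lies on the same side of every replicate of the other) is just one possible post-hoc check. Each replicate is assumed to have at least one call.

```python
def pool_replicates(replicates):
    """Add up (methylated, unmethylated) counts across replicates."""
    meth = sum(m for m, _ in replicates)
    unmeth = sum(u for _, u in replicates)
    return meth, unmeth

def methylation_fraction(meth, unmeth):
    """Fraction of calls that are methylated (assumes coverage > 0)."""
    return meth / (meth + unmeth)

def replicates_consistent(group_a, group_b):
    """True when all replicates of A are above all of B, or all below."""
    frac_a = [methylation_fraction(*rep) for rep in group_a]
    frac_b = [methylation_fraction(*rep) for rep in group_b]
    return min(frac_a) > max(frac_b) or max(frac_a) < min(frac_b)

# Hypothetical per-replicate (methylated, unmethylated) counts
group_a = [(40, 15), (35, 20), (25, 15)]
group_b_tight = [(30, 20), (25, 20), (20, 20)]
group_b_noisy = [(45, 5), (20, 25), (10, 30)]

for label, group_b in [("tight", group_b_tight), ("noisy", group_b_noisy)]:
    print(label,
          "pooled A:", pool_replicates(group_a),
          "pooled B:", pool_replicates(group_b),
          "consistent:", replicates_consistent(group_a, group_b))
```

Both versions of group B pool to the same counts, so a single merged test cannot tell them apart; only the post-hoc check reveals that the second set of replicates disagree with each other.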

Hierarchical testing

• Test larger regions
– Windows / Features etc.

• Take significant hits and subdivide
– Smaller windows
– Individual CpGs
– Correct only for these tests

• Assemble hits together to make up DMRs
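The two-stage idea can be sketched as follows. This is a simplified illustration, not the procedure of any specific package: it uses a chi-square test at both stages, Bonferroni correction within each stage, made-up counts, and hypothetical function names, and it makes no claim about the overall error rate of the combined procedure.

```python
from math import erfc, sqrt

def chi_square_p(meth_a, unmeth_a, meth_b, unmeth_b):
    """p-value of a Pearson chi-square test on a 2x2 table (1 d.o.f.)."""
    row_a, row_b = meth_a + unmeth_a, meth_b + unmeth_b
    col_m, col_u = meth_a + meth_b, unmeth_a + unmeth_b
    if 0 in (row_a, row_b, col_m, col_u):
        return 1.0
    total = row_a + row_b
    stat = total * (meth_a * unmeth_b - unmeth_a * meth_b) ** 2 / (
        row_a * row_b * col_m * col_u)
    return erfc(sqrt(stat / 2))

def pool(tables):
    """Sum several 2x2 tables element by element."""
    return tuple(sum(column) for column in zip(*tables))

def hierarchical_hits(regions, alpha=0.05):
    """regions maps a region name to its per-CpG tables
    (meth_a, unmeth_a, meth_b, unmeth_b)."""
    # Stage 1: test whole regions, correcting for the number of regions
    stage1_cutoff = alpha / len(regions)
    significant = [name for name, cpgs in regions.items()
                   if chi_square_p(*pool(cpgs)) < stage1_cutoff]
    # Stage 2: test CpGs inside significant regions only, correcting
    # only for the number of stage 2 tests
    n_stage2 = sum(len(regions[name]) for name in significant)
    hits = [(name, index)
            for name in significant
            for index, table in enumerate(regions[name])
            if chi_square_p(*table) < alpha / n_stage2]
    return significant, hits

# Hypothetical CpG islands with two CpGs each
regions = {
    "CGI_1": [(30, 10, 10, 30), (28, 12, 12, 28)],
    "CGI_2": [(20, 20, 21, 19), (19, 21, 20, 20)],
    "CGI_3": [(35, 5, 30, 10), (20, 20, 20, 20)],
}
significant, hits = hierarchical_hits(regions)
print("significant regions:", significant)
print("CpG-level hits:", hits)
```

Adjacent hits from the second stage would then be merged into DMRs.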

Hierarchical testing

[Figure: genome track showing a row of CpG islands (CGI); in the later panels some of the CGIs are marked with an X]

Statistically ‘creative’ solution to not having enough data

Methylation statistics packages

• swDMR (Perl/R package)
– Sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n > 3)

• methylKit* (R package by A. Akalin et al.)
– Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using SLIM method.

• bsseq* (R/Bioconductor by K.D. Hansen)
– Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection

• BiSeq* (R/Bioconductor by K. Hebestreit et al.)
– Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq

• RnBeads* (R package by F. Mueller et al.)
– Works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data

• DMAP* (C command line tool by P. Stockwell et al.)
– RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA

• RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith)
– Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjust for neighbouring CpGs

• MOABS* (C++ command line tool by D. Sun et al.)
– Beta-binomial hierarchical model to capture sampling and biological variation, Credible Methylation Difference (CDIF) single metric that combines biological and statistical significance

• ComMet (Y. Saito et al., 2014)
– Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates

• DSS (R/Bioconductor by Feng et al., 2014)
– Constructs genome-wide prior distribution for beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci

• more appearing every other week…

* interface well with

Tool comparison (Tool / Statistical test / Suitable for / Implementation / Notes)

bsseq
– Statistical test: Sample-wise smoothing, then group differences via CpG-wise t-tests (p-value cutoff to define adjacent CpG sites as DMRs)
– Suitable for: WGBS; not designed for targeted BS-Seq or RRBS
– Implementation: R package/Bioconductor
– Notes: Outperforms Fisher’s exact test; intended to compare 2 groups; replicates required

BiSeq
– Statistical test: Define CpG clusters, smooth methylation data, model and test group effect (fitting beta regression model to smoothed methylation levels and testing for group effect using the Wald test), hierarchical testing procedure on CpG clusters, then define DMR boundaries
– Suitable for: RRBS; targeted BS-Seq; for WGBS
– Implementation: R package/Bioconductor
– Notes: Very computationally intensive; not limited to 2 groups

methylKit
– Statistical test: Models CpG methylation within a logistic regression. Sliding linear model (SLIM) to correct for multiple testing
– Suitable for: (e)RRBS
– Implementation: R package

* WGBS = whole genome BS-Seq; (e)RRBS = (enhanced) reduced representation BS-Seq

bsseq – for whole genome BS-Seq

• Smoothing of low coverage BS-Seq first to get reliable semi-local methylation estimates

• Not suitable for captured or restricted data

• After smoothing it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs)

• Smoothing suitable for even a single sample

• Works for CpG context in humans, will probably not scale to 2x585M Cs in non-CG context
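The general idea of borrowing information from neighbouring CpGs can be illustrated with a much simpler stand-in: a coverage-weighted kernel average. This is not the BSmooth algorithm itself, the bandwidth and data are invented, and `smooth_methylation` is a hypothetical name; it only shows why smoothing makes low-coverage estimates more stable.

```python
def smooth_methylation(positions, meth_counts, coverage, bandwidth=1000):
    """Kernel-weighted methylation estimate at each CpG.

    Each neighbour within `bandwidth` bases contributes its methylated
    and total calls, down-weighted by distance (tricube kernel).
    A simplified stand-in for illustration only."""
    smoothed = []
    for centre in positions:
        weighted_meth = 0.0
        weighted_cov = 0.0
        for pos, meth, cov in zip(positions, meth_counts, coverage):
            distance = abs(pos - centre) / bandwidth
            if distance < 1:
                weight = (1 - distance ** 3) ** 3
                weighted_meth += weight * meth
                weighted_cov += weight * cov
        smoothed.append(weighted_meth / weighted_cov if weighted_cov else float("nan"))
    return smoothed

# Hypothetical low-coverage data: three nearby CpGs and one isolated CpG
positions = [100, 200, 300, 5000]
meth_counts = [1, 0, 3, 2]
coverage = [1, 1, 4, 2]

raw = [m / c for m, c in zip(meth_counts, coverage)]
smoothed = smooth_methylation(positions, meth_counts, coverage)
for pos, r, s in zip(positions, raw, smoothed):
    print(f"{pos}: raw = {r:.2f}, smoothed = {s:.2f}")
```

The single-read calls at positions 100 and 200 (raw values of 100% and 0%) are pulled towards a shared local estimate, while the isolated CpG at position 5000 has no neighbours within the bandwidth and keeps its raw value.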

BSmooth algorithm

[Figure legend: black: 25x (Lister); pink: 4x (Lister)]

BSmooth t-values
