dna methylation analysis in r
DESCRIPTION
Tutorial for analyzing DNA methylation data from bisulfite sequencing with RTRANSCRIPT
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Analyzing RRBS methylation data with RBasic analysis with R package methylKit
Altuna Akalınala2027medcornelledu
EpiWorkshop 2013Weill Cornell Medical College
Institute for Computational Biomedicine
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Outline
1 Introduction
2 R Basics
3 Genomics and R
4 RRBS analysis with R package methylKitWorking with annotated methylation events
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Introduction
This tutorial will show how to analyze methylation data obtained fromReduced Representation Bisulfite Sequencing (RRBS) experimentsFirst we will introduce
R basicsGenomics and R Bioconductor packages that will be usefulwhen working with genomic intervalsHow to use the package methylKit to analyze methylationdata
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Before we start
Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip
Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
R Basics
Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Computations in RR can be used as an ordinary calculator There are a few examples
2 + 3 5 Note the order of operations
[1] 17
log(10) Natural logarithm with base e=2718282
[1] 2303
4^2 4 raised to the second power
[1] 16
32 Division
[1] 15
sqrt(16) Square root
[1] 4
abs(3 - 7) Absolute value of 3-7
[1] 4
pi The mysterious number
[1] 3142
exp(2) exponential function
[1] 7389
This is a comment line
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Outline
1 Introduction
2 R Basics
3 Genomics and R
4 RRBS analysis with R package methylKitWorking with annotated methylation events
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Introduction
This tutorial will show how to analyze methylation data obtained fromReduced Representation Bisulfite Sequencing (RRBS) experimentsFirst we will introduce
R basicsGenomics and R Bioconductor packages that will be usefulwhen working with genomic intervalsHow to use the package methylKit to analyze methylationdata
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Before we start
Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip
Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
R Basics
Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Computations in RR can be used as an ordinary calculator There are a few examples
2 + 3 5 Note the order of operations
[1] 17
log(10) Natural logarithm with base e=2718282
[1] 2303
4^2 4 raised to the second power
[1] 16
32 Division
[1] 15
sqrt(16) Square root
[1] 4
abs(3 - 7) Absolute value of 3-7
[1] 4
pi The mysterious number
[1] 3142
exp(2) exponential function
[1] 7389
This is a comment line
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Introduction
This tutorial will show how to analyze methylation data obtained fromReduced Representation Bisulfite Sequencing (RRBS) experimentsFirst we will introduce
R basicsGenomics and R Bioconductor packages that will be usefulwhen working with genomic intervalsHow to use the package methylKit to analyze methylationdata
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Before we start
Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip
Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
R Basics
Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Computations in RR can be used as an ordinary calculator There are a few examples
2 + 3 5 Note the order of operations
[1] 17
log(10) Natural logarithm with base e=2718282
[1] 2303
4^2 4 raised to the second power
[1] 16
32 Division
[1] 15
sqrt(16) Square root
[1] 4
abs(3 - 7) Absolute value of 3-7
[1] 4
pi The mysterious number
[1] 3142
exp(2) exponential function
[1] 7389
This is a comment line
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Before we start
Download and unzip the data folderhttpmethylkitgooglecodecomfilesmethylKitTutorialData_feb2012zip
Make sure that you installed R 2152 and RStudiofor Mac OSX httpcranr-projectorgbinmacosxoldR-2152pkgfor windows httpcranr-projectorgbinwindowsbaseold2152For linux httpcranr-projectorgsrcbaseR-2R-2152targzhttpwwwrstudiocomidedownload
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
R Basics
Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Computations in RR can be used as an ordinary calculator There are a few examples
2 + 3 5 Note the order of operations
[1] 17
log(10) Natural logarithm with base e=2718282
[1] 2303
4^2 4 raised to the second power
[1] 16
32 Division
[1] 15
sqrt(16) Square root
[1] 4
abs(3 - 7) Absolute value of 3-7
[1] 4
pi The mysterious number
[1] 3142
exp(2) exponential function
[1] 7389
This is a comment line
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
R Basics
Here are some basic R operations and data structures that will begood to know if you do not have prior experience with R
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Computations in RR can be used as an ordinary calculator There are a few examples
2 + 3 5 Note the order of operations
[1] 17
log(10) Natural logarithm with base e=2718282
[1] 2303
4^2 4 raised to the second power
[1] 16
32 Division
[1] 15
sqrt(16) Square root
[1] 4
abs(3 - 7) Absolute value of 3-7
[1] 4
pi The mysterious number
[1] 3142
exp(2) exponential function
[1] 7389
This is a comment line
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Computations in RR can be used as an ordinary calculator There are a few examples
2 + 3 5 Note the order of operations
[1] 17
log(10) Natural logarithm with base e=2718282
[1] 2303
4^2 4 raised to the second power
[1] 16
32 Division
[1] 15
sqrt(16) Square root
[1] 4
abs(3 - 7) Absolute value of 3-7
[1] 4
pi The mysterious number
[1] 3142
exp(2) exponential function
[1] 7389
This is a comment line
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
VectorsR handles vectors easily and intuitively
x lt- c(1 3 2 10 5) create a vector x with 5 componentsx
[1] 1 3 2 10 5
y lt- 15 create a vector of consecutive integersy
[1] 1 2 3 4 5
y + 2 scalar addition
[1] 3 4 5 6 7
2 y scalar multiplication
[1] 2 4 6 8 10
y^2 raise each component to the second power
[1] 1 4 9 16 25
2^y raise 2 to the first through fifth power
[1] 2 4 8 16 32
y y itself has not been unchanged
[1] 1 2 3 4 5
y lt- y 2y it is now changed
[1] 2 4 6 8 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
A matrix refers to a numeric array of rows and columns One of theeasiest ways to create a matrix is to combine vectors of equal lengthusing cbind() meaning column bind
x lt- c(1 2 3)y lt- c(4 5 6)m1 lt- cbind(x y)m1
x y [1] 1 4 [2] 2 5 [3] 3 6
t(m1) transpose of m1
[1] [2] [3] x 1 2 3 y 4 5 6
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Matrices
dim(m1) 2 by 3 matrix
[1] 3 2
You can also directly list the elements and specify the matrix
m2 lt- matrix(c(1 3 2 5 -1 2 2 3 9) nrow = 3)m2
[1] [2] [3] [1] 1 5 2 [2] 3 -1 3 [3] 2 2 9
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
A data frame is more general than a matrix in that different columnscan have different modes (numeric character factor etc) Small tomoderate size data frame can be constructed by dataframe()function For example we illustrate how to construct a data framefrom genomic intervals or coordinates
chr lt- c(chr1 chr1 chr2 chr2)strand lt- c(- - + +)start lt- c(200 4000 100 400)end lt- c(250 410 200 450)mydata lt- dataframe(chr start end strand)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
names(mydata) lt- c(chr start end strand) variable namesmydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
OR this will work toomydata lt- dataframe(chr = chr start = start
end = end strand = strand)mydata
chr start end strand 1 chr1 200 250 - 2 chr1 4000 410 - 3 chr2 100 200 + 4 chr2 400 450 +
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
There are a variety of ways to identify the elements of a data frame
mydata[24] columns 234 of data frame
start end strand 1 200 250 - 2 4000 410 - 3 100 200 + 4 400 450 +
columns chr and start from data framemydata[c(chr start)]
chr start 1 chr1 200 2 chr1 4000 3 chr2 100 4 chr2 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Data Frames
mydata$start variable start in the data frame
[1] 200 4000 100 400
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
ListsAn ordered collection of objects (components) A list allows you togather a variety of (possibly unrelated) objects under one name
example of a list a string a numeric vector a matrix and a scalarw lt- list(name = Fred mynumbers = c(1 2 3)
mymatrix = matrix(14 ncol = 2) age = 53)w
$name [1] Fred $mynumbers [1] 1 2 3 $mymatrix [1] [2] [1] 1 3 [2] 2 4 $age [1] 53
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Lists
You can extract elements of a list using the [[]] convention
w[[3]] 3rd component of the list
[1] [2] [1] 1 3 [2] 2 4
w[[mynumbers]] component named mynumbers in list
[1] 1 2 3
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
R has great support for plotting and customizing plots We will showonly a few below
x lt- rnorm(50) sample 50 values from normal distrhist(x) plot the histogram of those valueshist(x main = Hello histogram) change the titley lt- rnorm(50) randomly sample 50 points from normal distrplot(x y) plot a scatter ploty lt- rnorm(50) randomly sample 50 points from normal distrboxplot(x y main = boxplots of random samples) plot boxplotsplot(x y main = scatterplot of random samples) scatter plots
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Graphics in R
Saving plots
pdf(file = mygraphsmyplotpdf)plot(x y)devoff() OR
plot(x y)devcopy(pdf file = mygraphsmyplotpdf)devoff()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting help on R functionscommands
Most R functions have great documentation on how to use them Try and will pull the documentation on the functions and willfind help pages on a vague topic Try on R terminalhisthistogram
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing packages
Where R packages liveCRAN httpcranr-projectorgBioconductor httpbioconductororgR-forge httpr-forger-projectorgGithub httpgithubcomGooglecode httpcodegooglecom
How to install packagesinstallpackages(devtools)
source(httpbioconductororgbiocLiteR)biocLite(GenomicRanges)
Install from the sourceinstallpackages(devtools_11targzrepos=NULLtype=source)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Genomics and R
Bioconductor and CRAN has many packages that will help analyzegenomics data Some of the most popular ones are GenomicRangesIRanges Rsamtools and BSgenome Check outhttpbioconductororg for packages that suits your purpose
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
Most of the genomics data are in the form of genomic intervalsassociated with a score That means mostly the data will be in tableformat with columns denoting chromosome start positions endpositions strand and score One of the popular formats is BEDformat In R you can easily read tabular format data withreadtable() function In addition you can write data to a text filewith writetable()
read enhancer marker BED fileenhdf = readtable(subsetenhancersbed header = FALSE) read CpG island BED filecpgidf = readtable(subsetcpgihg18bedtxt
header = FALSE)
write data frames to text fileswritetable(cpgidf exampleouttxt)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the genomics data
check first lines to see how the data looks likehead(enhdf 2)head(cpgidf 2) get CpG islands on chr21head(cpgidf[cpgidf$V1 == chr21 ] 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
One of the most useful operations when working with genomicintervals is the overlap operationbioconductor packages IRanges and GenomicRanges provideefficient ways to handle genomic interval dataBelow we will show how to convert your data to GenomicRangesobjectsdata structures and do overlap between enhancers andCpG islands
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
covert enhancer data frame to GenomicRanges objectlibrary(GenomicRanges) load the packageenh lt- GRanges(seqnames=enhdf$V1
ranges=IRanges(start=enhdf$V2end=enhdf$V3) )cpgi = GRanges(seqnames=cpgidf$V1
ranges=IRanges(start=cpgidf$V2end=cpgidf$V3)ids=cpgidf$V4 )
find enhancers overlapping with CpG islandscpgenh=subsetByOverlaps(enh cpgi)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Using GenomicRanges package for operations ongenomic intervals
number of enhancers overlapping with CpG islandslength(cpgenh) number of all enhancers in the setlength(enh) plot histogram of lenths of CpG islandshist(width(cpgi))
Histogram of width(cpgi)
width(cpgi)
Fre
quen
cy
0 2000 6000 10000
050
010
00
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Methylation profile by Reduced RepresentationBisulfite Sequencing (RRBS)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Installing methylKit
In R terminal install the package and the dependencies by typing thefollowing commands
source(httpmethylkitgooglecodecomfilesinstallmethylKitR) installing the package with the dependenciesinstallmethylKit(ver = 056 dependencies = TRUE)
seehttpscodegooglecompmethylkitInstallationfor more details on installing methylKit
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Flowchart for capabilities of methylKit
Aligned reads to the genome
Methylation per base
Sample merge
Differential methylation
Descriptive statistics
Correlation
Clustering and PCA
Filtering by coverage
Windowregion based methylation
scores
VisualizationAnnotation Coercion to bioconductor and R
objectsFurther analysis by
user
Input option 1
Input option 2
readbismark()
read()
filterByCoverage()
getMethylationStats()getCoverageStats()
regionCounts() tileMethylCounts()
unite()
getCorrelation()
clusterSamples()
PCASamples()
annotateWithFeature()and similar functions
calculateDiffMeth()
bedgraph() diffMethPerChr()
as() getData()
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Now methylKit is installed we will load the package and start byreading in the methylation call data The methylation call files arebasically text files that contain percent methylation score per base Atypical methylation call file looks like this
chrBase chr base strand coverage 1 chr2012314 chr20 12314 F 21 2 chr2012378 chr20 12378 R 13 3 chr2012382 chr20 12382 R 13 4 chr2012323 chr20 12323 F 21 5 chr2055740 chr20 55740 R 53 freqC freqT 1 9524 476 2 7692 2308 3 3846 6154 4 8571 1429 5 6792 3208
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Most of the time bisulfite sequencing experiments have test andcontrol samples The test samples can be from a disease tissue whilethe control samples can be from a healthy tissue You can read a setof methylation call files that have testcontrol conditions givingtreatment vector option as follows
library(methylKit)filelist = list(test1CpGtxt test2CpGtxt
ctrl1CpGtxt ctrl2CpGtxt) read the files to a methylRawList object myobjmyobj = read(filelist sampleid = list(test1
test2 ctrl1 ctrl2) assembly = hg18treatment = c(1 1 0 0) context = CpG)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Reading the methylation data
Another way to read the methylation calls is directly from theSAM files The SAM files must be sorted and only SAM files fromBismark aligner is supported at the moment Checkreadbismark() function help to learn more about this
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the datawe can check the basic stats about the methylation data such ascoverage and percent methylation We now have a methylRawListobject which contains methylation information per sample Thefollowing command prints out percent methylation statistics forsecond sample test2
getMethylationStats(myobj[[2]] plot = F bothstrands = F)
methylation statistics per base summary Min 1st Qu Median Mean 3rd Qu 000 000 248 3400 8180 Max 10000 percentiles 0 10 20 30 40 0000 0000 0000 0000 0000 50 60 70 80 90 2479 30986 70000 89583 100000 95 99 995 999 100 100000 100000 100000 100000 100000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the dataThe following command plots the histogram for percent methylationdistributionThe figure below is the histogram and numbers on barsdenote what percentage of locations are contained in that binTypically percent methylation histogram should have two peaks onboth ends
getMethylationStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG methylation
methylation per base
Fre
quen
cy
0 20 40 60 80 100
020
000
4000
060
000
8000
0
529
2912 12 08 09 09 12 1 16 1 14 14 16 2 23 27 38
58
134
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Quality check and basic features of the data
We can also plot the read coverage per base information in a similarway Experiments that are highly suffering from PCR duplication biaswill have a secondary peak towards the right hand side of thehistogram
getCoverageStats(myobj[[2]] plot = T bothstrands = F)
Histogram of CpG coverage
log10 of read coverage per base
Fre
quen
cy
10 15 20 25 30 35 40 45
050
0010
000
1500
020
000
2500
030
000
3500
0 222
177
125
139
168
144
2
03 0 0 0 0 0 0 0 0 0 0
test2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Filtering samples based on read coverage
It might be useful to filter samples based on coverageThe codebelow filters a methylRawList and discards bases that havecoverage below 10X and also discards the bases that have more than999th percentile of coverage in each sample
filteredmyobj = filterByCoverage(myobj locount = 10loperc = NULL hicount = NULL hiperc = 999)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationUniting data sets
In order to do further analysis we will need to get the bases coveredin all samples The following function will merge all samples to oneobject for base-pair locations that are covered in all samples
meth = unite(myobj destrand = FALSE) meth is a methylBase object
Let us take a look at the data content of methylBase object
head(meth 2)
id chr start end 1 chr2010100840 chr20 10100840 10100840 2 chr2010100844 chr20 10100844 10100844 strand coverage1 numCs1 numTs1 coverage2 1 + 29 2 27 11 2 + 29 2 27 11 numCs2 numTs2 coverage3 numCs3 numTs3 1 0 11 43 2 41 2 1 10 42 2 40 coverage4 numCs4 numTs4 1 10 0 10 2 10 0 10
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Sample CorrelationThis function will either plot scatter plot and correlation coefficients orjust print a correlation matrix
getCorrelation(meth plot = T)
test1
00 04 08
092 091
00 04 08
00
04
08
091
00
04
08
00 04 08
00
04
08
x
y
test2
089 089
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
ctrl1
00
04
08
097
00 04 08
00
04
08
00 04 08
00
04
08
x
y
00 04 08
00
04
08
x
y
00 04 0800 04 08
00
04
08
x
y
ctrl2
CpG base correlation
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
methylKit can do hiearchical clustering
clusterSamples(meth dist = correlation method = wardplot = TRUE)
000
005
010
015
020
025
030
035
CpG methylation clustering
Distance method correlation Clustering method wardSamples
Hei
ght
ctrl1
ctrl2
test
1
test
2
Call hclust(d = d method = HCLUSTMETHODS[hclustmethod]) Cluster method ward Distance pearson Number of objects 4
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofiles
Principal Component Analysis (PCA) is another option We can plotPC1 (principal component 1) and PC2 (principal component 2) axisand a scatter plot of our samples on those axis which will reveal howthey cluster
PCASamples(meth)
minus100 minus50 0 50 100
minus10
0minus
500
5010
015
0
CpG methylation PCA Analysis
PC1
PC
2
test1
test2
ctrl1ctrl2
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Clustering your samples based on the methylationprofilesclustering multiple conditions
It is possible to cluster samples from more than 2 conditions Thecode below will not work since the tutorial does not have an exampledataset but if you happen to work with multiple conditions you canstill cluster them in the way shown belowThe members of the clusterwill be color coded based on the treatment option
filelist = list(condA_sample1txt condA_sample2txtcondB_sample1txt condB_sample1txtcondC_sample1txt condC_sample1txt)
clist = read(filelist sampleid = list(A1A2 B1 B2 C1 C2) assembly = hg18treatment = c(2 2 1 1 0 0) context = CpG)
newMeth = unite(clist)clusterSamples(newMeth)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
calculateDiffMeth() function is the main function tocalculate differential methylationDepending on the sample size per each set it will either useFisherrsquos exact or logistic regression to calculate P-valuesP-values will be adjusted to Q-values
myDiff = calculateDiffMeth(meth)
calculateDiffMeth() can also use multiple cores for fastercalculation
myDiff = calculateDiffMeth(meth numcores = 2)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated bases
Following bit selects the bases that have q-valuelt001 and percentmethylation difference larger than 25
get hyper methylated basesmyDiff25phyper = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hyper) get hypo methylated basesmyDiff25phypo = getmethylDiff(myDiff difference = 25
qvalue = 001 type = hypo) get all differentially methylated basesmyDiff25p = getmethylDiff(myDiff difference = 25
qvalue = 001)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Getting differentially methylated regions
methylKit can summarize methylation information over tilingwindows or over a set of predefined regions (promoters CpG islandsintrons etc) rather than doing base-pair resolution analysis
you might get warnings here ignore themtiles = tileMethylCounts(myobj winsize = 1000
stepsize = 1000)
head(tiles[[1]] 2)
id chr start end strand 1 chr201200113000 chr20 12001 13000 2 chr205500156000 chr20 55001 56000 coverage numCs numTs 1 68 53 15 2 212 192 20
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Differential methylation events per chromosomeWe can also visualize the distribution of hypohyper-methylatedbasesregions per chromosome using the following function
if plot=FALSE a list having per chromosome differentially methylation events will be returneddiffMethPerChr(myDiff plot = FALSE qvaluecutoff = 001
methcutoff = 25)
if plot=TRUE a barplot will be plotteddiffMethPerChr(myDiff plot = TRUE qvaluecutoff = 001
methcutoff = 25)
chr20
chr21
chr22
of hyper amp hypo methylated regions per chromsome
(percentage)
0 2 4 6 8 10
qvaluelt001 amp methylation diff gt=25
hyperhypo
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation eventsWe can annotate our differentially methylated regionsbases basedon gene annotation We need to read the gene annotation from a bedfile and annotate our differentially methylated regions with thatinformation Similar gene annotation can be created usingGenomicFeatures package available from Bioconductor
geneobj lt- readtranscriptfeatures(subsetrefseqhg18bedtxt) annotate differentially methylated Cs with promoterexonintron using annotation dataannotateWithGenicParts(myDiff25p geneobj)
summary of target set annotation with genic parts 8009 rows in target set -------------- -------------- percentage of target features overlapping with annotation promoter exon intron intergenic 1443 1904 4033 3857 percentage of target features overlapping with annotation (with promotergtexongtintron precedence) promoter exon intron intergenic 1443 1411 3289 3857 percentage of annotation boundaries with feature overlap promoter exon intron 20423 3851 10507 summary of distances to the nearest TSS Min 1st Qu Median Mean 3rd Qu 0 2440 10200 30800 32300 Max 952000
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
Similarly we can read the CpG island annotation and annotate ourdifferentially methylated basesregions with them
read the shores and flanking regions and name the flanks as shores and CpG islands as CpGicpgobj = readfeatureflank(subsetcpgihg18bedtxt
featureflankname = c(CpGi shores))diffCpGann = annotateWithFeatureFlank(myDiff25p
cpgobj$CpGi cpgobj$shores featurename = CpGiflankname = shores)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Annotating differential methylation events
We can also read any BED file and annotate our differentiallymethylated basesregions with them
read the enhancer marker bed fileenhb = readbed(subsetenhancersbed)
diffCpGenh = annotateWithFeature(myDiff25penhb featurename = enhancers)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
After getting the annotation of differentially methylated regions wecan get the distance to TSS and nearest gene name using thegetAssociationWithTSS function
diffAnn = annotateWithGenicParts(myDiff25p geneobj)
targetrow is the row number in myDiff25phead(getAssociationWithTSS(diffAnn) 3)
targetrow disttofeature featurename 427 1 -121458 NM_000214 428 2 271458 NR_038972 310 3 -1871 NM_018354 featurestrand 427 - 428 - 310 -
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation events
It is also desirable to get percentagenumber of differentiallymethylated regions that overlap with intronexonpromoters
getTargetAnnotationStats(diffAnn percentage = TRUEprecedence = TRUE)
promoter exon intron intergenic 1443 1411 3289 3857
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the percentage of differentially methylated basesoverlapping with exonintronpromoters
plotTargetAnnotation(diffAnn precedence = TRUEmain = differential methylation annotation)
14
14
33
39
differential methylation annotation
promoterexonintronintergenic
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Working with annotated methylation eventsWe can also plot the CpG island annotation the same way The plotbelow shows what percentage of differentially methylated bases areon CpG islands CpG island shores and other regions
plotTargetAnnotation(diffCpGann col = c(greengray white) main = differential methylation annotation)
22
23
55
differential methylation annotation
CpGishoresother
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Regional analysis
We can also summarize methylation information over a set of definedregions such as promoters The function below summarizes themethylation information over a given set of promoter regions andoutputs a methylRaw or methylRawList object depending on theinput
promoters = regionCounts(myobj geneobj$promoters)
head(promoters[[1]] 3)
id chr start 1 chr203350563633507636NA chr20 33505636 2 chr2080602958062295NA chr20 8060295 3 chr2031370553139055NA chr20 3137055 end strand coverage numCs numTs 1 33507636 + 727 4 723 2 8062295 + 954 15 939 3 3139055 + 553 2 551
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Most methylKit objects (methylRawmethylBase and methylDiff)can be coerced to GRanges objects from GenomicRanges packageCoercing methylKit objects to GRanges will give users additionalflexiblity when customising their analyses
class(meth)as(meth GRanges)class(myDiff)as(myDiff GRanges)
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
We can also select rows from methylRaw methylBase and methylDiffobjects with select function An appropriate methylKit object will bereturned as a result of select function
select(meth 15) select first 5 rows ORmeth[15 ] select first 5 rows
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
methylKit convenience functions
Another useful function is getData() You can use this function toextract data frames from methylRaw methylBase and methylDiffobjects
get data frame and show first rowshead(getData(meth)) get data frame and show first rowshead(getData(myDiff))
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
information on RRBS and alikeRRBS httpwwwnaturecomnprotjournalv6n4absnprot2010190html
Agilent methyl-seq capture httpwwwhalogenomicscomsureselectmethyl-seq
Some of the aligners and pipelines
Bismark httpwwwbioinformaticsbbsrcacukprojectsbismark
AMP pipeline httpcodegooglecompamp-errbs
BSMAP httpcodegooglecompbsmap
other aligners and methods reviewed herehttpwwwnaturecomnmethjournalv9n2fullnmeth1828html
Akalın Altuna Analyzing RRBS methylation data with R
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-
IntroductionR Basics
Genomics and RRRBS analysis with R package methylKit
Further information
More info on methylKit
The webpage for the package ishttpmethylkitgooglecodecom
Genome Biology paper for the package is athttpgenomebiologycom20121310R87
Blog posts related to methylKit are at httpzvfakblogspotcomsearchlabelmethylKit
The slides for this presentation is herehttpmethylkitgooglecodecomfilesmethylKitTutorialSlides_2013pdf
The tutorial for the 2012 EpiWorkshop is herehttpmethylkitgooglecodecomfilesmethylKitTutorial_feb2012pdf
Akalın Altuna Analyzing RRBS methylation data with R
- Introduction
- R Basics
- Genomics and R
- RRBS analysis with R package methylKit
-