Determining Cytosine Methylation Rate of the Whole Genome and
Chloroplast Genome in Zea mays through Bismark Alignment
Justin Cook
University of Guelph, 50 Stone Rd E, Guelph ON, N1G 2W1
DNA methylation describes the addition of a methyl group (CH3) to
a nucleotide on a DNA strand. In plants and animals this occurs
primarily through cytosine methylation. This epigenetic process is
associated with regulation of gene transcription as well as the silencing of
transposable elements and repetitive regions in the genome. When a
methyl group is added to the DNA strand, the molecular machinery that
would typically attach to the DNA to initiate gene transcription is
obstructed by the methyl group and, as a result, is unable to bind.
Cytosine methylation is known to occur in three different sequence
contexts: CpG, CHG, and CHH, where H represents any nucleotide
other than guanine (A, C, or T). Methylation is
regulated by a group of enzymes called DNA methyltransferases. These
methyltransferases are specific to sequence contexts. In plants CpG
methylation is regulated by the enzyme MET1, while CHG and CHH
methylation are regulated by DRM1, DRM2, and CMT3.
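The three contexts can be told apart by looking at the two bases downstream of a cytosine. The following is a minimal sketch, not part of the original analysis: the function name `cytosine_context` is hypothetical, and labelling a cytosine "unknown" when its downstream bases are missing or ambiguous (N) is an assumption of this example.

```python
def cytosine_context(seq: str, pos: int) -> str:
    """Classify the cytosine at seq[pos] (read 5' to 3') by its sequence context."""
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = seq[pos + 1 : pos + 3]
    # CpG: the very next base is a guanine.
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    # Assumption of this sketch: without two unambiguous downstream bases,
    # the context cannot be determined.
    if len(downstream) < 2 or "N" in downstream:
        return "unknown"
    # CHG: a non-G base followed by a guanine.
    if downstream[1] == "G":
        return "CHG"
    # CHH: two non-G bases.
    return "CHH"


for example in ("CGA", "CAG", "CTA", "CA"):
    print(example, cytosine_context(example, 0))
```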
The methylome describes the entirety of the methylation
modifications in an organism's genome. The methylation status of a
genome can be resolved using next-generation sequencing
techniques coupled with sodium bisulphite (NaHSO3) treatment.
This technique, termed bisulphite sequencing, exploits the action of
sodium bisulphite on unmethylated cytosines. Treatment of DNA with
sodium bisulphite converts unmethylated cytosines to the nucleotide
uracil, which is read as thymine by sequencers, while
methylated cytosines remain unchanged.
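The effect of bisulphite treatment on a single strand can be illustrated with a short simulation. This is a sketch for illustration only: the function name `bisulphite_convert` and the set of methylated positions are hypothetical, and the uracil intermediate is skipped by writing thymine directly, as the sequencer would report it.

```python
def bisulphite_convert(seq: str, methylated_positions: set) -> str:
    """Return the sequence as a sequencer would read it after bisulphite treatment."""
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            # Unmethylated cytosine: converted to uracil, read as thymine.
            converted.append("T")
        else:
            # Methylated cytosines and all other bases are unchanged.
            converted.append(base)
    return "".join(converted)


# Only the cytosine at index 1 is methylated in this toy example.
print(bisulphite_convert("ACGTCC", {1}))
```

Comparing the converted read with the original sequence shows which cytosines were protected: a C that is still read as C was methylated, and a C now read as T was not.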
The primary objective was to find the cytosine methylation rates of
both the chloroplast and entire genome of Zea mays. Two reference
genomes were used: the EF09B data provided by the Lukens lab and the B73
genome obtained from Ensembl. Because the chloroplast genome is
unmethylated, it was used to infer the false-positive cytosine methylation
rate of the alignment program. The secondary objective of this project was to
develop a functional workflow for bisulphite sequencing that utilizes
modern quality-trimming and aligning programs.
Trim Galore!
Next-generation sequencing platforms use custom adaptor sequences to amplify the sequenced fragments.
These adaptors can remain as artifacts in the data that these platforms
produce. Trim Galore! performs both quality- and adaptor-trimming of data through the command-line
tool Cutadapt. Trim Galore! first removes low-quality base calls from the 3’ end of sequence reads before
adaptor removal. The user can specify the next-generation sequencing platform used or supply their
own adaptor sequence if applicable. Trim Galore! accepts single- or paired-end data. For the Zea mays analyses the Illumina adaptor sequence ‘AGATCGGAAGAGC’ was used on paired-end FastQ files.
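To make the idea of 3’ adaptor removal concrete, here is a deliberately simplified sketch. It is not how Cutadapt or Trim Galore! are implemented: it only handles exact matches, whereas the real tools tolerate sequencing errors. The function name `trim_adaptor` and the `min_overlap` parameter are hypothetical; the adaptor sequence is the one quoted above.

```python
ADAPTOR = "AGATCGGAAGAGC"  # Illumina adaptor sequence quoted in the text


def trim_adaptor(read: str, adaptor: str = ADAPTOR, min_overlap: int = 3) -> str:
    """Remove an exactly matching adaptor (or a 3' partial adaptor) from a read."""
    # Case 1: the full adaptor occurs inside the read; drop it and everything after.
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adaptor; try the longest prefix first.
    longest = min(len(adaptor) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[: len(read) - overlap]
    # No adaptor detected: return the read untouched.
    return read


print(trim_adaptor("ACGTACGT" + ADAPTOR + "TTT"))  # full adaptor present
print(trim_adaptor("ACGTACGTAGATC"))               # only the first 5 adaptor bases
```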
FastQC
Quality screening is important to ensure that the next-generation sequencing data is appropriate for
subsequent analyses. FastQC is a simple screening tool that provides a comprehensive report after the
analysis. The report includes information on the per-base and per-sequence quality scores, sequence duplication levels, and adaptor content.
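The per-base quality scores in such a report come from the quality line of each FastQ record. The sketch below shows the underlying calculation under the assumption that qualities use the Phred+33 encoding; it is not FastQC's own code, and the function name `per_base_mean_quality` is hypothetical.

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position across a set of FastQ quality lines."""
    longest = max(len(q) for q in quality_strings)
    means = []
    for i in range(longest):
        # Decode the character at position i for every read long enough to have one.
        scores = [ord(q[i]) - offset for q in quality_strings if i < len(q)]
        means.append(sum(scores) / len(scores))
    return means


# 'I' encodes Phred 40 and '!' encodes Phred 0 under the assumed Phred+33 offset.
print(per_base_mean_quality(["II", "I!"]))
```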
The cytosine methylation rates of both chloroplast genomes were
0.6% for CpG and CHG contexts and 0.1% for CHH contexts.
Approximately 195 million sequence pairs were processed at 6.1%
mapping efficiency. Just over 350 million cytosines were processed
during the B73 alignment and 349 million cytosines were processed
during the EF09B alignment.
The B73 genome alignment reported cytosine methylation rates of
87.0%, 73.4% and 2.9% for CpG, CHG and CHH contexts, respectively.
Unknown contexts were methylated at a rate of 17.4%. 195 million
sequence pairs were processed at 15.2% mapping efficiency.
Approximately 1 billion cytosine residues were analyzed during this
alignment.
EF09B alignment reported 87.8% CpG, 74.9% CHG and 2.9% CHH
methylation rates. Unknown contexts were methylated 17.3% of the
time. 195 million sequence pairs were processed at 59.3% mapping efficiency, corresponding to just over 4 billion cytosines.
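As a rough check on the figures above, mapping efficiency can be turned into an approximate number of aligned sequence pairs. The sketch assumes the rounded total of 195 million pairs reported above, so the outputs are estimates only; the helper name `mapped_pairs` is hypothetical.

```python
def mapped_pairs(total_pairs: float, efficiency_percent: float) -> float:
    """Approximate number of aligned pairs implied by a mapping efficiency."""
    return total_pairs * efficiency_percent / 100.0


TOTAL_PAIRS = 195e6  # rounded figure taken from the results above

# Mapping efficiencies (percent) reported for each alignment.
runs = {"chloroplast": 6.1, "B73": 15.2, "EF09B": 59.3}
aligned = {name: mapped_pairs(TOTAL_PAIRS, eff) for name, eff in runs.items()}

for name, pairs in aligned.items():
    print(f"{name}: about {pairs / 1e6:.1f} million aligned pairs")

ratio = aligned["EF09B"] / aligned["B73"]
print(f"EF09B / B73 aligned pairs: {ratio:.1f}x")
```

The ratio of roughly 3.9 between the EF09B and B73 alignments is in line with the approximately fourfold difference in cytosines analyzed (about 4 billion versus about 1 billion).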
The false-positive rates of 0.6% for CpG and CHG contexts and 0.1%
for CHH contexts fall within the range reported in other studies
(Hardcastle 2013; Regulski et al. 2013). The cytosine methylation rates for
the maize genome have previously been reported at over 70% for both
CpG and CHG contexts and under 5% in CHH contexts (Regulski et al.
2013), which is consistent with the alignment results found in this analysis.
The alignment of the lab-generated data to the B73 genome
reported a mapping efficiency of only 15.2%. This is likely a result of
aligning reads to a non-native reference genome to which many of them
did not correspond. This was reflected in the fact that only 1 billion cytosines
were analyzed for the B73 alignment while 4 billion cytosines were
analyzed for the EF09B alignment.
A large percentage of cytosines in an unknown context were
reported as being methylated. This describes methylation that did not
occur in one of the three known contexts. The 2008 study by Cokus et al.
suggested that sequence contexts may exist beyond just the CpG, CHG
and CHH forms. Their study examined all 7-mer context possibilities and
reported results suggesting that additional contexts may
be present. Future DNA methylation studies could examine the
possibility of additional context specificity.
False-positive cytosine methylation rates were found using the
unmethylated Zea mays chloroplast genome. CpG and CHG
methylation rates were 0.6% and CHH methylation was 0.1%.
Whole-genome methylation rates of Zea mays differed slightly
between the B73 and EF09B genomes. B73 rates were 87.0%, 73.4% and
2.9% in CpG, CHG and CHH contexts, respectively. EF09B rates were 87.8%, 74.9%
and 2.9% in the same contexts.
The proposed workflow for bisulphite sequencing includes the use
of Trim Galore! for quality- and adaptor-trimming, FastQC for quality
screening and Bismark for bisulphite mapping.
Figure 1 – Cytosine Methylation involves DNA
methyltransferases adding a methyl group (CH3) to the C5
position of a cytosine, producing 5-methylcytosine. Image
taken from Google Images.
Figure 2 – Treatment of unmethylated cytosines with sodium bisulphite (NaHSO3) results in
the conversion of the cytosine to the nucleotide uracil while methylated cytosines are
unaffected. Uracil is interpreted as a thymine by the sequencer. Image taken from
Google Images.
Figure 3 – Generalized workflow for Bisulphite Sequencing. Samtools is an optional tool that can be utilized if
Bismark does not output a SAM file. Blue arrows indicate the path of the workflow while red arrows indicate
when an output will be generated.
Table 1 – B73 Genome and
Chloroplast alignment summary
Table 2 – EF09B Genome and
Chloroplast alignment summary
Treatment of DNA with sodium bisulphite reduces the complexity
of the genome significantly. This is due to the conversion of the
cytosines to uracil (which is interpreted as thymine). Standard
genomic aligners can be confounded by this 3-letter genome,
particularly in repetitive regions where cytosine methylation is
known to occur at an increased frequency. Choosing an alignment
program designed to work with bisulphite-treated data is
therefore critical during bisulphite sequencing.
The bisulphite mapper Bismark is a program that has been
designed specifically for bisulphite sequencing alignment. Before
alignment can be performed, Bismark requires the user to supply a
reference genome, which it then transforms into two versions: one
in which all cytosines are changed to thymines (C-to-T)
and a second in which all guanines are changed to
adenines (G-to-A). This provides a bisulphite-transformed version of
the reference genome that accommodates both the top (C-to-T)
and bottom (G-to-A) strands of the bisulphite-converted reads.
Bismark also converts each read into a C-to-T and G-to-A version of
itself.
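The two in-silico conversions described above amount to simple character substitutions. The following sketch illustrates them on a toy sequence; it is not Bismark's own code, and the function names `c_to_t` and `g_to_a` are hypothetical.

```python
def c_to_t(seq: str) -> str:
    """Fully convert every cytosine to thymine (top-strand conversion)."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """Fully convert every guanine to adenine (bottom-strand conversion)."""
    return seq.upper().replace("G", "A")


reference = "ACGTCGGA"  # toy reference, not real maize sequence
print("C-to-T:", c_to_t(reference))
print("G-to-A:", g_to_a(reference))
```

Because the same conversion is applied to both reads and reference, methylation status no longer causes mismatches during alignment; it is recovered afterwards by comparing the original read with the original reference.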
Once the reference genome has been converted, the alignment itself is performed. Bismark runs 4 parallel
alignments of each read in the sequence file: the C-to-T and G-to-A reads are each aligned to the C-to-T and
G-to-A genomes. The single best alignment of these 4, based on sequence mismatches, is kept
while the other 3 are discarded. The parallel alignments allow Bismark to determine the directionality of the
sequences without requiring strand specificity as an input.
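The selection of one best alignment out of four can be sketched with a toy aligner. This is a strong simplification and not how Bismark works internally: Bismark delegates alignment to a dedicated short-read aligner and accounts for strand orientation, whereas this example uses ungapped mismatch counting on the forward strand only. All function names and the toy sequences are hypothetical.

```python
from itertools import product

KINDS = ("C-to-T", "G-to-A")


def convert(seq: str, kind: str) -> str:
    """Apply the named in-silico conversion to a sequence."""
    return seq.replace("C", "T") if kind == "C-to-T" else seq.replace("G", "A")


def best_hit(read: str, genome: str):
    """Return (mismatches, position) of the best ungapped placement of read."""
    best = None
    for pos in range(len(genome) - len(read) + 1):
        window = genome[pos : pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[0]:
            best = (mismatches, pos)
    return best


def four_way_alignment(read: str, genome: str):
    """Align both read versions to both genome versions; keep the fewest mismatches."""
    results = []
    for read_kind, genome_kind in product(KINDS, KINDS):
        mismatches, pos = best_hit(convert(read, read_kind), convert(genome, genome_kind))
        results.append((mismatches, pos, read_kind, genome_kind))
    results.sort(key=lambda r: r[0])
    return results[0], results[1:]  # kept alignment, three discarded alignments


genome = "AACCGGTTACGTAGGCTA"
read = "GGTTATGT"  # genome[4:12] after bisulphite conversion of an unmethylated C
kept, discarded = four_way_alignment(read, genome)
print("kept:", kept)
print("discarded:", discarded)
```

In this toy case the kept alignment pairs the C-to-T read with the C-to-T genome, which is how the winning combination reveals the strand a read came from.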
The output of Bismark alignment is a final report that summarizes important statistics from the analysis. This
includes the number of sequence pairs and cytosines analyzed, the mapping efficiency and the methylation
rates in different sequence contexts. Bismark also outputs a BAM or SAM file that can be imported into a
genome browser for visualization of the methylome. A final tool, Bismark Methylation Extractor, can be used to extract the position of every cytosine in a context-dependent manner using the SAM output.
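The context-specific methylation rates in the report are the number of methylated cytosines divided by all cytosines called in that context. The sketch below computes them from Bismark-style methylation call strings, in which upper-case Z, X, H and U mark methylated cytosines in CpG, CHG, CHH and unknown contexts, lower-case letters mark unmethylated ones, and a dot marks any other base. The function name `methylation_rates` and the example call strings are hypothetical.

```python
from collections import Counter

# Lower-case letter used for an unmethylated cytosine in each context.
CONTEXT_LETTERS = {"CpG": "z", "CHG": "x", "CHH": "h", "unknown": "u"}


def methylation_rates(call_strings):
    """Percent methylation per context from a list of methylation call strings."""
    counts = Counter()
    for calls in call_strings:
        counts.update(calls)
    rates = {}
    for context, letter in CONTEXT_LETTERS.items():
        methylated = counts[letter.upper()]
        total = methylated + counts[letter]
        # None signals that no cytosine was observed in this context.
        rates[context] = 100.0 * methylated / total if total else None
    return rates


print(methylation_rates(["Z..z.h", "..H.hhx"]))
```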
Figure 4 – Bismark alignment. Reads are converted to
C-to-T and G-to-A versions and aligned to
equivalently converted versions of the reference
genome. The unique best alignment is selected and
the remaining three are discarded. Image from
Krueger & Andrews 2011.
Bismark
Methodology
Background
Objectives
Results
Discussion
Conclusions
References
• Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Reference Source.
• Cokus, S. J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C. D., ... & Jacobsen, S. E. (2008). Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452(7184), 215-219.
• Feng, S., Cokus, S. J., Zhang, X., Chen, P. Y., Bostick, M., Goll, M. G., ... & Ukomadu, C. (2010). Conservation and divergence of methylation patterning in plants and animals. Proceedings of the National Academy of Sciences, 107(19), 8689-8694.
• Hardcastle, T. J. (2013). High-throughput sequencing of cytosine methylation in plant DNA. Plant Methods, 9(1), 1.
• Krueger, F., & Andrews, S. R. (2011). Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571-1572.
• Krueger, F. (2015). Trim Galore. A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files.
• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12.
• Regulski, M., Lu, Z., Kendall, J., Donoghue, M. T., Reinders, J., Llaca, V., ... & Tingey, S. (2013). The maize methylome influences mRNA splice sites and reveals widespread paramutation-like switches guided by small RNA. Genome Research, 23(10), 1651-1662.