Upload: justin-cook

Post on 15-Apr-2017


Determining Cytosine Methylation Rate of the Whole Genome and Chloroplast Genome in Zea mays through Bismark Alignment

Justin Cook

University of Guelph, 50 Stone Rd E, Guelph ON, N1G 2W1

DNA methylation describes the addition of a methyl group (CH3) to a nucleotide on a DNA strand. In plants and animals this occurs predominantly through cytosine methylation. This epigenetic process is associated with the regulation of gene transcription as well as the silencing of transposable elements and repetitive regions in the genome. When a methyl group is added to the DNA strand, the molecular machinery that would typically attach to the DNA to initiate gene transcription is obstructed by the methyl group and, as a result, is unable to bind. Cytosine methylation is known to occur in three sequence contexts: CpG, CHG, and CHH, where H represents any nucleotide other than guanine. Methylation is regulated by a group of enzymes called DNA methyltransferases, which are specific to sequence contexts. In plants, CpG methylation is regulated by the enzyme MET1, while CHG and CHH methylation are regulated by DRM1, DRM2, and CMT3.
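The three sequence contexts can be made concrete with a small sketch. The function below is illustrative only (the name `cytosine_context` is hypothetical and not part of any tool used here); it assumes a top-strand sequence and labels a cytosine as unknown when fewer than two downstream bases are available or one of them is an N.

```python
def cytosine_context(seq: str, i: int) -> str:
    """Classify the cytosine at index i of a top-strand sequence."""
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[i + 1 : i + 3]  # up to two bases after the cytosine
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"  # C followed directly by G
    if len(downstream) < 2 or "N" in downstream:
        return "unknown"  # context cannot be resolved
    if downstream[1] == "G":
        return "CHG"  # C, then a non-G base, then G
    return "CHH"  # C followed by two non-G bases


example = "ACGTCAGCAT"
for position in (1, 4, 7):
    print(position, cytosine_context(example, position))
```

For the example sequence, the cytosines at positions 1, 4 and 7 fall in CpG, CHG and CHH contexts, respectively.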

The methylome describes the entirety of the methylation modifications in an organism's genome. The methylation status of a genome can be resolved using next-generation sequencing techniques coupled with sodium bisulphite (NaHSO3) treatment. This technique, termed bisulphite sequencing, exploits the action of sodium bisulphite on unmethylated cytosines. Treatment of DNA with sodium bisulphite converts unmethylated cytosines to the nucleotide uracil, which is interpreted as thymine by sequencers, while methylated cytosines remain unchanged.
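The effect of bisulphite treatment on a read can be sketched in a few lines. This is a simplified illustration that assumes complete conversion and a single top-strand sequence; the function name and the set of methylated positions are hypothetical.

```python
def bisulphite_convert(seq, methylated_positions):
    """Simulate how a sequencer reads a bisulphite-treated top strand.

    Unmethylated cytosines become uracil and are read as T;
    methylated cytosines are protected and are still read as C.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")  # unmethylated C -> U -> read as T
        else:
            converted.append(base)  # methylated C and all other bases unchanged
    return "".join(converted)


# Only the cytosine at index 1 is methylated in this made-up example.
print(bisulphite_convert("ACGTCCA", {1}))  # prints ACGTTTA
```

Comparing the converted read with the original sequence shows which cytosines were methylated: any C that survives conversion.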

The primary objective was to find the cytosine methylation rates of both the chloroplast and the entire genome of Zea mays. Two reference genomes were used: the EF09B genome provided by the Lukens lab and the B73 genome available on Ensembl. The unmethylated chloroplast genome was used to infer the false-positive cytosine methylation rate of the alignment program. The secondary objective of this project was to develop a functional workflow for bisulphite sequencing that utilizes modern quality-trimming and alignment programs.

Trim Galore!

Next-generation sequencing platforms use custom sequence adaptors to amplify the sequences. These adaptors can remain as artifacts in the data that these platforms produce. Trim Galore! performs both quality- and adaptor-trimming through the command-line tool Cutadapt. It first removes low-quality base calls from the 3’ end of sequence reads and then removes the adaptors. The user can specify the next-generation sequencing platform used or supply their own adaptor sequence if applicable. Trim Galore! accepts single- or paired-end data. For the Zea mays analyses, the Illumina adaptor sequence ‘AGATCGGAAGAGC’ was used on paired-end FastQ files.
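A trimming call of this kind might be assembled as follows. This is a sketch rather than the exact command used for the poster: the file names, output directory and quality cutoff of 20 are assumptions, and the flag names (`--paired`, `--quality`, `--adapter`, `--output_dir`) should be checked against the installed Trim Galore! version. The code only builds and prints the command; it does not run it.

```python
import shlex


def trim_galore_command(read1, read2, adapter="AGATCGGAAGAGC",
                        quality=20, out_dir="trimmed"):
    """Build a paired-end Trim Galore! command as an argument list."""
    return [
        "trim_galore",
        "--paired",                 # treat the two files as mate pairs
        "--quality", str(quality),  # assumed Phred cutoff for 3' trimming
        "--adapter", adapter,       # Illumina adaptor named in the text
        "--output_dir", out_dir,    # hypothetical output location
        read1,
        read2,
    ]


# Hypothetical input file names.
cmd = trim_galore_command("sample_R1.fastq.gz", "sample_R2.fastq.gz")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where Trim Galore! is installed.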

FastQC

Quality screening is important to ensure that the next-generation sequencing data is appropriate for subsequent analyses. FastQC is a simple screening tool that provides a comprehensive report after the analysis. The report includes information on the per-base and per-sequence quality scores, sequence duplication levels, and adaptor content.
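The quality scores that such a report summarizes are stored in the fourth line of each FastQ record. The sketch below is not FastQC's code; it only shows how the characters decode to Phred scores, assuming the Phred+33 encoding used by recent Illumina pipelines. The function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FastQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line):
    """Mean Phred score of one read (a per-sequence quality value)."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)


print(phred_scores("II5!"))  # prints [40, 40, 20, 0]
print(mean_quality("II5!"))  # prints 25.0
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so a score of 20 means roughly one error in 100 calls.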

The cytosine methylation rates of both chloroplast genomes were 0.6% for CpG and CHG contexts and 0.1% for CHH contexts. Approximately 195 million sequence pairs were processed at 6.1% mapping efficiency. Just over 350 million cytosines were processed during the B73 alignment and 349 million cytosines were processed during the EF09B alignment.

The B73 genome alignment reported cytosine methylation rates of 87.0%, 73.4% and 2.9% for CpG, CHG and CHH contexts, respectively. Unknown contexts were methylated at a rate of 17.4%. Approximately 195 million sequence pairs were processed at 15.2% mapping efficiency, and approximately 1 billion cytosine residues were analyzed during this alignment.

The EF09B genome alignment reported cytosine methylation rates of 87.8%, 74.9% and 2.9% for CpG, CHG and CHH contexts, respectively. Unknown contexts were methylated at a rate of 17.3%. Approximately 195 million sequence pairs were processed at 59.3% mapping efficiency, corresponding to just over 4 billion cytosines.
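The arithmetic behind these figures can be sketched as follows. The code assumes that mapping efficiency is the share of input pairs that received a unique best alignment, and that a methylation rate is the number of methylated calls divided by all calls in that context. The 195 million total is the approximate value quoted above, and the call counts in the last line are made up for illustration.

```python
def aligned_pairs(total_pairs, mapping_efficiency_pct):
    """Number of read pairs aligned at a given mapping efficiency (%)."""
    return total_pairs * mapping_efficiency_pct / 100


def methylation_percentage(methylated, unmethylated):
    """Methylation rate (%) from methylated and unmethylated call counts."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("no cytosine calls in this context")
    return 100 * methylated / total


TOTAL_PAIRS = 195_000_000  # approximate number of pairs processed
for label, efficiency in [("chloroplast", 6.1), ("B73", 15.2), ("EF09B", 59.3)]:
    millions = aligned_pairs(TOTAL_PAIRS, efficiency) / 1e6
    print(f"{label}: about {millions:.1f} million pairs aligned")

# Hypothetical call counts: 870 methylated and 130 unmethylated CpG calls.
print(methylation_percentage(870, 130))  # prints 87.0
```

Under these assumptions, roughly 11.9, 29.6 and 115.6 million pairs aligned to the chloroplast, B73 and EF09B references, respectively.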

The false-positive rates of 0.6% for CpG and CHG contexts and 0.1% for CHH contexts fall between those reported in other studies (Hardcastle 2013; Regulski et al. 2013). The cytosine methylation rates for the maize genome have previously been reported at over 70% for both CpG and CHG contexts and under 5% for CHH contexts (Regulski et al. 2013), which matches the alignment results of this analysis.

The alignment of the lab-generated reads to the B73 genome reported a mapping efficiency of only 15.2%. This is likely because the reads were aligned to a non-native reference genome that did not correspond to them. The low efficiency was reflected in the number of cytosines analyzed: only 1 billion for the B73 alignment compared with 4 billion for the EF09B alignment.

A large percentage of cytosines in an unknown context were reported as methylated. This describes methylation that did not occur in one of the three known contexts. Cokus et al. (2008) suggested that sequence contexts may exist beyond the CpG, CHG and CHH forms: their study examined all 7-mer context possibilities and reported results suggesting that additional contexts may be present. Future DNA methylation studies could examine the possibility of additional context specificity.

False-positive cytosine methylation rates were found using the unmethylated Zea mays chloroplast genome. CpG and CHG methylation rates were 0.6% and the CHH methylation rate was 0.1%.

Whole-genome methylation rates of Zea mays differed slightly between the B73 and EF09B genomes. B73 rates were 87.0%, 73.4% and 2.9% in CpG, CHG and CHH contexts. EF09B rates were 87.8%, 74.9% and 2.9% in CpG, CHG and CHH contexts.

The proposed workflow for bisulphite sequencing includes the use of Trim Galore! for quality- and adaptor-trimming, FastQC for quality screening and Bismark for bisulphite mapping.

Figure 1 – Cytosine methylation involves DNA methyltransferases adding a methyl group (CH3) to the C5 position of a cytosine, producing 5-methylcytosine. Image taken from Google Images.

Figure 2 – Treatment of unmethylated cytosines with sodium bisulphite (NaHSO3) results in the conversion of the cytosine to the nucleotide uracil, while methylated cytosines are unaffected. Uracil is interpreted as a thymine by the sequencer. Image taken from Google Images.

Figure 3 – Generalized workflow for bisulphite sequencing. Samtools is an optional tool that can be utilized if Bismark does not output a SAM file. Blue arrows indicate the path of the workflow while red arrows indicate when an output will be generated.

Table 1 – B73 Genome and Chloroplast alignment summary

Table 2 – EF09B Genome and Chloroplast alignment summary

Treatment of DNA with sodium bisulphite significantly reduces the complexity of the genome, because cytosines are converted to uracil (which is interpreted as thymine). Standard genomic aligners can be confounded by this 3-letter genome, particularly in repetitive regions where cytosine methylation is known to occur at an increased frequency. Choosing an alignment program designed to work with bisulphite-treated data is therefore critical during bisulphite sequencing.

The bisulphite mapper Bismark is a program that has been designed specifically for bisulphite sequencing alignment. Before alignment can be performed, Bismark requires the user to supply a reference genome, which it transforms into two versions: one where all cytosines are changed to thymines (C-to-T) and a second where all guanines are changed to adenines (G-to-A). This provides a bisulphite-transformed version of the reference genome that accommodates both the top (C-to-T) and bottom (G-to-A) strands of the bisulphite-converted reads. Bismark also converts each read into a C-to-T and a G-to-A version of itself.
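The two in silico conversions are simple string substitutions. The sketch below illustrates the idea on a made-up sequence; it is not Bismark's implementation, and the function names are hypothetical.

```python
def c_to_t(seq):
    """Fully convert a sequence as if every cytosine were unmethylated."""
    return seq.upper().replace("C", "T")


def g_to_a(seq):
    """Equivalent conversion as seen from the opposite strand."""
    return seq.upper().replace("G", "A")


reference = "ACGTCG"
print(c_to_t(reference))  # prints ATGTTG
print(g_to_a(reference))  # prints ACATCA
```

Because both the reads and the reference are converted in the same way, methylation differences no longer cause mismatches during alignment; the original read is consulted afterwards to call methylation.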

The conversion of the reference genome is followed by the alignment itself. Bismark performs 4 parallel alignments of each read in the sequence file: the C-to-T and G-to-A reads are each aligned to the C-to-T and G-to-A genomes. The single best of these 4 alignments, based on sequence mismatches, is kept while the other 3 are discarded. The parallel alignments allow Bismark to determine the directionality of the sequences without requiring strand specificity as an input.
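The selection step can be sketched as below. This is a toy illustration, not Bismark's code: it takes mismatch counts for the four read/genome combinations as given (with `None` meaning no alignment was found) and assumes that a tie for the lowest count yields no unique best alignment.

```python
from typing import Dict, Optional, Tuple

Combination = Tuple[str, str]  # (read conversion, genome conversion)


def best_alignment(mismatches: Dict[Combination, Optional[int]]) -> Optional[Combination]:
    """Return the combination with the fewest mismatches, or None.

    None is returned when nothing aligned or when the lowest
    mismatch count is shared by more than one combination.
    """
    scored = [(count, combo) for combo, count in mismatches.items()
              if count is not None]
    if not scored:
        return None
    scored.sort(key=lambda item: item[0])
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # assumed handling: ambiguous reads are discarded
    return scored[0][1]


# Hypothetical mismatch counts for one read.
candidates = {
    ("C-to-T read", "C-to-T genome"): 0,
    ("C-to-T read", "G-to-A genome"): 3,
    ("G-to-A read", "C-to-T genome"): None,
    ("G-to-A read", "G-to-A genome"): 2,
}
print(best_alignment(candidates))
```

The winning combination also indicates which strand the read came from, which is how directionality can be inferred without strand information as input.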

The output of a Bismark alignment is a final report that summarizes important statistics from the analysis. This includes the number of sequence pairs and cytosines analyzed, the mapping efficiency and the methylation rates in the different sequence contexts. Bismark also outputs a BAM or SAM file that can be imported into a genome browser for visualization of the methylome. A final tool, Bismark Methylation Extractor, can be used to extract the position of every cytosine in a context-dependent manner from the SAM output.
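Context-specific rates can be tallied from per-read methylation call strings. The sketch below assumes the letter scheme that Bismark uses for these strings, as I understand it: z/Z for CpG, x/X for CHG, h/H for CHH and u/U for unknown context, with upper case meaning methylated and a dot marking a non-cytosine position. The call string is made up, and the code counts calls only; it does not reproduce the position output of Bismark Methylation Extractor.

```python
from collections import Counter

# Lower-case letter for each context; upper case marks a methylated call.
CONTEXT_LETTERS = {"z": "CpG", "x": "CHG", "h": "CHH", "u": "unknown"}


def summarize_calls(call_string):
    """Count methylated/unmethylated calls per context in one call string."""
    counts = Counter(call_string)
    summary = {}
    for letter, context in CONTEXT_LETTERS.items():
        methylated = counts[letter.upper()]
        unmethylated = counts[letter]
        total = methylated + unmethylated
        rate = 100 * methylated / total if total else None
        summary[context] = (methylated, unmethylated, rate)
    return summary


# Hypothetical call string for a single read.
for context, values in summarize_calls("..Z..x.h.hH.Z.z").items():
    print(context, values)
```

Summing these counts over all aligned reads, rather than a single string, gives the kind of per-context percentages shown in the alignment report.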

Figure 4 – Bismark alignment. Reads are converted to C-to-T and G-to-A versions and aligned to equivalently converted versions of the reference genome. The unique best alignment is selected and the remaining three are discarded. Image from Krueger & Andrews 2011.

Bismark

Methodology

Background

Objectives

Results

Discussion

Conclusions

References

• Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Reference Source.
• Cokus, S. J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C. D., ... & Jacobsen, S. E. (2008). Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452(7184), 215-219.
• Feng, S., Cokus, S. J., Zhang, X., Chen, P. Y., Bostick, M., Goll, M. G., ... & Ukomadu, C. (2010). Conservation and divergence of methylation patterning in plants and animals. Proceedings of the National Academy of Sciences, 107(19), 8689-8694.
• Hardcastle, T. J. (2013). High-throughput sequencing of cytosine methylation in plant DNA. Plant Methods, 9(1), 1.
• Krueger, F., & Andrews, S. R. (2011). Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571-1572.
• Krueger, F. (2015). Trim Galore. A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files.
• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12.
• Regulski, M., Lu, Z., Kendall, J., Donoghue, M. T., Reinders, J., Llaca, V., ... & Tingey, S. (2013). The maize methylome influences mRNA splice sites and reveals widespread paramutation-like switches guided by small RNA. Genome Research, 23(10), 1651-1662.