[Overview figure: the eukaryotic genomics workflow covered in this tutorial: de novo sequencing and genome assembly (Ch. 1-3); repeat identification (Ch. 4); training gene predictors from genes on the largest scaffolds using protein evidence from NCBI (Ch. 5); full annotation and annotation checks (Ch. 5); manual annotation; genome finishing; RNA-seq; variant detection (Ch. 7); ChIP-seq; evolutionary analyses; Unix skills; Gene Ontology.]


GCAT-SEEK Eukaryotic Genomics Tutorial

Preface. Tutorial Overview and Biological Setting

Chapter 1. Genome assembly I: Quality control with FastQC and Trimmomatic

Chapter 2. Genome assembly II: Assembly size estimation and k-mer graphs

Chapter 3. Genome assembly III: Assembly algorithms

Chapter 4. Genome annotation with Maker I: Overview and repeat finding

Chapter 5. Genome annotation with Maker II: Whole genome analysis

Chapter 6. Whole genome annotation: Miscellaneous methods

Chapter 7. SNP calling and interpretation

Glossary.

Beta protocols for Module 3: ABySS, Redundans, BUSCO, and OrthoFinder

Preface. Tutorial Overview and Biological Setting

In this tutorial, a variety of organisms are analyzed to facilitate doing this kind of work with large numbers of students. The intent is that after having worked on the small-scale examples, you will be able to work on full-scale problems that may take more time. In chapters 1 to 3 a bacterial genome from the nitrogen-fixing genus Ensifer is analyzed, courtesy of Prof. Bert Eardly from the GCAT-SEEK network. This small genome will allow us to speed up a computationally intensive part of the analysis for the purposes of teaching. One important difference between bacterial and eukaryotic genomes, apart from the larger size and greater number of repetitive elements in eukaryotes, is heterozygosity. The network is working on additional approaches for dealing with highly heterozygous genomes, which should be available on the website by summer 2017 (gcat-seek.weebly.com).

In chapters 4 and 5 we perform repeat finding and focus on how to identify and characterize protein-coding genes in a newly-sequenced genome. This process is called structural gene annotation. The end result of completing chapter 4 on a new genome is a new version of the genome where repetitive elements are "masked" in such a way that they will not interfere with gene annotation. The tutorial contains some steps that are important at a whole-genome scale, but may seem trivial given the artificially simplified example.

In chapter 5 we focus on gene annotation with MAKER. We annotate only a small section of DNA from a relatively complete genome of the flag rockfish Sebastes rubrivinctus. This is among the highest-quality genomes within the genus of 100+ species, and it is phylogenetically close to the species we will focus on in chapter 7, the black-and-yellow (S. chrysomelas) and gopher (S. carnatus) rockfishes. Rockfishes of the genus Sebastes are a diverse marine fish assemblage with an estimated 100+ species native to the west coast of North America. Rockfishes support important commercial and recreational fisheries and are the dominant assemblage on most cold temperate reefs. These live-bearers have a low intrinsic rate of population increase and highly sporadic recruitment, releasing large numbers of pelagic larvae into a variable coastal environment. Their slow growth rates render them vulnerable to overfishing. Uncertainty in the success of progeny produced in any particular year has favored an evolutionary strategy whereby some species of the Sebastes genus have extremely long lifespans and do not show signs of aging (negligible senescence), while others demonstrate typical aging patterns and have short lifespans. S. rubrivinctus has a maximum lifespan of only 18 years, but it is closely related to S. nigrocinctus, which has a maximum age of at least 116 years. Researchers expect to gain insight into the genetic mechanism for negligible senescence by sequencing and comparing the genomes of these two species. In the tutorial, we will use S. rubrivinctus to assist analysis of other species in the genus. We focus on a small section of DNA that undergraduates at Juniata College identified as having interesting differences between the two species pairs. In chapter 6 some helpful skills in the Linux operating system are reviewed.

In chapter 7 we perform SNP discovery in a scenario related to understanding depth-mediated speciation in marine fishes. Detailed examination of adaptation and reproductive isolation at the molecular level is very challenging in most species, yet to achieve a full understanding of these processes we must determine the genomic underpinnings of population divergence. The genus Sebastes boasts over 100 species worldwide and presents a natural evolutionary laboratory within which one may pursue hypotheses with the rare advantage of phylogenetic replication. There are a series of water depth-segregated sister species within Sebastes, allowing for replicate studies to identify loci that may have produced and/or maintain species boundaries through reproductive and ecological differentiation. We previously performed reduced-representation whole genome population genomic divergence analysis between individuals of two depth-segregated sister species. Interspecific divergence levels were mapped against the draft Sebastes rubrivinctus genome, which itself was mapped against the finished three-spined stickleback genome to establish approximate gene order. Over the whole genome, three regions of ~0.5 Mbp in length were divergent between the gopher rockfish S. carnatus and the black-and-yellow rockfish S. chrysomelas. Only one such region was also divergent between a different depth-separated species pair. The region in common contained 43 candidate genes that may be related to prezygotic reproductive isolation and niche partitioning. In chapter 7 we extend this analysis by analyzing new whole genome data on S. carnatus and S. chrysomelas. We focus on identifying polymorphisms in genes within the divergence island that change the amino acid sequence and differ between species. This analysis will greatly reduce the number of target genes in regions that may have been involved in producing or maintaining divergence between the species pair.

Chapter 1. Genome assembly I: Quality Control with FastQC and Trimmomatic

Background-FastQC and Trimmomatic

· In order to get started you will need some raw data to play with. This tutorial assumes you are using practice data that are currently on the Juniata HHMI server and are referenced throughout the tutorial. You may have raw data files from your own genome sequencing experiment, may download raw reads from a source such as the NCBI Short Read Archive (SRA) for a species you are interested in, or may request practice data from GCAT-SEEK network members (for options see gcat-seek.weebly.com/eukaryotic-genomes.html). However, if you use different data, you will need to change directory and file names throughout the tutorial to reflect the names and locations of your files.


· You will first want to check whether the sequencing run produced high quality data using tools like FastQC (Andrews 2010). You will use FastQC to obtain a summary of data quality over all individual sequencing reads in your dataset. Often a quality filtering step can aid in downstream analysis, guided by the FastQC results.

· Quality filtering and error correction have a profound influence on the quality of a genome assembly (Kelley et al. 2010; Salzberg et al. 2012). An assembly is comprised of contigs that have been joined into scaffolds, and contigs that could not be joined. While you might want to use highly stringent criteria for quality filtering, this stringency can reduce the amount of data to the point that coverage is too low for assembly, and good data are thrown out. Conversely, including data of poor quality may prematurely terminate contigs or make false connections between them. Contigs are the first pieces of an assembly, and are made using de Bruijn graphing techniques (Compeau et al. 2011; and see exercise below). de Bruijn graphs are an efficient means to determine overlap among sequence reads from massive datasets. These graphs rely on overlaps of small sections of individual reads from which the DNA sequence is deduced. Many assembly programs have built-in error correction algorithms that run before de Bruijn graphs are constructed; genome assembly software with internal error-correcting functionality includes SOAPdenovo (Li et al. 2010; Luo et al. 2012), ALLPATHS-LG (Butler et al. 2008), and MaSuRCA (Zimin et al. 2013). An alternate strategy to error correction is trimming the lower-quality ends off of reads, or removing such reads altogether. This approach is less computationally expensive, faster, and eliminates redundancy if the assembler being used already has some error-correction function. Programs like Trimmomatic (Bolger et al. 2014) are widely used because of their simplicity and relative speed. Trimmomatic is incompatible with the assembler MaSuRCA, however, which makes it clear that reading the literature and technical manuals for each program used in this tutorial is necessary to ensure proper use for your research setting.

Raw data from sequencing core facilities often arrives in the form of large files in FASTQ format. We will use the program FastQC to analyze the data. Each sequence read found within a FASTQ file takes up 4 lines and looks like this:

@SRR022868.1209/1

TATAACTATTACTTTCTTTAAAAAATGTTTTTCGTATTTCTATTGATTCTGGTACATCTAAATCTATCCATG

+

III

· This is the forward read of a genomic DNA fragment that was sequenced in the forward and reverse direction (i.e., a paired-end sequence read). The first line is the FASTQ header with the name of the read, ending with either "/1" or "/2" for paired-end data. All forward sequence reads are in one enormous FASTQ file, and all reverse sequence reads are in another of the same size. Forward and reverse sequence reads are in exactly the same order in both files, and most programs expect the pairs to be in the same order in the two files, like they are in this example. This is important because if a forward read is of poor quality and is filtered out by a quality control program like Trimmomatic, then its good-quality mate, now missing its pair, will be removed from the paired files as well and put into a new file with other unpaired sequence reads.

· The second line is the sequence. N's are used to identify nucleotides that could not be determined by the sequencer; other IUPAC ambiguity symbols are not used.

· The third line is a separator between the sequence and quality scores that may include some optional text.

· The fourth line is comprised of "Phred" or "Sanger" quality scores. There is one score for each nucleotide. These are calculated as -10 * log10 (probability of error). A nucleotide with an error probability of 0.01 would have a score of 20. In this equation, log10 0.01 = -2, and (-2 * -10) = 20.

· See http://en.wikipedia.org/wiki/FASTQ_format for more details. Depending on your sequencing platform, a different offset may apply when converting ASCII characters (a standard character encoding for text) to a Phred score.

In order for each quality score to take only one character of space, the scores are encoded as ASCII text, each character of which has a numerical value as well as a text representation. You can find a table with the values at http://en.wikipedia.org/wiki/ASCII (about halfway down the page, in the section entitled "Printable characters"). Note that the integer 33 is represented by "!" and 73 by "I". To convert symbols to numerical error probabilities, one would take the decimal value of the ASCII character (e.g. 73), subtract 33 (73-33=40), divide by -10 (40/-10=-4), and raise 10 to that value (e.g. 10^-4 = 0.0001).
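These conversions are easy to verify programmatically. Here is a minimal Python sketch of the arithmetic described above (an illustration only; the helper names are our own):

# Phred+33 ("Sanger") encoding: score = ASCII value - 33; P(error) = 10^(-score/10)
def phred_from_char(c):
    return ord(c) - 33

def error_prob_from_score(score):
    return 10 ** (-score / 10)

print(phred_from_char("I"))                         # 73 - 33 = 40
print(error_prob_from_score(40))                    # 10^-4 = 0.0001
print(error_prob_from_score(phred_from_char("!")))  # "!" is 33, so score 0 and P(error) = 1.0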

Which symbol represents the Phred quality score of 26 according to the ASCII conversion table (linked above)?

The following boxplot shows Phred quality scores (see above) on the y-axis and position within reads, starting from the 5' end, on the x-axis (for all reads simultaneously). Boxplots summarize quality data across the entire dataset for each position. A decline in quality is shown from the 5' to the 3' position. The center of the box is the median, the lower end of the box is the 25th percentile (Q1), the top of the box is the 75th percentile (Q3), whiskers show the data at about two standard deviations from the median [>1.5x(Q3-Q1) distant from Q1 or Q3], and blue stars are outliers.

Recall that the Phred score is on the Y-axis and position within the read on the X-axis. Reads can be considered low-quality if their Phred score falls below 20.

Does this look like a good run? If not, draw a boxplot of a good run.

If one could trim or filter out poor quality bases or reads, there may be enough data to assemble a genome, even if a sequencing run produces many poor-quality reads. Today, we will use Trimmomatic to improve raw data from the Ensifer bacterial genome, and use FastQC to see how the data differ before and after Trimmomatic. "Ensifer" refers to an unknown species of nitrogen-fixing bacteria from the genus Ensifer.

Learning Outcomes

1. You will practice remote cluster computing

2. You will practice using the Linux Operating System

3. You will understand Phred quality scores and construct box plots

4. You will perform file transfer from Juniata HHMI cluster to desktop using Cyberduck

5. You will use and understand data quality assessments using FastQC

6. You will use the program Trimmomatic to clean data

Vision and Change core competencies addressed in the chapter

1) Ability to apply the process of science: Evaluation of experimental evidence

2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical

methods to diverse data, Mathematical modeling, Managing and analyzing large data sets

3) Use modeling and simulation to understand complex biological systems:

Applying informatics tools, Managing and analyzing large data sets

Sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, FastQC, Trimmomatic, Cyberduck

Protocols

I. FastQC

A. Log onto the cluster. If you are already logged in, use the cd program to move to your home directory.

B. Make a directory entitled "1_trimmomatic" and enter it.

$mkdir 1_trimmomatic

$cd 1_trimmomatic

C. Copy all the FASTQ and qsub files from a shared directory into your new "1_trimmomatic" directory. It will take a moment. The command will also avoid copying embedded folders, which is OK.

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/* .

Now use "ll" (ie. lower case LL) to view the files in your "1_trimmomatic" folder. There should be three files. Notice that the read files are many mega-bytes in size, and we will delete them later to conserve space on the cluster.

D. Run the program FastQC on one of the data files to create the quality score boxplot and produce other diagnostics described in the box below.

$module load fastqc

$fastqc Ensifersp_R1_100X.fastq

E. After FastQC finishes, transfer the html file produced from the analysis to your laptop using Cyberduck and then open the file to view the following diagnostics from the FastQC manual: (also see http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).

a. Basic Statistics. This parameter is simply an overview of your input file. It tells you the name of your input file, the file type, total number of sequences, etc. This parameter is useful to ensure that FastQC correctly identified your file type and other information about your library. If the information appears incorrect, you can tell FastQC what the file type is using the option -f or --format ($fastqc --help for more options).

b. Per Base Sequence Quality. This parameter gives you the quality score for each base of your sequence. Most runs will degrade as the run progresses. The central red line indicates the median value, the yellow box is the interquartile range, the upper and lower whiskers represent the 10% and 90% points, and the green, yellow, and red sections of the background indicate good, intermediate, and poor quality, respectively.

c. Per Sequence Quality Scores. This analysis module allows you to see if any of your sequences have universally low quality scores. It is often the case that a subset of sequences will have universally poor quality, so identifying them is important.

d. Per Base Sequence Content. Shows the proportion of each of the four normal DNA bases called at each base position, over all sequence reads in the file analyzed. In a random library, there should be little to no difference in base content with position along a sequencing read.

e. Per Base GC Content. This analysis plots out the GC content of each base position averaged over all reads. This parameter should be relatively consistent.

f. Per Sequence GC Content. A plot showing the measure of GC content across the whole length of each sequence in a file, compared to a modeled normal distribution of GC content.

g. Per Base N Content. An N within a sequence represents a location in which the sequencer was unable to make a base call with sufficient confidence. This analysis shows the percent of base calls at each position for which N was called.

h. Sequence Length Distribution. A graph showing the distribution of read lengths in the file.

i. Duplicate Sequences. This result shows the degree of duplication for every sequence within your library and creates a plot showing the relative number of sequences with different degrees of duplication.

j. Over-Represented Sequences. A list of all the sequences that make up more than 0.1% of the total library.

k. Over-Represented K-mers. A plot of the count enrichment of every 5-mer in the library.

l. In addition to running FastQC from the command line, the program can also be downloaded as a user-friendly, graphical interface from: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc

What parameters from this FastQC report concern you most about the raw data? Hint: look at the warning symbols that FastQC reports.

F. Before we move on to the next step, let's take samples of our data at different coverages so we can qualitatively see how coverage affects downstream genome assembly analysis. Copy and paste the following commands. The head program prints the specified number of lines from the top of the specified file, and here we write the first 535,000 lines to new files using ">". Since each FASTQ record occupies four lines, 535,000 lines correspond to 133,750 reads, giving the ~10X coverage implied by the output file names.

$head -535000 Ensifersp_R1_100X.fastq > Ensifersp_R1_10X.fastq

$head -535000 Ensifersp_R2_100X.fastq > Ensifersp_R2_10X.fastq

II. Trimmomatic

Now let's see how we can improve our data. We will run a quality filter on the ends of the "Ensifersp_100X" and _10X raw reads using the options ILLUMINACLIP, TRAILING, LEADING, SLIDINGWINDOW, and MINLEN. ILLUMINACLIP removes adapter artifacts from sequencing. TRAILING and LEADING require you to provide a Phred score that is the minimum quality the bases at the ends of the reads must meet, or else be trimmed. SLIDINGWINDOW is similar, but ensures that a "window" of bases doesn't drop below an average Phred score; for example, SLIDINGWINDOW:4:15 scans each read with a 4-base-wide sliding window, cutting when the average quality per base drops below 15 (see the sketch below). MINLEN states the minimum length a read must retain to be kept in the dataset, so sequence reads that are trimmed to a size less than MINLEN are discarded. Try executing the Trimmomatic program on the Juniata HHMI cluster as a qsub script. A qsub script is a program that tells a computer cluster how to run a job on "worker" or "compute" nodes, rather than running the program on the "head node", which is the part of the cluster used to interface with users. Check the paths of the data files before trying to run this script and make sure to edit the directories.
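To make the SLIDINGWINDOW idea concrete, here is a minimal Python sketch of the same logic (our simplified illustration of the concept, not Trimmomatic's actual implementation; the read and scores are made up):

# Cut a read at the start of the first 4-base window whose mean Phred quality
# drops below 15, keeping only the bases before that window (SLIDINGWINDOW:4:15).
def sliding_window_trim(seq, quals, window=4, threshold=15):
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return seq[:i], quals[:i]
    return seq, quals

quals = [38, 38, 37, 36, 35, 30, 22, 14, 12, 10, 8, 2]   # made-up scores
seq, quals = sliding_window_trim("ACGTACGTACGT", quals)
print(seq)   # "ACGTAC": cut where the 4-base window average first falls below 15

A read trimmed this way would then be discarded entirely if it ended up shorter than MINLEN.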

A. Use the nano program to view the qsub file you downloaded (see the command below).

$nano trimmomatic.qsub

#!/bin/bash --login

#PBS -N trimmomatic_test

#PBS -j oe

#PBS -m abe

#PBS -M YOUR_EMAIL_ADDRESS

#PBS -q default

#PBS -l nodes=1:ppn=4,mem=20gb,walltime=01:00:00

workdir=/home/USERNAME/1_trimmomatic

cd $workdir

java -jar /share/apps/Installs/walls/Trimmomatic-0.33/trimmomatic-0.33.jar \

PE -phred33 \

Ensifersp_R1_100X.fastq \

Ensifersp_R2_100X.fastq \

Ensifersp_100X_TrimForward_Paired.fq \

Ensifersp_100X_TrimForward_Unpaired.fq \

Ensifersp_100X_TrimReverse_Paired.fq \

Ensifersp_100X_TrimReverse_Unpaired.fq \

ILLUMINACLIP:/share/apps/Installs/walls/Trimmomatic-0.33/adapters/TruSeq3-PE.fa:2:30:10 \

TRAILING:20 LEADING:20 SLIDINGWINDOW:4:15 MINLEN:36

java -jar /share/apps/Installs/walls/Trimmomatic-0.33/trimmomatic-0.33.jar \

PE -phred33 \

Ensifersp_R1_10X.fastq \

Ensifersp_R2_10X.fastq \

Ensifersp_10X_TrimForward_Paired.fq \

Ensifersp_10X_TrimForward_Unpaired.fq \

Ensifersp_10X_TrimReverse_Paired.fq \

Ensifersp_10X_TrimReverse_Unpaired.fq \

ILLUMINACLIP:/share/apps/Installs/walls/Trimmomatic-0.33/adapters/TruSeq3-PE.fa:2:30:10 \

TRAILING:20 LEADING:20 SLIDINGWINDOW:4:15 MINLEN:36

B. To submit this job to the cluster, change the e-mail address and the highlighted path so they lead to your own "1_trimmomatic" folder ("USERNAME" becomes your last name). To check the location of your directory, hit Ctrl-X followed by "Y" to exit the qsub file and save changes, then use the pwd program to get your directory. For example, typing pwd in your "1_trimmomatic" folder should give:

/home/USERNAME/1_trimmomatic

Once you've changed USERNAME to your username, you can submit your job. We will run the Trimmomatic program using qsub. The qsub program submits jobs to the cluster queue. The Juniata College HHMI cluster consists of a head node and four compute nodes. Jobs are submitted through qsub, which then allocates cluster resources and distributes the job across the compute nodes. Jobs (programs) that are not submitted through qsub will run right on the head node and will not take advantage of the processors or memory available on the compute nodes. To submit the job to the cluster, type:

$qsub trimmomatic.qsub

If your job fails immediately, check the "trimmomatic_test.o####" file to see what went wrong.

How does qsub work?

· nodes: sets number of nodes you want to run. The Juniata HHMI cluster has 6 nodes and hundreds of users, so this should usually be 1.

· ppn: sets number of processors per node you want to use. The maximum number of processors per node that may be used is 32.

· N specifies your job name

· oe tells PBS to put both normal output and error output into the same output file. The qsub output file will be generated in the working directory noted in the qsub file. You expect to see 4 files ending in .fq and a file named error.txt.

· module: loads programs on the worker nodes

To show a full listing of jobs in the queue, use the command below. Wait a few minutes until the job's status changes from R (running) to C (complete). Remember that -N in the qsub file dictates the name that you'll see here.

$qstat -all

If you notice an error in your script, you can cancel your job by typing:

$qdel [JOB NUMBER]

Once you notice that your job has changed from "R" to "C," you are ready to use fastqc again to see how the data changed.

A. Let's see how the quality of the data changed. Rerun the FastQC program as above (one node, four processors is good for a fairly small job) to get stats on one of the trimmed datasets.

B. Make sure that you're still in your "1_trimmomatic" directory then run the FastQC program on one of the .fq files.

$fastqc Ensifersp_100X_TrimForward_Paired.fq

C. Open this file like you did with the first FASTQ file (transferring it to your laptop with Cyberduck) to examine differences between the original data and the data that have been quality trimmed and filtered.

Assessment

1. How many reads were filtered? Examine the two read quality boxplots that you generated in the above exercises. How do they differ? What didn't change? Would you increase, decrease, or keep the same stringency if you were running this on your own data? Recall that stringency was determined by options for the trimmomatic-0.33.jar command in the qsub file (step II.A above).

2. Was the parameter you were worried about in the discussion question from "Protocols. Part I" above sufficiently changed? Comment on any characteristic of the original data that you thought could be improved using trimmomatic.

3. Use the Trimmomatic web page (http://www.usadellab.org/cms/index.php?page=trimmomatic) to describe how the command below would modify the given files. Does this seem reasonable?

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:40 TRAILING:40 SLIDINGWINDOW:10:35 MINLEN:50

4. Note the section called "Per Base Sequence Content" in both FastQC reports. Note the dip in A's and rise in C's at the ends of reads. Among other things, this can be caused by biases associated with the Illumina system. Describe a trimming strategy that could be used to address this problem. Hint: if stumped, move on to question 5 for more ideas.

5. Copy over and open a new qsub file ("trimmomatic_stringent.qsub") that cuts more stringently by using the following commands:

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/enrichment/*.qsub .

$nano trimmomatic_stringent.qsub

Describe in detail how this job differs from the first qsub script (i.e., how does it differ from the qsub script you ran earlier?).

Timeline of Chapter

3 hours

Discussion topics for class

1. What would the Phred score be for an error probability of 0.001? of 0.05?

2. What is the error probability for an ASCII quality score of "!"?

3. Why is using the ASCII system preferential to raw scores?

References

Andrews S. 2010. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114-2120.

Butler J, MacCallum I, Kleber M et al. 2008. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18: 810-820.

Compeau PE, Pevzner PA, Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29:987-991.

Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116.

Li R, Zhu H, Ruan J, et al. 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20:265-272. [SOAPdenovo 1]

Luo R, Liu B, Xie Y, et al. 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18. [important online supplements]

Salzberg SL, Phillippy AM, Zimin A, et al. 2012. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res 22: 557-567. [important online supplements!]

Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. 2013. The MaSuRCA genome assembler. Bioinformatics 29:2669-2677.

Further Reading

Haddock SHD, Dunn CW. 2011. Practical computing for biologists. Sinauer Associates

Chapter 2. Genome Assembly II: Assembly Size Estimation and K-mer Graphs

Background

K-mer graphs are important tools for understanding expected assembly size. Assembly size refers to the total length of DNA in the contigs and scaffolds resulting from running an assembly program. Assembly size can be less than genome size because repetitive DNA elements look the same and collapse into a single contig. Assembly size can be larger than genome size when read errors or heterozygosity make sequences look different enough that they assemble into different contigs. K-mer graphs are useful for optimizing parameters for some genome assemblers, and for understanding the prevalence of errors, heterozygosity, and repetitive DNA. An underlying assumption of whole genome sequencing projects is that reads are randomly distributed around the genome at approximately equal coverage. If a genome is sequenced at 100X coverage, one expects each nucleotide to be covered 100 times by randomly distributed reads. A common strategy in sequence analysis is decomposition of reads into shorter k-mers. The k in k-mer refers to the number of nucleotides in the shorter sequence; for example, a k-mer of 3 nucleotides has k = 3. While not all possible 100-bp reads in the genome will be generated by a sequencer even at 100X coverage, each read of length L will generate (L - k + 1) k-mers. For example, the read GCAT would decompose into (4 - 3 + 1), or two, 3-mers: GCA and CAT.
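A short Python sketch makes the (L - k + 1) relationship concrete (illustrative only; the function name is our own, and the same function answers the question below):

# A read of length L yields L - k + 1 overlapping k-mers.
def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(kmers("GCAT", 3))   # ['GCA', 'CAT']: 4 - 3 + 1 = 2 k-mers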

How many 3mers would the read GCATSEEK produce?

K-mer size is one of the most important factors in assembling reads from a dataset. Although it will be discussed more thoroughly in the next section, it is important to appreciate the relevance of proper k-mer size. Some genome assembly methods rely on a complete set of overlapping k-mers, each differing from the next by only one base, running from beginning to end of the genome. If k-mer size is too short, assembly will be badly confused by repeats. However, if k-mers are too long, many will be missing from the dataset, resulting in fragmentation. The figure to the right shows (A) that if k-mer size is long (e.g. equal to read length), high overall coverage is necessary to obtain a complete set of k-mers differing by one nucleotide each; in (A) it takes ~20 sequence reads to produce all k-mers through the sequenced region. In contrast, (B) shows how a complete set of smaller k-mers through the sequenced region can be obtained at lower coverage (just two reads overlapping by at least k bases).

The KmerGenie program (Chikhi and Medvedev 2013) is a tool designed to find the optimal k-mer size for a dataset with respect to producing the largest genome assembly. KmerGenie's developers designed an algorithm that predicts assembly size for different k-mer sizes. Since successive k-mers differ from one another by only a single base, the number of distinct k-mers is an estimate of assembly size.

K-mer histograms are a common way to analyze single genomes that are sequenced at high coverage. They can be constructed for either haploid or diploid genomes. In the diploid model, we expect to see a broad peak representing the abundance of true (i.e. non-error) k-mers. The abundance of k-mers will increase with overall sequencing coverage. For diploid organisms, heterozygous regions will form a minor peak at half the abundance of homozygous regions, since those k-mers have half the genome coverage. Figure 1 (below) shows an example of a histogram for a k-mer size of 31 in the banana slug genome assembly. It is a plot of k-mer abundance in a dataset (how many times a given k-mer is found) against how many distinct k-mers were found at that abundance. In the example below, the broad peak shows that many k-mers were found 18X in the dataset, and many were unique, at abundance 1. Heterozygous k-mers were found at about 9X. Sequencing errors affect many k-mers and cause a low-abundance bump in k-mer histograms. For genomes, the shape of the k-mer graph is fairly similar for k-mer sizes of 15 or larger; below that size, many k-mers are repeated many times throughout the genome just by chance. At 15 and larger there are two characteristic peaks, as described above, for errors and true coverage. Note that repetitive DNA will have higher coverage and causes a "heavy tail" on the main peak.
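To see where such a histogram comes from, here is a hedged Python sketch that tallies k-mer abundances from a set of reads (a toy illustration only; real tools such as KmerGenie or Jellyfish use far more memory-efficient counting):

from collections import Counter

# Count occurrences of every distinct k-mer across all reads, then tally how
# many distinct k-mers were observed at each abundance.
def kmer_histogram(reads, k):
    abundance = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            abundance[read[i:i + k]] += 1
    return len(abundance), Counter(abundance.values())

reads = ["GCATGCAT", "CATGCATG", "ATGCATGC"]   # toy reads, not real data
distinct, hist = kmer_histogram(reads, 5)
print(distinct)              # 4 distinct 5-mers: the rough assembly-size estimate
print(sorted(hist.items()))  # [(3, 4)]: four distinct 5-mers, each seen 3 times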

Figure 1. Abundance histogram from the banana slug genome. Red is the fit of the complete statistical model of the histogram (erroneous k-mers + genomic k-mers). Green represents homozygous k-mers, while blue represents heterozygous k-mers in the diploid model (from https://bananaslug.soe.ucsc.edu/data_overview:2015:analysis:kmergenie).

Learning Outcomes

1. You will construct and interpret k-mer graphs using KmerGenie

2. You will make decisions on improving data for downstream analysis

Vision and Change Competencies addressed in the chapter

· Ability to apply the process of science: Evaluation of experimental evidence

· Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical methods to diverse data, Managing and analyzing large data sets

· Use modeling and simulation to understand complex biological systems: Applying informatics tools, managing and analyzing large data sets

Sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, KmerGenie

Optional Cluster: qsub

If starting from Windows OS: PuTTY

If starting from Mac or Linux OS: SSH

Protocols

Log into the Juniata HHMI cluster as you did in pre-Tutorial chapter 1. Recall that the external IP address is 192.112.102.21 and internal is 147.73.20.27.

Use the cluster to run KmerGenie

1. Log onto the cluster.

2. From your home directory, make a directory entitled "2_kmergenie"

$cd

$mkdir 2_kmergenie

3. Move into "2_kmergenie" as your working directory

$cd 2_kmergenie

4. Examine the contents of the shared folder for this chapter using list long, or ll (lowercase LL), below. It contains copies of FASTQ files you made in chapter 1. It also contains a qsub file called "kmergenie.qsub". Then copy "kmergenie.qsub" using cp.

$ll /share/apps/sharedData/GCAT/EukGenWorkshop16/2_kmergenie/

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/2_kmergenie/kmergenie.qsub .

5. Open the qsub file for editing using nano, then edit the highlighted sections with your email address and username so that the working directory is correct. The script is explained below, but get it running first for time-efficiency.

$nano kmergenie.qsub

#!/bin/bash --login

#PBS -N kmergenie_qsub

#PBS -j oe

#PBS -m abe

#PBS -M YOUR_EMAIL_ADDRESS

#PBS -q default

#PBS -l nodes=1:ppn=8,mem=20gb,walltime=01:00:00

module load Python

module load RMod

module load kmergenie

workdir=/home/USERNAME/2_kmergenie

cd $workdir

cat /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/*.fastq > input.fastq

kmergenie input.fastq -o 100X_untrimmed

rm input*

cat /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/Trimmomatic_Outputs/Ensifersp_100X_TrimForward_Paired.fq /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/Trimmomatic_Outputs/Ensifersp_100X_TrimReverse_Paired.fq > input.fastq

kmergenie input.fastq -o 100X_Trimmed

rm input*

cat /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/Trimmomatic_Outputs/Ensifersp_10X_TrimForward_Paired.fq /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/Trimmomatic_Outputs/Ensifersp_10X_TrimReverse_Paired.fq > input.fastq

kmergenie input.fastq -o 10X_Trimmed

rm input*

6. Send the job to the worker nodes by entering:

$qsub kmergenie.qsub

These commands are now running on qsub. THERE IS NO NEED TO RE-RUN THESE COMMANDS

How does KmerGenie work?

The lines at the bottom of the qsub script are now running several different commands sequentially:

1. The following lines load the programs necessary to make KmerGenie work, including KmerGenie.

module load Python

module load RMod

module load kmergenie

2. The following line concatenates (merges) the raw FASTQ files for Ensifer from the shared directory into a single file in your current working directory called "input.fastq". The later cat lines may look intimidating, but the only difference is that the files' paths are written out in full so that the program doesn't concatenate the wrong files by mistake.

cat /share/apps/sharedData/GCAT/EukGenWorkshop16/1_trimmomatic/*.fastq > input.fastq

3. The following line invokes the program KmerGenie to make k-mer histograms and determine an optimal k-mer size. Details regarding this process can be found in "Informed and automated k-mer size selection for genome assembly" (Chikhi and Medvedev 2013). An optional --diploid flag would model a dataset as diploid, which is appropriate for non-prokaryotic studies.

kmergenie input.fastq -o 100X_untrimmed

NOTE: -h or --usage after the command will explain the options available with most programs

4. The following removes the large, unnecessary intermediate datasets using rm. Note that this will also delete any other files starting with "input" in your working directory. You already have these data in your "1_trimmomatic" folder, so we don't need to waste space holding duplicates.

rm input*

7. Use the program Cyberduck to transfer the report.html files that you generated to your laptop.

Assessment

1. Look at the html report that you generated for your data. According to this report, what is the optimal k-mer size for your untrimmed data? The 100X coverage? The 10X coverage?

2. Offer an explanation for the large difference in optimal k-mer size between the 10X and 100X datasets noted in question (1).

3. Based on this information, what k-mer size would you want to try for assembly of the 100X trimmed reads? Explain. Use the kmergenie manual to help you answer.

Time line of chapter

Two hours of lab

Discussion topics for class

Go to your histogram report. Examine the plot of Number of Genomic K-mers versus K-mer size and explain the plot.

References

Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116.

Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27:764-770. [Jellyfish program]

Chikhi R, Medvedev P. 2013 Informed and automated k-mer size selection for genome assembly. Bioinformatics 30: 31-37.

Chapter 3. Genome assembly III: Assembly algorithms

Background

Modern genome sequencing involves computer-based reconstruction of the sequence of the millions of base pairs that make up eukaryotic chromosomes, from millions of random DNA sequence reads about 100 bp long that cover the entire length of the DNA double helix in each chromosome. Large fragments of DNA sequence can be computationally assembled from small fragments of DNA sequence if the fragments sufficiently overlap each other. An analytical technique called de Bruijn graphing was developed to deal with the difficult computational problem of comparing every DNA sequence fragment against every other when a single dataset contains millions of DNA sequence fragments. The method involves dividing raw DNA reads into short stretches of DNA called k-mers, and extending DNA chains through exact matches between k-mers that differ from each other by a single nucleotide. Note that any particular genomic sequence dataset will not contain all possible k-mers, since only a subset of k-mers will actually be part of a species' genome.

As an example of de Bruijn graphing technique, consider the small circular genome shown below with the sequence GCAT:

1. Say the following set of 3-bp reads is obtained from randomly sequencing the genome: {ATG, CAT, TGC, GCA}. To construct a genome using a de Bruijn graph:

2. Break the reads into smaller segments called k-mers. A k-mer of size 2 will produce the following dimers: {AT, TG, CA, AT, TG, GC, GC, CA}. Eliminate duplicate k-mers.

3. Start the graph by making vertices (bubbles) that contain prefixes (the first k-1 letters of the k-mers) and suffixes (the last k-1 letters of the k-mers). Eliminate duplicates. In this case the vertices are simply the letters A, G, C, and T.

4. Connect the unique vertices using numbered, directed arrows (edges) labeled with the k-mers that contain both the prefix and the suffix. Each k-mer extends the previous vertex by one bp. Consider how each k-mer connects vertices. Use each edge (k-mer) only once, crossing out k-mers as they are used. The graph is done when all k-mers have been considered. If your graph forms a loop, it represents a circular genome; to construct the genome sequence, connect the last letter of each k-mer in the order determined by the graph. If your graph is linear, use the entire first k-mer and then connect the last letter of each successive k-mer to get the sequence.
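The same four steps can be expressed as a short Python sketch (a toy illustration for this exercise only; real assemblers must handle errors, reverse complements, and branching graphs, and this sketch assumes the edges form a single unbranched path or cycle):

# Build the edge list (unique k-mers) from the reads and walk the graph,
# using each edge only once and appending the last letter of each k-mer,
# as in steps 1-4 above.
def de_bruijn_walk(reads, k):
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in edges:      # step 2: eliminate duplicate k-mers
                edges.append(kmer)
    node = edges[0][:k - 1]            # pick a starting prefix vertex
    sequence = ""
    while edges:
        edge = next(e for e in edges if e.startswith(node))  # assumes one path
        edges.remove(edge)             # step 4: use each edge only once
        sequence += edge[-1]           # append the last letter of the k-mer
        node = edge[1:]                # move to the suffix vertex
    return sequence

print(de_bruijn_walk(["ATG", "CAT", "TGC", "GCA"], 2))   # "TGCA", a rotation of the circular genome GCAT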

Q1. Now you try. Say the following set of 4bp reads are obtained from randomly sequencing a different genome {GCAT, GGCA, CATT, ATTA, TTAG, AGGC, TAGG}. Construct the genome sequence using a de Bruijn graphing approach, using k-mer size of 3, following the 4 steps above!

Q2. Fragmentation. The problem with genome construction from random (aka shotgun) short read data is that, despite high coverage of the genome, genome assemblies remain fragmented. What would your assembly above look like if you only had reads {GGCA, ATTA, AGGC}? How many sequence fragments (contigs) would result? What would the sequence be?

Q3. Repeats. Another major problem with genome construction is that repetitive DNA sequences make it impossible to reconstruct genomes from short reads. The resulting graphs are "tangled." Try reconstructing the circular genome (CTATCATCG) using the de Bruijn graphing approach, from the following set of 3bp reads: {CTA, TAT, ATC, TCA, CAT, TCG, CGC, GCT}. You will find that there is more than one path through the graph, which would terminate assembly.

Errors. Errors in sequence reads greatly complicate graph construction. Each error affects many k-mers and causes "bulges" in de Bruijn graphs. In fact, if k-mer size is 4, each error will affect 4 k-mers (see supplementary figures in Compeau et al. 2011). Common issues that errors create in graph construction are illustrated by figures 3 and 4 of Miller et al. (2010). The frequency of k-mers is used by most genome assemblers to error-correct graphs, since random, low-frequency errors should be far less supported than true k-mers. After initial graphing, reads that span the length of a repeat, paired ends spanning either side of a repeat, and knowledge of the distance between read pairs can be used to resolve ambiguities in a graph.

Key Points

Repeats can be resolved if k-mers or reads extend longer than the repeat length, or if insert size exceeds repeat length and both ends of the insert are sequenced. However, large k-mers may result in genome fragmentation (i.e. reduced contig size) because de Bruijn graphs extend DNA chains by one nucleotide at a time, and a single missing k-mer will break extension.

SOAPdenovo

SOAPdenovo (Li et al. 2010; Luo et al. 2012) is a common genome assembly program using de Bruijn graphs. It comprises four distinct functions, typically run in turn via the all command:

Pregraph: construct k-mer-graph

Contig: eliminate errors and output contigs

Map: map reads to contigs

Scaff: construct scaffolds

All: do all of the above in turn

Error correction in SOAPdenovo itself includes calculating k-mer frequencies and filtering k-mers below a certain frequency (in high coverage sequencing experiments, rare k-mers are usually the result of sequencing error), and correcting bubbles and frayed rope patterns. It creates de Bruijn graphs and, to create scaffolds, maps all paired reads to contig consensus sequences, including reads not used in the graph. Below we use all SOAPdenovo modules to build an assembly on data that has already been through error correction screens.

Learning outcomes

1. You will perform genome assembly by de Bruijn graphing by hand and using Linux OS

2. You will explore effects of key variables on assembly quality

3. You will measure assembly quality

Vision and Change core competencies addressed in the chapter

1) Ability to apply the process of science: Observational strategies, Hypothesis testing, Experimental design, Evaluation of experimental evidence, Developing problem-solving strategies

2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical

methods to diverse data, Mathematical modeling, Managing and analyzing large data sets

3) Use modeling and simulation to understand complex biological systems: Computational

modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data sets, Incorporating stochasticity into biological models

Sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, SOAPdenovo2, QUAST, MaSuRCA

If using a Cluster: qsub

If starting from Windows OS: PuTTY

If starting from Mac or Linux OS: SSH

Protocols

I. Genome assembly using SOAPdenovo2 in the Linux environment

We will perform genome assembly on the Juniata HHMI cluster. In this environment CAPITALIZATION and PERFECT SPELLING REALLY, REALLY COUNT!!! For purposes of brevity, our work will focus on bacterial genomes, but the computational work we do can be applied to even mammalian-sized genomes (>1 Gbp). We will assemble a genome from 2 error-corrected bacterial genome files. The bacterial genome size is ~5 Mbp. The raw data include a paired-end fragment library with an "insert length" of 550 bp, in an "innie" orientation (forward and reverse reads facing each other), with 2x250-bp paired-end MiSeq reads providing up to 100X genome coverage.

A. Go to the directory with your last name (using cd), make a directory entitled "3_soap", check that it is there, and move into it.

$cd

$mkdir 3_soap

$ll

$cd 3_soap

B. Copy all the data from our GCAT shared directory to your directory. Don't forget the space and "." at the end of the line. You may divide analysis of different datasets among team members. After copying the data, use ls -l (or ll) to see what was copied, as well as how big the different files are.

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_soap/config* .

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_soap/soap.qsub .

$ll

C. Open the "config.txt" file using nano. This text file tells the SOAPdenovo program which files to use, and where you will enter important characteristics of the data. Examine contents using nano.

$nano config100XUntrimmed.txt

The file should read as follows. Note that # signs indicate comments that explain the parameters below them and are ignored by the program.

#maximal read length

max_rd_len=250

#below starts a new library

[LIB]

#average insert size

avg_ins=550

#if sequence needs to be reversed put 1, otherwise 0

reverse_seq=0

#in which part(s) the reads are used. Flag of 1 means only contigs.

asm_flags=1

#use only first 250 bps of each read

rd_len_cutoff=250

#in which order the reads are used while scaffolding. Small fragments usually first, but worth playing with.

rank=1

# cutoff of pair number for a reliable connection (at least 3 for short insert size). Not sure what this means.

pair_num_cutoff=3

#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)

map_len=32

#a pair of fastq files, read 1 file should always be followed by read 2 file

q1=/share/apps/sharedData/GCAT/EukGenWorkshop16/3_soap/Ensifersp_R1_100X.fastq

q2=/share/apps/sharedData/GCAT/EukGenWorkshop16/3_soap/Ensifersp_R2_100X.fastq

Use nano to look at the other config files, which will be used for the 100X and 10X datasets that were trimmed. How are they different from the "config100XUntrimmed.txt" file? Do you remember where you saw these values before?

Hit control-X to exit. Notice that the last two lines control the data input.

D. Run SOAPdenovo using qsub. First use $nano to edit the file soap.qsub and edit the working directory, as in previous chapters. We will use the high-memory (127mer) version of SOAPdenovo, a k-mer size (-K) of 21, 8 processors (-p), the config file you just examined (-s), and give all output files the prefix ensUnTrimmed100XK21 (-o). The last line of the qsub control file specifies these parameters.

#!/bin/bash --login

#PBS -N soapqsub

#PBS -j oe

#PBS -m abe

#PBS -M YOUR_EMAIL_ADDRESS

#PBS -q default

#PBS -l nodes=1:ppn=8,mem=20gb,walltime=01:00:00

module load SOAPdenovo2

workdir=/home/USERNAME/3_soap

cd $workdir

SOAPdenovo-127mer all -K 21 -p 8 -s config100XUntrimmed.txt -o ensUnTrimmed100XK21

SOAPdenovo run details

· SOAPdenovo-127mer allows k-mer size to vary up to 127

· K sets k-mer size (e.g. 21)

· p is the total cpus (must match #nodes * #ppn; e.g. 1*8=8)

· s identifies the configuration file

· o sets a prefix for all the output files for that run

E. Run qsub using

$qsub soap.qsub

F. Examine directory contents using ls

G. You will see a job identifier appear. Write the number down. This job will take a while to run. Check to see if it is running by typing:

$qstat -all

To see if your job is running, find your job number and look for an "R" in status. When the job is complete, you will see a "C." After a couple minutes, use "ls" to see if any output files have appeared. If not, there has probably been an input error.

To see what may have gone wrong, use ll to see the exact name of files, and use less to open up the "Soap_output" file that ends with your job number. This file will tell you the details of your qsub submission.

$ll

$less SOAP_output.your_job_number

Page down within the file to find the error message and try to troubleshoot using that information. To quit "less," press "q".

H. When SOAPdenovo finishes, examine the soap directory contents using ls. Each successful run of SOAPdenovo generates about 20 files, only a few of which contain information that we will use.

$ll (to view the directory contents in detailed view)

$less *.scafSeq (to view the scaffold sequences; does the file look like you expected?)

$tail -n 50 *.scafSeq (to reveal the tail end of the file, the last 50 lines, which contains the largest assembled sequences)

$less *.scafStatistics (to view detailed information on the scaffold assembly; record the relevant information in the table below and compare with a group member)

Size_includeN:Total size of assembly in scaffolds, including Ns

Size_withoutN:Total size of assembly in scaffolds, not including Ns

Scaffold_Num:Number of scaffolds

Mean_Size:Mean size of scaffolds

Median_Size:Median size of scaffolds

Longest_Seq:Longest scaffold

Shortest_Seq:Shortest scaffold

Singleton_Num:Number of singletons

Average_length_of_break(N)_in_scaffold: Average length of unknown nucleotides (N) in scaffolds

Also contained will be counts of scaffolds above certain sizes, percent of each nucleotide and N (Gap) values, and "N statistics." Recall from the pre-tutorial chapters that N50 is the size of the smallest scaffold such that 50% of the genome is contained in scaffolds of size N50 or larger (Salzberg et al. 2012). The line beginning with "N50 35836 28" indicates that there are 28 scaffolds of at least 35836 nucleotides, and that they contain at least 50% of the overall assembly length. Statistics for contigs (pre-scaffold assemblies) are also shown.
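N50 is simple to compute from a list of scaffold lengths, as this Python sketch shows (made-up toy lengths, not the Ensifer assembly):

# Sort lengths from largest to smallest and accumulate until reaching half of
# the total assembly length; the scaffold that crosses the halfway point is the N50.
def n50(lengths):
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

print(n50([80, 70, 50, 40, 30, 20, 10]))   # total 300; 80 + 70 = 150 >= 150, so N50 = 70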

I. Rerun your jobs at different k-mer sizes (see below) to explore how assembly is affected. Spread datasets among groups. While the jobs run, check your KmerGenie results to see which k-mer size is expected to produce the best assembly for each dataset.

· To run a batch of jobs with different k-mer sizes, copy/paste the last line of the qsub script several times, adjusting the k-mer size (-K option) and output file name (-o option) to match the k-mer size and data file used. You can copy/paste in Linux by selecting the text you want with a mouse, hitting control-C, and then right-clicking in the area to paste.

· To run a batch of jobs on different input files containing different coverages, edit and save the config.txt file specifying the datasets to be analyzed. Each time you change the k-mer size or input file name, adjust the -o option in the qsub script to change the names of the output files. For example, use "ensTrimmed100XK117" to label output files as resulting from the trimmed 100X data with a k-mer size of 117.

· Run the new qsub script.

Assessment (Part I)

Fill in the table below and cut/paste your observations/results into your notebook. "Num_Scaffs" refers to the total number of scaffolds.

How did increasing k-mer size (21 to 119), increasing coverage (10X to 100X), and quality filtering (untrimmed vs. trimmed) affect the quality of assembly? Support your arguments citing specific results. Were the KmerGenie "Best k-mer" predictions useful?

Protocols (continued)

II. Genome assembly using MaSuRCA in the Linux environment

MaSuRCA (Maryland Super Read Cabog Assembler; Zimin et al. 2013) uses a hybrid approach to genome assembly that combines the benefits of the de Bruijn graph and Overlap-Layout-Consensus (OLC) assembly methods. From Zimin et al. 2013:

"OLC methods compute all pairwise overlaps between reads based on sequence similarity. The algorithm then creates a layout with all overlapping reads from which a consensus sequence is obtained by scanning the multi-read alignment. This approach has flexibility with respect to read lengths and robustness to sequencing errors. 

The de Bruijn graph assembly algorithm creates a de Bruijn graph from the read data. K-mers from every read are assigned to directed edges in a graph connecting nodes. Paths in the graph that go through every edge once, known as Eulerian paths, build an assembly of the reads. Contrary to OLC methods, de Bruijn graph assembly methods are computationally efficient."

MaSuRCA can assemble whole genomes from short-read Illumina data or a combination of short reads and longer reads (like those from Sanger and 454 technologies). The MaSuRCA assembler creates a smaller number of "super-reads" from the many paired-end reads. These super-reads contain all of the sequence information from the original reads and allow for the use of long and short reads in the assembly. For the OLC component of the assembly, MaSuRCA uses a modified version of the CABOG assembler and creates the super-reads using only reads that are not substrings of larger reads. The super-reads are built by extending each of the original reads base by base in the forward and reverse directions as long as there is no ambiguity with respect to which base comes next. If two identical super-reads are made, only one of them is kept, which greatly reduces the size of the dataset and makes the assembly less computationally intensive. The OLC assembler (CABOG) incorporates the mate-pair and super-read information into the assembly.
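The base-by-base extension idea can be sketched in a few lines of Python (our simplified illustration of the concept only, assuming a precomputed dictionary of k-mer counts; MaSuRCA's actual super-read construction is considerably more involved):

# Extend a read forward one base at a time while the k-mer counts support
# exactly one next nucleotide; stop at any ambiguity or dead end.
def extend_forward(read, kmer_counts, k, max_len=10000):
    while len(read) < max_len:
        suffix = read[-(k - 1):]   # last k-1 bases of the growing read
        options = [b for b in "ACGT" if kmer_counts.get(suffix + b, 0) > 0]
        if len(options) != 1:      # zero or several supported bases: stop
            break
        read += options[0]
    return read

counts = {"ATG": 5, "TGC": 5, "GCA": 5, "CAT": 2, "CAA": 2}   # toy k-mer counts
print(extend_forward("ATG", counts, 3))   # "ATGCA": halts where CAT and CAA both have support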

MaSuRCA has a built-in error corrector, QuorUM, which also corrects, trims, and removes reads. In order not to throw out low-coverage regions of the genome, it eliminates the k-mer cutoff step altogether and instead questions a k-mer in a read when, as one moves along the read, there is a sudden drop to low coverage (Marcais et al. 2013). If there is unambiguous support from other k-mers for a different nucleotide, it will correct the read. If there are too many corrections over a short space (3 corrections in 10 bp), the read is truncated before the three corrections. Note that the QuorUM error checker assumes uniform coverage of the genome, so running Trimmomatic as a step prior to MaSuRCA and QuorUM is not recommended; one would instead start a MaSuRCA assembly using data prior to quality filtering.

How to Run MaSuRCA

1. First create a new "3_masurca" directory in your home directory and move into it. Copy a configuration file with the location of the compiled assembler, the data, and some parameters (using the defaults for all but JF_SIZE). Copy a template qsub file as well.

$cd

$mkdir 3_masurca

$cd 3_masurca

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_MaSuRCA/config.txt .

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_MaSuRCA/masurca.qsub .

The configuration file does not need to be edited, but open it with nano ($nano config.txt) to examine its contents:

DATA

PE = pa 550 83 /share/apps/sharedData/GCAT/EukGenWorkshop/3_MaSuRCA/Ensifersp_R1_100X.fastq /share/apps/sharedData/GCAT/EukGenWorkshop/3_MaSuRCA/Ensifersp_R2_100X.fastq

END

PARAMETERS

GRAPH_KMER_SIZE=auto

USE_LINKING_MATES=1

NUM_THREADS=8

JF_SIZE=122529439

DO_HOMOPOLYMER_TRIM=0

END 

The DATA section includes PE (paired-end library type), a two-letter prefix (pa here, but any two letters work), the average library insert size (550), its standard deviation (83), and the two input read files. Insert and SD statistics are estimated from sequencing library preparation gels/profiles.

Note that JF_SIZE (the Jellyfish hash size) should be about 10x the expected genome size.
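For reference, if you later add a mate-pair (jump) library, the DATA section would gain a JUMP line. The sketch below follows the syntax in the MaSuRCA quick-start guide; the insert sizes and file paths are placeholders, not values for our dataset:

DATA
PE = pa 550 83 /FULL_PATH/reads_R1.fastq /FULL_PATH/reads_R2.fastq
JUMP = sh 3600 200 /FULL_PATH/jump_R1.fastq /FULL_PATH/jump_R2.fastq
END

As with PE, the JUMP fields are a two-letter prefix, the mean insert size, its standard deviation, and the two read files.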

2. Prepare the assembler by typing:

$module load MaSuRCA

$masurca config.txt

This generates a shell script called "assemble.sh" from the configuration file, which drives the assembly process.

3. In the qsub file below you will run the script 'assemble.sh' to start the assembly. Open it with nano and edit YOUR_EMAIL_ADDRESS and USERNAME to match your account:

$nano masurca.qsub

#!/bin/bash --login

#PBS -N masurca_qsub

#PBS -j oe

#PBS -m abe

#PBS -M YOUR_EMAIL_ADDRESS

#PBS -q default

#PBS -l nodes=1:ppn=8

workdir=/home/USERNAME/3_masurca

cd $workdir

./assemble.sh

4. Run the qsub file:

$qsub masurca.qsub

***If the assembly fails, it can be restarted by deleting any/all files that contain incorrect or truncated contents. Then, running:

$masurca config.txt

in the assembly directory will create a new 'assemble.sh' script that accounts for the files that are already present and checks for all dependencies so that it only runs the steps that need to be run. Then running "assemble.sh" as before restarts the assembly.
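In short, a restart (assuming the assembly directory from step 1) looks like:

$cd ~/3_masurca

$masurca config.txt

$./assemble.sh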

5. Monitor the assembly's progress by looking at the qsub output file created in your working directory.

Example

$less AW1_Ensifer_MaSuRCA.o8649

processing pe library reads Thu Apr 9 14:43:43 EDT 2015

Average PE read length 194

choosing kmer size of 31 for the graph

running Jellyfish Thu Apr 9 14:44:19 EDT 2015

cat: write error: Broken pipe

MIN_Q_CHAR: 33

Error correction Poisson cutoff = 4

error correct PE Thu Apr 9 14:48:02 EDT 2015

Estimated genome size: 6661646

computing super reads from PE Thu Apr 9 14:51:02 EDT 2015

Linking PE reads 9206

Celera Assembler Thu Apr 9 14:51:47 EDT 2015

ovlMerThreshold=75

Overlap/unitig success

Overlap/unitig success

Unitig consensus success

CA success

Gap closing Thu Apr 9 14:58:54 EDT 2015

Gap close success. Output sequence is in CA/10-gapclose/genome.{ctg,scf}.fasta

All done Thu Apr 9 15:09:00 EDT 2015

FYI, the 10x Ensifer (7 Mbp genome) assembly took over an hour to finish on 4 processors.

6. The last step of the assembly produces a "CA/10-gapclose/" directory which contains the genome in scaffolds (genome.scf.fasta). Run QUAST from within the "CA/10-gapclose/" directory to obtain assembly statistics:

$module load QUAST4.1

$quast.py -o quast_Ensifer -m 200 -l MaSuRCA genome.scf.fasta

Note: -o is the name of the output directory, -m sets the minimum contig size used for statistical summaries to 200, -l is the label (usually the name of the assembler), and the final item is the genome FASTA file from MaSuRCA.
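If you want to compare assemblies side by side later, QUAST accepts several FASTA files at once along with comma-separated labels. For example (the SOAPdenovo path below is a placeholder for wherever your scaffold file actually lives):

$quast.py -o quast_compare -m 200 -l MaSuRCA,SOAPdenovo genome.scf.fasta /path/to/soap.scafSeq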

7. Move into the QUAST output directory:

$cd quast_Ensifer/

8. View the assembly statistics:

$less report.txt

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly MaSuRCA

# contigs (>= 0 bp) 576

# contigs (>= 1000 bp) 158

# contigs (>= 5000 bp) 84

# contigs (>= 10000 bp) 74

# contigs (>= 25000 bp) 53

# contigs (>= 50000 bp) 41

Total length (>= 0 bp) 7246203

Total length (>= 1000 bp) 7054194

Total length (>= 5000 bp) 6910275

Total length (>= 10000 bp) 6832000

Total length (>= 25000 bp) 6449327

Total length (>= 50000 bp) 6007872

# contigs 281

Largest contig 411657

Total length 7132884

GC (%) 61.99

N50 155092

N75 71549

L50 15

L75 31

# N's per 100 kbp 2.80

Note that L50 is the number of contigs/scaffolds of length greater than or equal to the N50 size; equivalently, it is the minimum number of contigs that together contain half of the assembly. Here, the 15 largest scaffolds (L50 = 15) hold at least half of the 7,132,884 bp total, and the smallest of them is 155,092 bp long (the N50).

Note also that QUAST's default summary statistics exclude contigs <500 bp, whereas the SOAPdenovo statistics excluded contigs <200 bp.

9. Now we can run the assembly on the 100X trimmed data. Follow steps 2-7 again, but first make a new directory within your "3_masurca" directory so that the new run doesn't overwrite your previous assembly.

$mkdir paired100xTrimmed

$cd paired100xTrimmed

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_MaSuRCA/assembly_2/config.txt .

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_MaSuRCA/assembly_2/QsubFile.sh .

Report results of the MaSuRCA assembly of the 100x untrimmed and trimmed data. What differences do you see compared to the SOAP assemblies? Explain.

10. MaSuRCA outputs many GB of result files, so delete them when you are done, after returning to your home directory.

$cd

$rm -rf 3_masurca

III. Reference guided assembly via Scaffold Builder

Let's perform a reference guided assembly with Scaffold Builder; please refer to Silva et al. (2013) for details. We will use the following reference sequence

>ref

GTATCTGCCATTTCTGAAATCTCTGCGCTAAATGAAGGAATAACATGTTAGATCTGATGTCATTCAATGTTTGATATATTTTACACAAAAAACATTATATGGTAACTTACAGTGTTCGAATTACACTATAGCTTCCGGGAAAGTTCTATTTTTCAGGGATGGGAGGGTACTGTACACAGTACTGTACTGTACTTTGTAATTGTAATCTATAGTGGTTGTTTGCATAAAAACAGACAATTCAAATGATTATGATTATTTAAATATTTGCTCTGATTAAACAAATCTTGGCAAGCTGCTTAGGGCTAGTGTTAGGAAGATAAACAGGTCATAATTCCCTCTCTAATATCTATTTTATATTGCTGTGAAAACATCTTCAAACAACTGGATTTTGAACACCTCATAATAATAACTCAATGGCATCTCTGTTAACGTTAGTAGCTCCATTATTTAGCCAAGCAGGACTGTGCTGCTCACCGGCTGCCAACAAGTAACGTCACTAACTCTCCTCCTGTTGATTTCCACATGGTTTTGTTCTTCACAGTGTAGCATTATCAGGGAAAAAATAGGGTGCGCCCCCTCAGGTTTAGTAGCGCCACGCTGTTGAGCGCTTTTTTTTTCTCCACCGTGTCTCCGGTCCCAAACAAACTGAAACATGATAAAGAGTGTGTATGCGTCATTGTGCATGCGTAACGAAAGCATCAATACAGAGGAAACAAAAGTCACGGGGGCATGAGATGCCAAGGACAACTAGGAACCCGTTCTCTATTCCCATCTCTAGCGTTTTCCCTTCCTGCCAAAGTGGTTGATGGCCTGACGATTTTACCCGCCACTGCCAACAATTTATCCGCATTTGGCGGGTGGCGGTTGCTAATTTCAGACCCTGGATATACAGTATGTATATAGATGGACGACGCGTCTCCACTTCCTCCCACTGTACAAAAGTGAAGTCAAAATATTCCGGATACGGGAGCTGCCATCTTGAGATTTTAACGTCATTTGGAGCCAGAGTCTGCGCAGTAGTGATCGGGTGGCCGAGCAGCGGTATCGATG

and the following three fragments from it…

>frag1

GTATCTGCCATTTCTGAAATCTCTGCGCTAAATGAAGGAATAACATGTTAGATCTGATGTCATTCAATGTTTGATATATTTTACACAAAAAACATTATATGGTAACTTACAGTGTTCGAATTACACTATAGCTTCCGGGAAAGTTCTATTTTTCAGGGATGGGAGGGTACTGTACACAGTACTGTACTGTACTTTGTAATTGTAATCTATAGTGGTTGTTTGCATAAAAACAGACAATTCAAATGATTATGATTATTTAAATATTTGCTCTGATTAAACA

>frag2

CTCTGTTAACGTTAGTAGCTCCATTATTTAGCCAAGCAGGACTGTGCTGCTCACCGGCTGCCAACAAGTAACGTCACTAACTCTCCTCCTGTTGATTTCCACATGGTTTTGTTCTTCACAGTGTAGCATTATCAGGGAAAAAATAGGGTGCGCCCCCTCAGGTTTAGTAGCGCCACGCTGTTGAGCGCTTTTTTTTTCTCCACCGTGTCTCCGGTCCCAAACAAACTGAAACATGATAAAGAGTGTGTATGCGTCATTGTGCATGCGTAACGAAAGCATC

>frag3

ACGTCACTAACTCTCCTCCTGTTGATTTCCACATGGTTTTGTTCTTCACAGTGTAGCATTATCAGGGAAAAAATAGGGTGCGCCCCCTCAGGTTTAGTAGCGCCACGCTGTTGAGCGCTTTTTTTTTCTCCACCGTGTCTCCGGTCCCAAACAAACTGAAACATGATAAAGAGTGTGTATGCGTCATTGTGCATGCGTAACGAAAGCATCAATACAGAGGAAACAAAAGTCACGGGGGCATGAGATGCCAAGGACAACTAGGAACCCGTTCTCTATTCCCATCTCTAGCGTTTTCCCTTCCTGCCAAAGTGGTTGATGGCCTGACGATTTTACCCGCCACTGCCAACAATTTATCCGCATTTGGCGGGTGGCGGTTGCTAATTTCAGACCCTGGATATACAGTATGTATATAGATGGACGACGCGTCTCCACTTCCTCCCACTGTACAAAAGTGAAGTCAAAATATTCCGGATACGGGAGCTGCCATCTTGAGATTTTAACGTCATTTGGAGCCAGAGTCTGCGCAGTAGTGATCGGGTGGCCGAGCAGCGGTATCGATG

1. Move to your home directory. A reference guided assembly using the program Scaffold Builder would be performed like this:

$cd

$mkdir 3_scaffBuilder

$cd 3_scaffBuilder

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/3_scaffoldBuilder/* .

Edit the qsub file (shown below) to set your working directory, then run it.

$nano scaffBuilder.qsub

$qsub scaffBuilder.qsub

#!/bin/bash --login

#PBS -N scaffBuilder_test

#PBS -j oe

#PBS -m abe

#PBS -q default

#PBS -l nodes=1:ppn=1

module load MUMmer3.23

module unload python

module load pythonShared

workdir=/home/USERNAME/3_scaffBuilder

cd $workdir

python ./scaffold_builder.py -q test.fasta -r ref.fasta -p RefGuided

OUTPUT

>Scaffold_1

GTATCTGCCATTTCTGAAATCTCTGCGCTAAATGAAGGAATAACATGTTAGATCTGATGTCATTCAATGTTTGATATATTTTACACAAAAAACATTATATGGTAACTTACAGTGTTCGAATTACACTATAGCTTCCGGGAAAGTTCTATTTTTCAGGGATGGGAGGGTACTGTACACAGTACTGTACTGTACTTTGTAATTGTAATCTATAGTGGTTGTTTGCATAAAAACAGACAATTCAAATGATTATGATTATTTAAATATTTGCTCTGATTAAACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTCTGTTAACGTTAGTAGCTCCATTATTTAGCCAAGCAGGACTGTGCTGCTCACCGGCTGCCAACAAGTAACGTCACTAACTCTCCTCCTGTTGATTTCCACATGGTTTTGTTCTTCACAGTGTAGCATTATCAGGGAAAAAATAGGGTGCGCCCCCTCAGGTTTAGTAGCGCCACGCTGTTGAGCGCTTTTTTTTTCTCCACCGTGTCTCCGGTCCCAAACAAACTGAAACATGATAAAGAGTGTGTATGCGTCATTGTGCATGCGTAACGAAAGCATCAATACAGAGGAAACAAAAGTCACGGGGGCATGAGATGCCAAGGACAACTAGGAACCCGTTCTCTATTCCCATCTCTAGCGTTTTCCCTTCCTGCCAAAGTGGTTGATGGCCTGACGATTTTACCCGCCACTGCCAACAATTTATCCGCATTTGGCGGGTGGCGGTTGCTAATTTCAGACCCTGGATATACAGTATGTATATAGATGGACGACGCGTCTCCACTTCCTCCCACTGTACAAAAGTGAAGTCAAAATATTCCGGATACGGGAGCTGCCATCTTGAGATTTTAACGTCATTTGGAGCCAGAGTCTGCGCAGTAGTGATCGGGTGGCCGAGCAGCGGTATCGATG
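As a quick sanity check, you can count the Ns that Scaffold Builder inserted between the non-overlapping fragments. The file name below assumes the output FASTA carries the "RefGuided" prefix given by -p; check ls for the exact name in your directory:

$grep -v ">" RefGuided*.fasta | tr -cd 'N' | wc -c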

Optional Activity

1. Make the example more realistic by

a. Adding a separate FASTA sequence to the multi-FASTA group that does not align to reference.

b. Appending some sequence that does not align to the reference to frag2 above (a sketch of both parts follows below).
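For part (a), one way to add a non-aligning record to a copy of the query file (the extra sequence shown is arbitrary, invented bases):

$cp test.fasta test2.fasta

$echo ">frag4" >> test2.fasta

$echo "TTGGTTGGAACCAACCTTGGTTGGAACCAACCTTGGTTGGAACC" >> test2.fasta

For part (b), open test2.fasta in nano and paste extra arbitrary bases onto the end of the frag2 sequence line; then rerun the qsub with -q pointed at test2.fasta.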

Timeline of chapter

Four hours

Discussion

1. Look at the SOAPdenovo manual to see how to include both a regular paired-end and a mate-pair library; a sketch of such a two-library configuration appears below. Use the information from the original web page to determine the real library properties: http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/readme.txt
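As a starting point, a two-library SOAPdenovo configuration file might look like the hedged sketch below. The insert sizes and file names are placeholders to be replaced with the real values from the readme; asm_flags=3 uses a library for both contig building and scaffolding, asm_flags=2 for scaffolding only, and reverse_seq=1 marks outward-facing mate-pair reads:

#maximal read length
max_rd_len=101
[LIB]
#paired-end fragment library
avg_ins=180
reverse_seq=0
asm_flags=3
rank=1
q1=frag_1.fastq
q2=frag_2.fastq
[LIB]
#mate-pair (jumping) library
avg_ins=3500
reverse_seq=1
asm_flags=2
rank=2
q1=shortjump_1.fastq
q2=shortjump_2.fastq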

References

Compeau PEC, Pevzner PA, Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology. 29:987-991.

Li R, Zhu H, Ruan J, et al. 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20:265-272. [SOAPdenovo 1]

Luo R, Liu B, Xie Y, et al. 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18. [important online supplements]

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376-380.

Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95: 315-327.

Salzberg SL, Phillippy AM, Zimin A, et al. 2012. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res 22: 557-567. [important online supplements!]

Silva GGZ et al. 2013. Combining de novo and reference-guided assembly with scaffold_builder. Source Code for Biology and Medicine 8(1):1.

Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. 2013. The MaSuRCA genome assembler. Bioinformatics 29(21):2669-2677.

Chapter 4. Genome annotation with Maker I: Overview and Repeat Finding

Background

Gene Annotation Overview

After genome assembly, the process of annotating genes requires several steps.

· Novel repeats need to be identified from the new genome with a program like RepeatScout to prevent genes originating from viruses from being called as rockfish genes. A novel library of repeat sequences is made: a repeat library is a FASTA-formatted file listing the name of each repeated sequence followed by the sequence itself.

· Repeats in genomic DNA are masked (changed into "N"s or lower case) by RepeatMasker with that library. Repeat masking is important to prevent inserted viral exons from being counted as fish exons.

· Extrinsic evidence is gathered. The best data for guiding gene annotation is transcriptome data from the target species. Lacking that data, one can perform an Entrez search at NCBI to see what kind of genomic resources are available for your organism or a closely related species. One can also download entire proteomes of model species from various databases.

· Gene predictors need to be trained for the new organism. Because "what a gene looks like" differs significantly among genomes, gene finders must be "trained" to recognize the signals in each newly sequenced species. Gene predictors such as SNAP and Augustus predict genes by finding intrinsic gene signals in DNA like start/stop codons and intron/exon borders. The gene summary "hmm files" (files ending with the extension ".hmm") summarize what a typical gene looks like in a species. To train the gene predictors, a draft Maker run is performed (more detail on training in chapter 5).

· After masking repeats, gathering evidence, and training gene predictors, Maker can be run on entire genomes. The Maker program is a genome annotation pipeline (Cantarel et al. 2008; Holt and Yandell 2012; figure below) that combines two kinds of information to make structural gene annotations from raw DNA: i) extrinsic "evidence", which is evidence supplied to Maker based on similarity of genomic regions to other organisms' mRNAs (also known as expressed sequence tags, or ESTs) or protein sequences, and ii) gene predictions from intrinsic signals in the organism's DNA found by ab initio (from scratch) gene predictors. Intrinsic evidence includes signals such as start and stop codons and intron-exon boundaries found by the gene predictors. The two gene predictors used here by Maker are SNAP and Augustus. Gene predictions can also be made directly from EST and protein evidence using Est2Genome and Protein2Genome, which refine the ends of BLAST hits into gene models. Maker takes the entire pool of ab initio and evidence-informed gene predictions, updates features such as 5' and 3' UTRs based on EST evidence, tries to determine alternative splice forms where EST data permit, and chooses from among all the gene model possibilities the one that best matches the evidence. Annotations are only produced if they are supported by both BLAST alignments and a gene predictor.

· Gene annotations are stored in GFF3 file format

General Feature Format (GFF3) is a standard file format, meaning that it is supported and used by several programs and appears in the same form regardless of what program generated it. GFF is a tab delimited file featuring nine columns of sequence data.

Column	Feature	Description
1	seqname	The name of the sequence
2	source	The program that created the feature
3	feature	What the feature is (gene, contig, exon, match)
4	start	First base of the feature
5	end	Last base of the feature
6	score	Score of the feature
7	strand	+ or -
8	frame	0, 1, or 2 (first, second, or third reading frame)
9	attribute	List of miscellaneous features not covered in the first 8 columns, each separated by a ";"

Example GFF (shows part of a single-exon Maker gene found on scaffold 1):
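A hypothetical excerpt of such a gene model is shown below; the coordinates and IDs are invented for illustration, and the nine tab-separated columns follow the table above:

scaffold_1	maker	gene	13209	14738	.	+	.	ID=maker-scaffold_1-gene-0.0;Name=maker-scaffold_1-gene-0.0
scaffold_1	maker	mRNA	13209	14738	.	+	.	ID=maker-scaffold_1-gene-0.0-mRNA-1;Parent=maker-scaffold_1-gene-0.0
scaffold_1	maker	exon	13209	14738	.	+	.	Parent=maker-scaffold_1-gene-0.0-mRNA-1
scaffold_1	maker	CDS	13305	14630	.	+	0	Parent=maker-scaffold_1-gene-0.0-mRNA-1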

Repeat Masking

A large fraction of eukaryotic genomes consists of dispersed repeats, including SINEs (short interspersed nuclear elements, ~350 bp) and LINEs (long interspersed nuclear elements, ~6000 bp). Both SINEs and LINEs may contain coding genes from exogenous sources such as retroviruses. These genes need to be identified as exogenous to keep gene finders from confusing them with endogenous genes necessary for the function of the organism.

· The program RepeatScout can be used to find DNA elements repeated more than 10X in the genome and make a draft repeat library.

· The program TEclass characterizes repeat types and labels the repeat library.

· To avoid confusing gene families within the endogenous DNA with repeats from exogenous DNA, unlabeled "unclear" repeats from TEclass are pulled out of the library. Repeats representing members of gene families are identified by BLAST search against protein databases and removed from the repeat library.

· The resulting repeat libraries are combined with known repeats from other organisms for comprehensive repeat identification.

· Once repeats are identified, the genome is searched for these repeats and matching sections are "masked" by converting to lower case (soft masked) or to N (hard masked) in order to prevent gene finders from calling them genes.
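For example, if the 12 bp stretch ACGTACGTACGT were identified as a repeat within a contig (the flanking bases here are made up for illustration):

unmasked:    TTGCAACGTACGTACGTGGCTA

hard masked: TTGCANNNNNNNNNNNNGGCTA

soft masked: TTGCAacgtacgtacgtGGCTA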

RepeatScout

RepeatScout (Price et al. 2005) assembles a custom repeat library for use in RepeatMasker. RepeatScout counts every 12 base pair sequence, then extends those 12 base pair sequences to form consensus sequences. Simple and tandem repeats are then filtered from the preliminary repeat library. The filtered repeat library is then used in RepeatMasker to mask the genome. Repeats found fewer than ten times within the entire genome are also removed. To ensure that conserved protein families are not mislabeled as repeats, BLAST2GO (Conesa et al. 2005) is used to determine similarity between the putative repeats and any known proteins. Finally, repeats are classified using TEclass (Abrusán et al. 2009).

Learning outcomes

You will find and characterize repetitive elements specific to the novel genome to be masked in gene annotation.

Vision and Change core competencies addressed

Ability to use quantitative reasoning: Applying statistical methods to diverse data, Mathematical modeling, Managing and analyzing large data sets

GCAT-SEEK sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, RepeatMasker, RepeatScout, Integrated Genome Viewer

Protocols

RepeatScout

RepeatScout is run in several steps using several different programs. This tutorial assumes you are using a FASTA file named "genome.fa". You could, however, use any FASTA file by simply replacing "genome.fa" with the name of the FASTA file being used. Also, running any of the listed commands without any additional input (or running them with the -h option) will display the help message including a description of the program, the usage, and all options.

1) From your home directory, make a new directory called "4_repeatScout", and copy a FASTA genome and other files including a base qsub file (repeat1.qsub) into the folder.

$cd

$mkdir 4_repeatScout

$cd 4_repeatScout

$cp /share/apps/sharedData/GCAT/EukGenWorkshop16/4_repeatScout/* .

$ll

2) The provided qsub script will run a series of analyses that we discuss step by step below. The qsub script will load the RepeatScout, RepeatMasker, nseg, and perl modules. Make sure to edit your working directory path (the "workdir" line below) using nano, then run the qsub script as usual.

$nano repeat1.qsub

$qsub repeat1.qsub

________________________________________________________________

#!/bin/bash --login

#PBS -N repeatqsub

#PBS -j oe

#PBS -m abe

#PBS -M YOUR_EMAIL_ADDRESS

#PBS -q default

#PBS -l nodes=1:ppn=8,mem=20gb,walltime=01:00:00

module load RepeatMasker

module load RepeatScout

module load perl-5.14.2

module load nseg

workdir=/home/USERNAME/4_repeatScout

cd $workdir

build_lmer_table -sequence genome.fa -freq genome.freq

RepeatScout -sequence genome.fa -freq genome.freq -output genome.firstrepeats.lib

cat genome.firstrepeats.lib | filter-stage-1.prl > genome.filtered.lib

RepeatMasker -pa 8 -lib genome.filtered.lib -nolow genome.fa

cat genome.filtered.lib | filter-stage-2.prl --cat genome.fa.out > genome.secondfilter.lib

_____________________________________

YOU DON’T NEED TO RE-RUN THE FOLLOWING COMMANDS. THESE WERE ALREADY RUN IN THE QSUB SCRIPT ABOVE

The following line will count every 12 base pair sequence (motif) in the genome and output a file genome.freq, that will contain a table of motif counts.

build_lmer_table -sequence genome.fa -freq genome.freq

The following line will run RepeatScout to extend the 12 base pair sequences to form initial consensus repeats. This will take a minute. (Note that if you have a large >50Mbp genome, you may want to increase run time (walltime) to 1000 minutes). It produces a file "genome.firstrepeats.lib".

RepeatScout -sequence genome.fa -freq genome.freq -output genome.firstrepeats.lib

The following line will filter out simple and tandem repeats from the library, and produce "genome.filtered.lib".

cat genome.firstrepeats.lib | filter-stage-1.prl > genome.filtered.lib

The next line will run RepeatMasker to count the number of times each repeat in the filtered library appears in the genome. The output file "genome.fa.out" describes each repeat and "genome.fa.tbl" will contain a column with the count of each repeat type. This command uses 8 processors.

RepeatMasker -pa 8 -lib genome.filtered.lib -nolow genome.fa

*The -pa option tells RepeatMasker how many processors to use. If you are only using one processor, do not use the -pa option. If you were running RepeatMasker on 3 processors, you would use -pa 3.

**The -nolow option stops RepeatMasker from masking low complexity repeats. Since we are only concerned with the number of times the generated repeats appear, masking low complexity repeats simply adds more time.

The next line will filter out any repeats that appear less than 10 times. Note, because our sample data set is so small, not much will be repeated 10 times, so check the results when the first command finishes.

cat genome.filtered.lib | filter-stage-2.prl --cat genome.fa.out > genome.secondfilter.lib

The main output file is genome.secondfilter.lib. Take a look at it using less.

$less genome.secondfilter.lib

Skip steps 3 to 10 below for the practice data

3) Now use the program TEclass to label each repeat type. For repeats labelled "unclear", we will use NCBI BlastX to identify members of legitimate gene families in "genome.secondfilter.lib". TEclass (Abrusán et al. 2009) labels unknown eukaryotic transposable repeat elements in the FASTA repeat library by their mode of transposition (DNA transposons, LINEs, SINEs, LTRs; http://www.bioinformatics.uni-muenster.de/tools/teclass/?lang=en&mscl=0&cscl=0). From the manual:

"We analyze repeats in different size categories: 0-600 bp, 601-1800 bp, 1801-4000 bp, >4000 bp, and built independent classifiers for all these length classes. We use libsvm as the SVM engine, with a Gaussian kernel. The classification process is binary, with the following steps: forward versus reverse sequence orientation > DNA versus Retrotransposon > LTRs versus nonLTRs (for retroelements) > LINEs versus SINEs (for nonLTR repeats). The last step is performed only for repeats with lengths below 1800 bp, because we are not aware of SINEs longer than 1800 bp. Separate classifiers were built for each length class and for each classification step. If the different methods of classification lead to conflicting results, TEclass reports the repeat either as unknown, or as the last category where the clasification methods are in agreement."

The wiki description of transposon repeat classification is at: http://en.wikipedia.org/wiki/Transposon.

4) Go to http://www.bioinformatics.uni-muenster.de/tools/teclass/?lang=en&mscl=0&cscl=0

On the left side of the screen select TEclass/Start Application.

5) Use Cyberduck to move "genome.secondfilter.lib" to your laptop. Click "choose file" and upload "genome.secondfilter.lib"

6) Click run.

7) A "Statistics" result link will give a summary of the types of elements found. A "library" link will provide a downloadable FASTA file. Download and copy the file back into the Linux OS: since it is a small amount of information, it is easier to open a new document in the Linux OS using nano, return to windows, and use Control-A (command-A for mac) to select all, Control-C to copy, and hit right click back into the nano document to paste, then save as "input.lib". Results are comprised of a FASTA file of repeats labeled with their category, named "input.lib". Use grep to pull out "unclear" repeats for BlastX analysis against teleost proteins using NCBI (Blast2GO is slow, but is explained below FYI). The command below finds "unclear" in a FASTA header and grabs the line under it too using the option "-A1".

$grep -A1 "unclear" input.lib > unclear
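One caution: when the matching records are not adjacent in the file, grep -A1 inserts a "--" separator line between each group of hits, which is not valid FASTA. Those separators can be stripped in the same pipeline:

$grep -A1 "unclear" input.lib | grep -v "^--$" > unclear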

8) Transfer the "unclear" repeats file back to your desktop, go to the NCBI BlastX page, upload the unclear repeats, and adjust parameters: search against the "nr" database, restrict organism to "Teleostei", restrict "max number of sequences" to 10, set the matrix to BLOSUM80 (used to find matches among closely related species), and change the E-value (i.e., "Expect threshold") to 1e-6. When finished, a results pull-down box will show whether any repeats had significant hits. (There may be no hits with a very small library.) Once you have identified repeats that are actually real genes, write down their names, go back to Linux, and delete them from "input.lib" using nano. Repeat the BlastX search with the entire initial library from step 7 to make sure no protein families remain in the dataset.

9) So that RepeatMasker can read the identity of the repeats from TEclass, we will need to edit the FASTA header and move pieces around using special Find/Replace techniques.

We need to go from:

"test.lib"

>R=1 (RR=2. TRF=0.000 NSEG=0.000)|TEclass result: unclear|forward

TTAGGTTTAGGCAACAAAACTACTTAGTTAGGTTTAGGAAAAAATCATGGTTTGGGTTAAAATAACT

>R=10 (RR=10. TRF=0.175 NSEG=0.000)|TEclass result: DNA|reverse complemented

TTATTACACGGCTTTGTTGAATACTCGATTCTGATTGGTCAATCACGGCGTTCTACGGTCTGTTA

To:

"test.lib"

>R=1#unclear

TTAGGTTTAGGCAACAAAACTACTTAGTTAGGTTTAGGAAAAAATCATGGTTTGGGTTAAAATAACT

>R=10#DNA

TTATTACACGGCTTTGTTGAATACTCGATTCTGATTGGTCAATCACGGCGTTCTACGGTCTGTTA

· Complete the perl-pie tutorial on the following page and then return here.

$perl -p -i.bak -e 's/^(>R=\d+)(.+:\s+)(\w+)(\|.+)/\1#\3/g' test.lib

Below you see how different segments of the header were captured by \1 through \4. Note that we delete \2 and \4 by not putting them in the "replace" section.

\1 = >R=1
\2 =  (RR=2. TRF=0.000 NSEG=0.000)|TEclass result: 
\3 = unclear
\4 = |forward

10. Make sure it works on the test file, and convert your TEclass library file named "input.lib".
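Assuming the headers in "input.lib" follow the same TEclass pattern as the test file, the conversion is the same one-liner (it also leaves a backup), and head lets you confirm the result:

$perl -p -i.bak -e 's/^(>R=\d+)(.+:\s+)(\w+)(\|.+)/\1#\3/g' input.lib

$head input.lib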

11. Edit the qsub script "repeat1.qsub" from step (2) above to mask the genome file with your final repeat library ("input.lib" from step (10), or "genome.secondfilter.lib" if you skipped steps 3 to 10). Open "repeat1.qsub" with nano, delete each line after "cd $workdir" using Control-K, replace that text with the following line, and save under the new name "repeat2.qsub" to keep your earlier work. Then run the qsub script as usual. Check out what the "-xsmall" option does by loading the RepeatMasker module on the head node ("module load RepeatMasker") and typing "RepeatMasker -h". Note that when repeat masking a full genome you will want to concatenate your novel repeats with a repeat library from the nearest model organism.

RepeatMasker -pa 8 -xsmall -lib genome.secondfilter.lib genome.fa

12. To view the length and frequency of repeats, open "genome.fa.tbl". Total length, %GC, and percent bases masked (i.e. percent in repeats) are shown. Also shown are number and total length occupied by different kinds of repeats.

13. Make a copy of the genome file using "cp genome.fa genome2.fa", then edit the qsub file again to use a model organism's repeat library to mask repeats. Save the edited qsub file as repeat3.qsub

RepeatMasker -pa 8 -xsmall -species fugu genome2.fa

14. Describe and explain the differences between genome.fa.tbl and genome2.fa.tbl

Perl-pie Tutorial

We will use a ‘perl-pie’ to make the edits from the Linux command line. In general, perl-pies work like this:

$perl -p -i.bak -e 's/find/replace/g' filename

-e tells perl to execute the one-line program that follows, -p loops over each line of the file and prints the (possibly edited) result, and -i.bak edits the file in place while creating a backup copy with the extension .bak

· Find and replace uses regular expression degeneracies (also see table below):

\d represents a digit, \s whitespace, and \w a word character; + represents one or more of the preceding type of character. "." represents any character and "^" is the start of a line.

Some regular expressions:

*	Zero or more of the preceding character
.	Any character other than a new line (\n)
\t	Tab
\s	Whitespace (space, tab, new line)
\d	Any digit
^	Beginning of line
$	End of line
\n	New line

See also: www.cheatography.com/davechild/cheat-sheets/regular-expressions/

regexlib.com/CheatSheet.aspx

· The find pattern captures text within parenthesis marks (); the replace section can re-insert the captured pieces, in the order they were captured, as \1, \2, etc. (a short demonstration follows below).
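A tiny demonstration of captures, run directly on the command line (no file needed), swapping two words:

$echo "hello world" | perl -p -e 's/(\w+) (\w+)/\2 \1/'

world hello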

· Edit the following file "test.txt", which contains excerpts from a memorable dialog between my 5th grade son and one of his classmates after he had to cut off e-mail correspondence to go to baseball practice.

Juneleese: What did I do to deserve this?

James:

Juneleese is spelled wrong. Let's change it to Junelyse, while creating a backup file called test.txt.1bak

perl -p -i.1bak -e 's/Juneleese/Junelyse/g' test.txt

head test.txt

head test.txt.1bak

Now, let’s delete everything that Junelyse said. In the command below, the "." is a regular expression that represents any character except a line-ending character, and "+" means one or more of them. "\1" represents the text captured in the first (and, in this example, only) set of parentheses. The beginning and end of lines are represented by ^ and $, respectively. Including them helps perl capture the right text in complicated find/replace commands.

perl -p -i.2bak -e 's/^(Junelyse:).+$/\1/g' test.txt

head test.txt

Now, let’s suggest they both just say "Hi". Below, "\w" represents any word character (a letter, digit, or underscore) and "+" one or more of them in a row.

perl -p -i.3bak -e 's/^(\w+:)$/\1Hi/g' test.txt

head test.txt

Let’s change Hi to Hello

perl -p -i.4bak -e 's/Hi/Hello/g' test.txt

head test.txt

Let’s delete the first two characters of each line, just for fun

perl -p -i.5bak -e 's/^..//g' test.txt

head test.txt

Let’s say that somewhere along the line we made a mistake and want to start over. We will write over the latest file with the first backup and delete the other backups

mv test.txt.1bak test.txt

rm *bak

head test.txt

· Edit the following file test.fa

>scaffold_1

GCaTA

GCATg

GCAtC

>scaffold_2

G