
Creating and Using Genome Assemblies Tutorial

Release 8.1

Golden Helix, Inc.

March 18, 2014

Contents

1. Create a Genome Assembly for Danio rerio

2. Building Annotation Sources
A. Creating a Reference Sequence
B. Creating a Gene Annotation
C. Visualizing the Annotation Sources

3. Create a ‘Fake’ Genome Assembly for any Grouping Variable


Updated: February 26th, 2014

Level: Advanced

Packages: All Packages of SVS

Currently there are several available genome assemblies within SVS 8, including the human, cattle and soybean genomes. If you go to Tools > Manage Genome Assemblies and select Download from Golden Helix, you will see the assemblies that are currently available for use in SVS 8.

But what if you are studying Zebrafish and you find that there is no genome assembly available for the most recent build (Zv9) of the species Danio rerio (Zebrafish)? If you have the necessary information available, or you are willing to locate it independently, you will find that it is simple and straightforward to create your own genome assembly in SVS. Keep in mind that locating the information can be difficult and there is no hard and fast rule for accomplishing this. Let’s first work through the process using the Zebrafish research scenario, then we will look at how you can create your own ‘fake’ genome assembly for any grouping variable.

Requirements

To complete this tutorial you will need the following:

Download Obesity.zip

We hope you enjoy the experience and look forward to your feedback.


http://www.goldenhelix.com/Downloads/login_noheader.html?product=SVS&view=http://doc.goldenhelix.com/SVS/tutorials/create_genome_assembly/Obesity.zip&iframe=true&width=580&height=400

1. Create a Genome Assembly for Danio rerio

To create a genome assembly for any species you need a listing of the assembled chromosomes and/or scaffolds along with their corresponding lengths for the particular genome build.

Note: If you will also be creating a Reference Sequence for your species, the assembly file can be created automatically in SVS 8 by the Convert Source Wizard, which uses the FASTA reference file to determine the length of each defined segment. You can skip to Part 2 of this tutorial for an example.

If you were to google ‘Zebrafish genome’ you may find this page: http://uswest.ensembl.org/Danio_rerio/Info/Index. If you click on the More information and statistics link under the Genome assembly: Zv9 header and then click the GenBank Assembly ID GCA_000002035.2 link, you will be directed to the NCBI website for this genome assembly. Under the Assembly Statistics tab you will see a listing of the assembled chromosomes with the corresponding lengths for this species.

Figure 1. Danio rerio genome

• Open a new project in SVS or open a project you already have that contains genomic information for Zebrafish. From the project navigator, choose Tools > Manage Genome Assemblies or type Ctrl + G. Click User Genome Assemblies Folder. If you have not created any assemblies the folder should be empty.

• Right-click in the empty folder and select New > Text Document to create an empty text file and name it appropriately with species name and genome build. For this example choose the name Danio_rerio_Zv9.assembly.


http://doc.goldenhelix.com/SVS/tutorials/create_genome_assembly/annotationSources.html

http://uswest.ensembl.org/Danio_rerio/Info/Index

http://www.ncbi.nlm.nih.gov/assembly/GCF_000002035.4/


Open the text file that you just created.

• The header lines of the file are a summary of the genome build information for this species along with relevant build dates and should look as follows.

{
  "coordinates" : "Zv9,Chromosome,Danio rerio",
  "build" : "Zv9",
  "common" : [ "Zebrafish", "bony fishes" ],
  "taxID" : "7955",
  "genBankID" : "GCA_000002035.2",
  "refSeqID" : "GCF_000002035.4",
  "date" : "2010-07-15",
  "modified" : "2013-12-30",

• The value of the build attribute is a way for SVS to refer to your genome assembly within the program. The name should be unique, as it is generally only used for disambiguation when multiple genome assemblies apply to the same coordinate system. The value of the coordinates attribute is used to identify the coordinate system your genome assembly refers to. The value is formatted according to the Distributed Annotation System (DAS) specification in three parts separated by commas: authority,type,species. The authority in this case matches the genome build, Zv9. For best results you should find the appropriate entry in the DAS Registry coordinate system list.

Note: If the assembly you are working with is not represented in the list, something reasonable can be invented. Keep in mind that if the value of the coordinates attribute is user invented, it may not match annotations and other data available elsewhere.

• The optional entries for common, taxID, genBankID, and refSeqID may be provided to help SVS categorize your genome assembly. Multiple common entries may be provided with different values for different common names for the species. The taxID refers to the unique Taxonomy ID which is assigned to each species. For the zebrafish assembly it can be found by clicking the Taxonomy link under the Related Information header on the right side of the NCBI assembly page, which will direct you to the Taxonomy Browser website.

• The date entry provides the release date for the represented assembly and the modified entry represents the current date. These values must be formatted YYYY-MM-DD.

• The remainder of the assembly file contains segment information. This will be either chromosome or scaffold information depending on the species. For the case of the Zebrafish genome there will be a row listed for each chromosome along with its corresponding length in base pairs, as shown earlier from the main NCBI webpage for this genome build (Figure 1). Add the chromosome information with the correct formatting, including all commas, quotation marks and brackets, as follows:

{
  "coordinates" : "Zv9,Chromosome,Danio rerio",
  "build" : "Zv9",
  "common" : [ "Zebrafish", "bony fishes" ],
  "taxID" : "7955",
  "genBankID" : "GCA_000002035.2",
  "refSeqID" : "GCF_000002035.4",
  "date" : "2010-07-15",
  "modified" : "2013-12-30",
  "segment" : [
    { "name" : [ "1" ], "length" : 60348388, "type" : "autosome" },
    { "name" : [ "2" ], "length" : 60300536, "type" : "autosome" },
    { "name" : [ "3" ], "length" : 63268876, "type" : "autosome" },
    { "name" : [ "4" ], "length" : 62094675, "type" : "autosome" },
    { "name" : [ "5" ], "length" : 75682077, "type" : "autosome" },
    { "name" : [ "6" ], "length" : 59938731, "type" : "autosome" },
    { "name" : [ "7" ], "length" : 77276063, "type" : "autosome" },
    { "name" : [ "8" ], "length" : 56184765, "type" : "autosome" },
    { "name" : [ "9" ], "length" : 58232459, "type" : "autosome" },
    { "name" : [ "10" ], "length" : 46591166, "type" : "autosome" },
    { "name" : [ "11" ], "length" : 46661319, "type" : "autosome" },
    { "name" : [ "12" ], "length" : 50697278, "type" : "autosome" },
    { "name" : [ "13" ], "length" : 54093808, "type" : "autosome" },
    { "name" : [ "14" ], "length" : 53733891, "type" : "autosome" },
    { "name" : [ "15" ], "length" : 47442429, "type" : "autosome" },
    { "name" : [ "16" ], "length" : 58780683, "type" : "autosome" },
    { "name" : [ "17" ], "length" : 53984731, "type" : "autosome" },
    { "name" : [ "18" ], "length" : 49877488, "type" : "autosome" },
    { "name" : [ "19" ], "length" : 50254551, "type" : "autosome" },
    { "name" : [ "20" ], "length" : 55952140, "type" : "autosome" },
    { "name" : [ "21" ], "length" : 44544065, "type" : "autosome" },
    { "name" : [ "22" ], "length" : 42261000, "type" : "autosome" },
    { "name" : [ "23" ], "length" : 46386876, "type" : "autosome" },
    { "name" : [ "24" ], "length" : 43947580, "type" : "autosome" },
    { "name" : [ "25" ], "length" : 38499472, "type" : "autosome" },
    { "name" : [ "Un" ], "length" : 55413200, "type" : "autosome", "visible" : "data" },
    { "name" : [ "MT" ], "length" : 16596, "type" : "mitochondrial", "visible" : "data" }
  ]
}

http://www.dasregistry.org/das1/coordinatesystem

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?ID=7955
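Typing every segment row by hand invites comma and bracket mistakes. As a minimal sketch (this is not an SVS feature), the same text can be generated with Python's json module from a table of names and lengths. The function name build_assembly and its special argument are hypothetical helpers invented for this illustration, only three of the segments are shown, and we assume SVS accepts any valid JSON layout.

```python
import json

def build_assembly(header, lengths, special=None):
    """Return assembly-file text from header fields and {name: length}.

    `special` (hypothetical helper argument) maps a segment name to extra
    fields that override the default "autosome" type.
    """
    special = special or {}
    segments = []
    for name, length in lengths.items():
        seg = {"name": [name], "length": length, "type": "autosome"}
        seg.update(special.get(name, {}))
        segments.append(seg)
    doc = dict(header)
    doc["segment"] = segments
    return json.dumps(doc, indent=2)

header = {
    "coordinates": "Zv9,Chromosome,Danio rerio",
    "build": "Zv9",
    "common": ["Zebrafish", "bony fishes"],
    "taxID": "7955",
    "genBankID": "GCA_000002035.2",
    "refSeqID": "GCF_000002035.4",
    "date": "2010-07-15",
    "modified": "2013-12-30",
}
# Only three segments here; the full list is given in the file above.
lengths = {"1": 60348388, "2": 60300536, "MT": 16596}
special = {"MT": {"type": "mitochondrial", "visible": "data"}}

assembly_text = build_assembly(header, lengths, special)
print(assembly_text)
```

The printed text could then be saved as Danio_rerio_Zv9.assembly in the User Genome Assemblies Folder.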

• If the chromosome names for your species and build are not the standard 1, 2, 3, etc., then you will want to include alias names along with the standard names for that chromosome. For example, if your species labeled chromosome 1 as ch01 then you can include it as an alias as follows.

{ "name" : [ "1", "ch01" ], "length" : 60348388, "type" : "autosome" },

Note: For computing index and coverage information for BAM files loaded into a GenomeBrowse window, SVS must be able to identify the corresponding reference sequence to be used in the computation. For this purpose SVS matches the BAM header information against the Genome Assembly files that are saved locally on your machine. The chromosome names and lengths must match exactly between the two for the correct reference sequence to be identified.

• Following each length entry should be a chromosome type designation. Options for these entries include autosome, allosome, and mitochondrial.

• The last, optional entry for each segment is a visible entry, which lets you show certain segments in a full genome view only if there is data listed in that location. Options for these entries include always, never and data. If nothing is listed, the default of always showing that region in a genome-wide zoom is assumed.
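The formatting rules above are easy to get wrong in a hand-edited file. The following is a minimal sketch of a checker, assuming the file is plain JSON and that date, modified and type are present in every file; check_assembly is a hypothetical name and is not part of SVS.

```python
import json
import re

VALID_TYPES = {"autosome", "allosome", "mitochondrial"}
VALID_VISIBLE = {"always", "never", "data"}
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def check_assembly(text):
    """Return a list of problems found in assembly-file text."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    problems = []
    # coordinates must have three comma-separated parts
    if len(doc.get("coordinates", "").split(",")) != 3:
        problems.append("coordinates must be authority,type,species")
    for key in ("date", "modified"):
        if not DATE_RE.fullmatch(doc.get(key, "")):
            problems.append(f"{key} must be formatted YYYY-MM-DD")
    for seg in doc.get("segment", []):
        label = seg.get("name", ["?"])[0]
        if not isinstance(seg.get("length"), int):
            problems.append(f"segment {label}: length must be an integer")
        if seg.get("type") not in VALID_TYPES:
            problems.append(f"segment {label}: unknown type")
        # a missing visible entry defaults to "always"
        if seg.get("visible", "always") not in VALID_VISIBLE:
            problems.append(f"segment {label}: unknown visible value")
    return problems

good = '''{"coordinates": "Zv9,Chromosome,Danio rerio", "build": "Zv9",
 "date": "2010-07-15", "modified": "2013-12-30",
 "segment": [{"name": ["1", "ch01"], "length": 60348388, "type": "autosome"}]}'''
bad = good.replace("2010-07-15", "15/07/2010").replace("autosome", "chromosome")

print(check_assembly(good))
print(check_assembly(bad))
```

An empty list means none of the checked rules were violated; it does not guarantee that SVS will accept the file.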

Save the file, then close and reopen SVS. Now you should be able to use this assembly within a GenomeBrowse window in SVS by selecting Danio rerio (Zebrafish), Zv9 (Jul 2010) from the genome build drop-down menu.
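Before loading BAM files against the new assembly, you may want to check the exact-match requirement from the note above yourself. The sketch below compares the @SQ lines of a BAM header (for example, the text printed by samtools view -H) with the segment list. The function names are hypothetical, the header text is a made-up example, and treating alias names as valid matches is an assumption about how SVS compares names.

```python
def parse_sq_lines(header_text):
    """Extract {sequence name: length} from @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths

def mismatches(bam_lengths, segments):
    """List BAM sequences with no segment of the same name and length."""
    known = {}
    for seg in segments:
        for name in seg["name"]:  # standard name plus any aliases (assumed)
            known[name] = seg["length"]
    return [(name, length) for name, length in bam_lengths.items()
            if known.get(name) != length]

# Made-up header: ch01 agrees with the assembly, MT has a different length.
header_text = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:ch01\tLN:60348388\n"
    "@SQ\tSN:MT\tLN:16569\n"
)
segments = [
    {"name": ["1", "ch01"], "length": 60348388, "type": "autosome"},
    {"name": ["MT"], "length": 16596, "type": "mitochondrial", "visible": "data"},
]
print(mismatches(parse_sq_lines(header_text), segments))
```

Any sequence reported here differs from the assembly in name or length and would need to be reconciled before the reference sequence can be identified.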


2. Building Annotation Sources

New in SVS 8 is the Convert Source Wizard!

• Open SVS and go to Tools > Manage Data Sources to open the Data Source Library.

Figure 2-1. Data Source Library

• Click the Convert... button on the bottom left of the dialog to open the wizard.

Note: Full documentation on this new tool can be found in the SVS manual or by selecting the Help button on the dialog.

http://doc.goldenhelix.com/SVS/latest/convert_source_wizard.html


Figure 2-2. Convert Source Wizard



A. Creating a Reference Sequence

An allele reference sequence source can be built for any species where there is an available DNA sequence (FASTA) file.

Download the available FASTA file for the Zv9 assembly from the Ensembl FTP site.

• Step 1: Click the Add button on the Define Input page of the Convert dialog, navigate to the downloaded FASTA file and select the *.fa.gz file. Then click Next >.

• Step 2: The converter will scan the file to come up with a list of the chromosomes (or scaffolds) that are included in the FASTA and determine the length of each segment. It will also attempt to match the information found to an existing assembly file.

• Step 3: If a genome assembly match was found, the next Change Options screen will show it in the Genome Assembly (Build): drop-down box. For this data we have already created the assembly file, but the chromosome names in the FASTA file do not yet match.

• We will need to rename the segments using the option at the bottom of the dialog before it will correctly match to the Danio rerio Zv9 assembly.

• To rename, select RegExp from the drop-down and type (.*) dna(.*) in the first box and \1 in the second. It should look like Figure 2-3.

Figure 2-3. Assembly match by renaming segments

• If you scroll down the segment list you will start to see some additional segments that were not included in the assembly file (unmapped scaffolds). In this case we do not want to include them in the reference sequence, so right-click on the Use column header and select Uncheck Unmapped, then click Next >.

ftp://ftp.ensembl.org/pub/release-74/fasta/danio_rerio/dna/Danio_rerio.Zv9.74.dna.toplevel.fa.gz

Note: SVS has an upper limit of 5000 segments that can be included. The wizard will scan all the available segments in the FASTA file but only allow the longest 5000 to be selected for inclusion in the reference sequence source.

Note: If no match to an existing assembly file is found, you can have the wizard create a new assembly based on the segments and lengths determined from the FASTA data. You will just need to select <Create New> from the genome build drop-down and fill in the required build information.

• The next window is for labeling the data source and documenting the conversion process. At minimum you will want to select an informative Name: for the source, then click Next >.

Note: For data sources curated by Golden Helix we will fully document the source of the data, including any citations that are required by the provider. See Figure 2-4 for an example.

Figure 2-4.

• Step 4: For the last window you can select a location to save the created source. By default your SVS User Annotation Folder will be selected.

• Click Convert to create the reference sequence.
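To see what the scan in Step 2 and the renaming in Step 3 amount to, the sketch below reads a FASTA file, applies the same (.*) dna(.*) to \1 substitution to each header, and totals the sequence length per segment. It is an illustration, not the wizard's implementation: the function names are hypothetical, the Ensembl-style header layout in the tiny demo file is assumed, and SVS's regular expression syntax may differ in detail from Python's.

```python
import gzip
import os
import re
import tempfile

def segment_lengths(path, pattern=r"(.*) dna(.*)", replacement=r"\1"):
    """Return {renamed segment: length} for a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Header line: rename it, then start counting bases.
                name = re.sub(pattern, replacement, line[1:])
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line.strip())
    return lengths

def longest(lengths, limit=5000):
    """Keep only the `limit` longest segments, mirroring the 5000 cap."""
    ranked = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:limit])

# Tiny made-up FASTA with assumed Ensembl-style headers.
demo = (
    ">1 dna:chromosome chromosome:Zv9:1:1:8:1\nACGT\nACGT\n"
    ">MT dna:chromosome chromosome:Zv9:MT:1:3:1\nACG\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(demo)
lengths = segment_lengths(tmp.name)
os.remove(tmp.name)

print(lengths)
print(longest(lengths, limit=1))
```

The first group, (.*), captures everything before " dna", so \1 keeps only the bare segment name that the assembly file uses.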



B. Creating a Gene Annotation

A gene annotation track can be built for any species where there is an available gene annotation file. Supported file formats are Delimited Text, GTF, and GFF.

Download the available GTF file for the Zv9 assembly from the Ensembl FTP site.

• Step 1: Click the Add button on the Define Input page of the Convert dialog, navigate to the downloaded GTF file and select the *.gtf.gz file. Then click Next >.

• Step 2: The converter will scan the file to come up with a list of the chromosomes (or scaffolds) that can be used to match the information found to an existing assembly file.

• Step 3: The first screen will be a listing of the fields found in the file along with their type. You can select which fields to include in the track and change the type if necessary. For this set we will leave the default options (Figure 2-5) and then click Next >.

Figure 2-5. Plot Type and Output Options Window

• If a genome assembly match was found, the next Change Options screen will show it in the Genome Assembly (Build): drop-down box. For this dataset it should match to the correct Zebrafish assembly we have built. There are still a number of unmapped scaffolds that we will not include in the track.

• Right-click on the Use column and select Uncheck Unmapped, then click Next >.

• On the next screen fill in any documentation for the track (Figure 2-6) and click Next >.


ftp://ftp.ensembl.org/pub/release-74/gtf/danio_rerio/Danio_rerio.Zv9.74.gtf.gz


Figure 2-6. Gene Track Documentation

• Step 4: For the last window select a location to save the created source. An additional feature that is available with gene annotation sources is the ability to index certain fields. The indexing makes searching for those values in the GenomeBrowse plot window much faster. In this case leave the default Gene Name and Transcript Name fields to be indexed (Figure 2-7) and click Convert.
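Why indexing helps can be illustrated with a small sketch: once a lookup table from gene name to location has been built, finding a gene no longer requires scanning every record. The code below is not how SVS stores its index; index_gtf is a hypothetical function, and the three GTF records use invented names and coordinates laid out in the standard nine tab-separated GTF columns.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def index_gtf(lines, field="gene_name"):
    """Map each value of `field` to (seqname, min start, max end)."""
    spans = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        seqname, start, end = cols[0], int(cols[3]), int(cols[4])
        attrs = dict(ATTR_RE.findall(cols[8]))
        key = attrs.get(field)
        if key is None:
            continue
        if key in spans:
            # Widen the stored span to cover this record as well.
            _, lo, hi = spans[key]
            start, end = min(lo, start), max(hi, end)
        spans[key] = (seqname, start, end)
    return spans

# Invented records for illustration only.
demo_lines = [
    '1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "demoA";',
    '1\tdemo\texon\t300\t450\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "demoA";',
    '2\tdemo\texon\t50\t80\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; gene_name "demoB";',
]
gene_index = index_gtf(demo_lines)
print(gene_index["demoA"])
```

Passing field="transcript_id" would build the corresponding lookup for transcripts instead.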

C. Visualizing the Annotation Sources

Now that the tracks have been created they can be used in SVS for analysis or just for visualization.

• Open a new GenomeBrowse window by going to Tools > New GenomeBrowse Window.

• Select the Danio rerio (Zebrafish), Zv9 (Jul 2010) assembly from the genome assembly drop-down menu, then click Add.

• Select both of the created sources, Ensembl Genes 74, Ensembl and Reference Sequence Zv9, Ensembl, and then click Plot & Close.

• You can zoom into different features or type in any Zebrafish gene name to jump to that location. For example, type GCNT7 in the location bar to automatically zoom into this region (Figure 2-8).

• If you hover your mouse over Exon 1 of the gene and scroll up, you can zoom in and see the proteins that make up the exon of the gene annotation source as well as the nucleotides that make up the reference sequence at that location.



Figure 2-7. Index Field Options

Figure 2-8. GCNT7 Gene View


3. Create a ‘Fake’ Genome Assembly for any Grouping Variable

Let’s say we have a phenotypic dataset with 500 samples and columns for the subject’s age, the state in which they live, and their weight. Of course we know that weight typically increases with age and also that some states have a higher prevalence of obesity. Because of this we may want to plot the weight variable in the genome browser and separate the data by state, similar to separating by chromosome in a genotypic dataset.

To continue, you will need the dataset downloaded at the beginning of this tutorial. From an open project go to Import > Text. Browse to the download location and select the obesity.csv file. Then click Open, leave the default settings and click OK.

• Now, open the obesity Dataset - Sheet 1.

• Choose File > Create Marker Map from Spreadsheet.

• For Select marker name column: choose Row Labels (sampleID), for Select chromosome column: choose state, and for the Select position column: choose age. Enter US State Map as the New Marker Map Name.

• Your window should match Figure 2. Click Next >.

• Leave weight: in the Create Marker Map Parameters – Step Two window checked and click Create to create the marker map.

Figure 2. Create marker map from spreadsheet window
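The column choices above reuse the three parts of a marker map for non-genetic data: the sample ID plays the role of the marker name, the state plays the role of the chromosome, and the age plays the role of the position. A minimal sketch of that mapping follows; the marker_map function is hypothetical and the three rows are invented, with column headers assumed to match the names shown in the dialog.

```python
import csv
import io

def marker_map(rows, name_col="sampleID", chrom_col="state", pos_col="age"):
    """Build (marker name, 'chromosome', position) records from table rows."""
    return [(row[name_col], row[chrom_col], int(row[pos_col])) for row in rows]

# Invented rows; the real obesity.csv has 500 samples.
demo_csv = io.StringIO(
    "sampleID,age,state,weight\n"
    "S1,34,Montana,172\n"
    "S2,51,Utah,190\n"
    "S3,29,Montana,155\n"
)
rows = list(csv.DictReader(demo_csv))
for marker in marker_map(rows):
    print(marker)
```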



Now you need to apply the map. From the obesity Dataset – Sheet 1 spreadsheet choose File > Apply Genetic Marker Map and choose the one we just created. Make sure that you select Row labels under Marker Names Are at the bottom of the window and click OK (see Figure 3). Another dialog window will pop up allowing us to enable default marker map fields. For now, leave all three checked and click OK. A new mapped sheet will be created.

Figure 3. Select A Genetic Marker Map Dialog Window

Next, close the spreadsheets, and from the Project Navigator choose Tools > Manage Genome Assemblies, or press Ctrl-G. In the Manage Genome Assemblies window, select From Marker Mapped Spreadsheet... Choose the obesity Dataset – Mapped Sheet 1 and click OK. For the Name:, type US State ‘Genome’ and for the Build:, type States in the first field and US States in the second (see Figure 4). Click OK and a new genome assembly will be created. Close the Manage Genome Assemblies window.
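Conceptually, the assembly built from a mapped spreadsheet needs one segment per distinct ‘chromosome’ value, here one per state. How SVS chooses the segment lengths is not described in this tutorial, so the sketch below simply assumes that each segment is as long as the largest position (age) seen for that state and labels every segment as an autosome. The function name and the sample records are hypothetical.

```python
def segments_from_markers(markers):
    """Group (name, chromosome, position) records into assembly segments.

    Assumption: each distinct chromosome label becomes one segment whose
    length is the largest position seen for that label.
    """
    longest = {}
    for _name, chrom, pos in markers:
        longest[chrom] = max(pos, longest.get(chrom, 0))
    return [{"name": [chrom], "length": length, "type": "autosome"}
            for chrom, length in sorted(longest.items())]

# Invented marker records: (sample ID, state, age).
markers = [("S1", "Montana", 34), ("S2", "Utah", 51), ("S3", "Montana", 29)]
print(segments_from_markers(markers))
```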

Now we can see how this was useful!

• Open the obesity Dataset - Mapped Sheet 1.

• Right-click on the weight column header and choose Plot Variable in GenomeBrowse.

• At the top of the GenomeBrowse window, you should see a drop-down menu that currently says Homo sapiens (Human), GRCh37hg19 (Feb 2009).

• Click the arrow on the right side of the box and find the genome assembly that we just created, States, US States. Now your data should be visible in the plot viewer.



Figure 4. Specify Genome Assembly Build Information

• To separate by ‘chromosome’, or in this case State, click on the weight node in the Plot Tree and, on the Display tab of the Controls window, select Chromosome under the Style By: drop-down.

Figure 5. Plot of weight grouped by state

The resulting plot should look like Figure 5. If you also want to be able to see the values for each data point, you can open a Feature List for the plot by right-clicking anywhere on the graph and selecting Feature List.


Figure 6. Plot with Feature List
