
LARGE-SCALE BIOLOGY ARTICLE

xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud [OPEN]

Jon Duvick,a Daniel S. Standage,b Nirav Merchant,c and Volker P. Brendel d,1

a Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, Iowa 50011
b Department of Biology, Indiana University, Bloomington, Indiana 47405
c Bio Computing Facility, University of Arizona, Tucson, Arizona 85721
d Department of Biology and School of Informatics and Computing, Indiana University, Bloomington, Indiana 47405

ORCID IDs: 0000-0003-4329-2712 (J.D.); 0000-0003-0342-8531 (D.S.S.); 0000-0002-8055-7508 (V.P.B.)

Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases, including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching.

INTRODUCTION

The number of sequenced eukaryotic genomes is increasing rapidly due to advances in sequencing technology and cost-effectiveness; for recent lists, see https://gold.jgi.doe.gov (Reddy et al., 2015) and http://www.diark.org/diark (Hammesfahr et al., 2011). However, the pace of data acquisition leads to bottlenecks at both assembly and annotation stages before the sequence data can be consumed for research. In particular, annotating a novel genome is often challenging due to our incomplete knowledge of what constitutes a gene across a wide range of species, meaning that ab initio gene prediction, although useful, is inadequate (Yandell and Ence, 2012). Full genome annotation typically consists of, at minimum, (1) optionally repeat masking the genome; (2) splice-aligning transcripts and proteins from related species for evidence-based gene structure prediction; (3) using ab initio gene finding algorithms to annotate possible gene structures; (4) combining the above data sources to create a set of possible gene structures; and (5) filtering the results through quality and/or similarity filters to find the most probable set of structures that represent full-length or near-full-length coding genes. As a result, genome annotation is necessarily a time-consuming and computationally intensive process that combines numerous types of sequence analysis and heuristic prediction, typically relying on well-annotated genomes as a reference and typically resulting in a far from perfect (but arguably useful) draft annotation. A number of groups have published complete computational pipelines for eukaryotic genome annotation (Mungall et al., 2002; Potter et al., 2004; Uberbacher et al., 2004; Cantarel et al., 2008; Foissac et al., 2008; Holt and Yandell, 2011; Specht et al., 2011; Grigoriev et al., 2012; Leroy et al., 2012; Thibaud-Nissen et al., 2013; Hoff et al., 2015). However, these pipelines require considerable expertise to install, configure, troubleshoot, and manage. We propose that a "turnkey" genome annotation system could greatly benefit researchers who desire a credible draft genome annotation to facilitate further research, as well as foster comparative genomics as early as possible in the life of their project. Among the desirable attributes of such a system would be the following (as described in more detail below): easy configuration, easy to use, editable, reproducible, scalable, and publishable.

An annotation workflow will necessarily combine a wide range of computational tools whose successful configuration and interoperability would be challenging for the nonspecialist, so ideally it should be available as a precompiled package. A common method for packaging and distributing such a complex system is via a virtual machine (VM), which encapsulates the underlying server operating system and the application software components, along with all requisite software dependencies and configuration settings, all of which are stored ("imaged") in such a way that they can be copied and launched by means of commonly available virtualization tools and made available to anyone with access to virtual server software such as KVM (http://www.linux-kvm.org) or VirtualBox (https://www.virtualbox.org). VMs have a number of advantages for complex informatics analysis (Nocq et al., 2013), of which the preinstallation of all required software for complex tasks, as well as temporary access to all the computer resources needed for completion of the task, are of most practical value for a typical biologist user. Cloud computing platforms such as OpenStack (https://www.openstack.org) and Docker (https://www.docker.com) offer VM and container-based technologies that can be managed, accessed remotely, and readily deployed on commercial cloud-based services such as Amazon Web Services (https://aws.amazon.com). Government-funded consortia such as the iPlant Collaborative (now CyVerse) (Goff et al., 2011) make such virtual platforms readily accessible to individual users via the internet.

1 Address correspondence to vbrendel@indiana.edu.
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantcell.org) is: Volker P. Brendel (vbrendel@indiana.edu).
[OPEN] Articles can be viewed without a subscription.
www.plantcell.org/cgi/doi/10.1105/tpc.15.00933

The Plant Cell, Vol. 28: 840–854, April 2016, www.plantcell.org © 2016 American Society of Plant Biologists. All rights reserved.

Although most genome researchers are familiar with a wide range of online tools to evaluate sequence data, they will not necessarily know how to put them together and configure them appropriately. Ideally, an annotation platform should have a cohesive graphical user interface (GUI) that guides the user through setup, configuration, parameter setting, and status reporting. Importantly, all setup and processing steps should be managed with data sanity checks (for completeness and format), context-dependent menus, error logging and reporting, and help documentation/tutorials.

The ability to edit and improve automated annotation should be built in. This means the ability both to add data once the workflow has completed and to modify individual annotations, so that the most critical regions of the genome are well annotated.

With variable parameters and source data sets, automated documentation and simple archiving are essential for ensuring repeatability of the genome annotation process.

With large genomes and large transcript data sets, computations such as spliced alignment can take days or weeks on a typical lab computer, whereas with access to high-performance computing (HPC) resources, the process can be completed in a few hours. Many research facilities have such resources, but their use is complex and they are not necessarily available to every interested researcher.

Once computation is complete, the annotated genome and its input/output files should be available online, either to a select community (with password access) or to the research community as a whole, thus placing output data and/or community annotation tools in the hands of the target audience in a timely manner.

With the above attributes in mind, we created a self-contained genome annotation platform, xGDBvm, for use by the research community. We report below our initial release of xGDBvm in the iPlant (CyVerse) Atmosphere cloud infrastructure (http://www.iplantcollaborative.org/ci/atmosphere) as an on-demand virtual server for genome annotation that can be adapted for a wide range of research needs.

RESULTS

Overview of xGDBvm

xGDBvm is a Linux-based platform that accepts genomic and transcript and/or protein sequence inputs and creates a genome annotation that can be displayed in the included full-featured genome browser, with separate tracks for genome segments, transcript and protein alignments, gene predictions, and repeat-masked regions (Figure 1). xGDBvm uses a modified and extended version of the xGDB (Extensible Genome Data Broker) Web platform (Schlueter et al., 2006), written in Perl and PHP, along with a Web server, workflow automation scripts, and executables, packaged together as a virtual server and configured for access over HTTP or HTTPS via a GUI. xGDBvm is compact in size, occupying 13 gigabytes (GB) of a typical 20-GB VM root partition. Data inputs/outputs are preferably stored on external volumes mounted to the VM, thus alleviating constraints on VM size.

Computational processes in xGDBvm (Figure 2) are managed by automated, user-configurable workflows with a built-in option for calls to HPC resources. Optional masking of genome segments is performed using Vmatch (Abouelhoda et al., 2002) based on user-provided masking libraries. Spliced alignments of transcripts and proteins to the genome are computed using GeneSeqer (Usuka et al., 2000) and GenomeThreader (Gremme et al., 2005), respectively. xGDBvm optionally creates gene model predictions using CpGAT (Comprehensive Gene Annotation Tool; http://plantgdb.org/AtGDB/cgi-bin/WebCpGAT.pl), a set of scripts and binaries that integrates spliced alignment data and ab initio gene predictions, along with BLAST similarity filters and alternative structures, to derive a high-quality gene prediction data set. The xGDBvm workflow can also upload precomputed gene predictions from a user-provided GFF3-formatted file. All steps are logged and displayed dynamically during workflow operation. Once complete, each feature is displayed as a separate track in a fully featured genome browser, complete with search/download tools and tabular feature views. A quality score assigned to each annotated locus facilitates the identification of low-quality models, which can then be reannotated and curated using the built-in yrGATE annotation tool (Wilkerson et al., 2006). Additional genomes can be configured and created with the same VM, and the user can archive and retrieve single or global data sets. Any data type can be appended or replaced using an "update" feature. The outcome is a rich, editable environment for genome exploration and annotation, accessible locally or remotely on the Web (for feature overview, see Table 1).

xGDBvm-iPlant

We implemented xGDBvm as a VM image on iPlant's Atmosphere cloud platform (https://atmo.iplantcollaborative.org/application), available to registered life sciences researchers (http://www.iplantcollaborative.org/content/acceptable-use-policy). We further customized the VM, taking advantage of iPlant's data and job execution application programming interfaces (APIs), making xGDBvm a one-stop destination for genome annotation and display. Registered iPlant users can create and configure an xGDBvm instance via the Atmosphere control panel and then access the xGDBvm instance via a Web browser to perform all subsequent tasks: validate inputs, run HPC jobs, initiate local workflows, check progress, and view/edit the resulting genome annotation. The genome browser(s) can be made public or private as desired. The following sections detail xGDBvm's functionality in its current version on iPlant Atmosphere.

Figure 1. Overview of xGDBvm as Implemented at CyVerse (iPlant).

xGDBvm is a virtual server environment for gene structure annotation that can be cloned, configured, populated with input data, and run from a Web browser in a few steps, as summarized here.
(A) Log in to the CyVerse Atmosphere Control Panel (https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance (cloned copy) of xGDBvm (2); create a block storage volume for output data and attach it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel, and type a series of commands to set up and configure the new xGDBvm instance, also mounting the Data Store and the attached volume.
(B) Log in to the CyVerse Data Store cloud storage system (https://de.iplantcollaborative.org/de) and upload input data files to an input data directory (accessible to the VM) using a batch uploading tool. Naming conventions are used to identify each input type.
(C) Log in to the xGDBvm instance's GUI using HTTPS via its unique IP address, or using a VNC (1). All subsequent steps are performed using the xGDBvm GUI. Authorize the VM to connect to remote HPC resources via the Agave API (http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters, including remote job execution (optional). xGDBvm will validate files, return expected outputs, and flag any input file errors (3). Initiate automated workflows and monitor progress (4). The workflow sends some data remotely for processing on HPC resources (https://www.xsede.org) managed by Agave APIs and processes other files locally using the attached volume as a scratch disk. The xGDBvm workflow waits for HPC outputs and then proceeds with the annotation process. Output data are written to the external volume and can be accessed from the xGDBvm Web browser as GDB001, GDB002, etc. (5).

Inputs and Data Processing

Figure 3 diagrams the modular architecture used by xGDBvm at iPlant. For managing inputs, xGDBvm uses iPlant's Data Store cloud storage service (http://www.iplantcollaborative.org/ci/data-store), which provides high-capacity storage and tools for quickly uploading user data files. During the xGDBvm configuration process, the user's Data Store home directory is mounted to the VM's file system using iRODS FUSE (http://irods.org), and files uploaded to the Data Store are thus accessible on the VM using Unix file system commands. For output data (alignment files, GFF3 files, sequence indexes, MySQL database tables, configuration files, and archives), the user can attach a block storage volume to the VM via the Atmosphere control panel and mount it to the VM's file system. This data partitioning strategy has the advantage that all data outputs are separate from the VM and do not consume its limited storage capacity, while at the same time providing scalability, as the data transfer for HPC jobs occurs directly with the Data Store. Moreover, the complete xGDBvm display can be reconstituted by mounting the volume to a new xGDBvm instance, which is useful in the event a VM becomes unavailable.

Figure 1. (continued).

In addition to a fully featured genome browser, xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the user's Data Store. For details, refer to the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php).

Figure 2. Data Process Schema.

Input data types (with standardized names as indicated), computational modules, and outputs are shown. Images are screenshots of color-coded track glyph types (gene models, splice alignments) and track flags (quality scores) displayed in the xGDBvm genome browser.

Managing files and ensuring validity of inputs (sanity checks) is a challenge for computational pipelines where multiple inputs of various types and formats may be used. xGDBvm makes use of filename standardization and extensive validation tools to reduce the incidence of input errors. Each input file is required to be named according to its data type and file format (e.g., a name ending in est.fa, preceded by any user prefix, for a FASTA file of EST sequences), and all input files are placed in a single directory whose path is saved as a configuration variable. In addition, output files (including copies of input files) are all named according to the same conventions with the GDB number as a prefix, e.g., GDB001est.fa, and deposited in subdirectories according to their type/process. Once an input path has been specified, xGDBvm displays valid filenames in the input directory according to type, displays predicted output tracks, and alerts the user to any missing files that would compromise output. The user then initiates a script to validate sequence deflines (description lines), error-check IDs, and enumerate file contents, either singly or in batch mode (Supplemental Figure 1). File validity metadata are stored along with a unique file stamp, so files need only be validated once unless modified.
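The naming and validation checks described above can be pictured with a minimal Python sketch. It is not xGDBvm's own validation script; apart from est.fa, the suffix table is hypothetical, and the defline rule (the ID is the first whitespace-delimited token) and the SHA-1 content stamp are assumptions made for illustration.

```python
import hashlib

# Suffix -> data type. Only "est.fa" appears in the text; the other
# entries are hypothetical placeholders for illustration.
SUFFIX_TYPES = {
    "est.fa": "EST",
    "cdna.fa": "cDNA",
    "prot.fa": "protein",
}


def classify(filename):
    """Return the data type implied by the filename suffix, or None."""
    for suffix, data_type in SUFFIX_TYPES.items():
        if filename.endswith(suffix):
            return data_type
    return None


def validate_fasta(text):
    """Count records and collect defline problems in FASTA text."""
    seen, errors, count = set(), [], 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.startswith(">"):
            continue
        count += 1
        tokens = line[1:].split()
        if not tokens:
            errors.append(f"line {line_no}: empty defline")
            continue
        seq_id = tokens[0]
        if seq_id in seen:
            errors.append(f"line {line_no}: duplicate ID {seq_id}")
        seen.add(seq_id)
    return {"records": count, "errors": errors}


def file_stamp(text):
    """Content hash; an unchanged stamp means no revalidation is needed."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()
```

Storing the stamp next to the validation result is one simple way to skip files that have not changed since they were last checked.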

Supplemental Figure 2 shows the complete automated workflow for creating and updating a genome annotation. Typical inputs include a genome sequence assembly and a set of transcript sequences (EST, cDNA, or short read transcript sequence assembly [TSA]) and/or predicted protein sequences in FASTA format. Depending on availability, transcripts may be from the same or a closely related species (Wang et al., 2008). Protein sequences should be from a well-characterized genome as close as possible taxonomically to the target species. With transcript (EST, cDNA, or TSA) inputs, xGDBvm will compute spliced alignments according to user-specified or default parameters using the multithreaded GeneSeqer-MPI spliced alignment program (Usuka et al., 2000), installed locally or on an HPC server with up to 128 cores. For this step, the user can opt to apply repeat masking to the genome sequence using mkvtree/vmatch (Abouelhoda et al., 2002) to reduce computation time, with inclusion of a suitable repeat mask sequence library. Alternatively, the user can provide an N-masked genome file as input. For related-species protein inputs, xGDBvm computes spliced alignments using the GenomeThreader program (Gremme et al., 2005), either locally or on an HPC server. Spliced alignments that meet a quality threshold are ultimately displayed in the xGDBvm genome browser as discrete tracks with standard box-line glyphs to indicate exon/intron boundaries (Figure 2). The user can also provide GeneSeqer and/or GenomeThreader output files created offline as inputs, bypassing the above steps.

Table 1. xGDBvm Features

| Section | Feature | Functions |
| --- | --- | --- |
| Manage | Administrate | Modify password protection; customize site name; administer yrGATE user accounts |
| Manage | Create/configure | Configure new GDB; validate input files; view/edit configuration; initiate, monitor automated workflows; view log files; archive/restore/delete GDB; copy archive to Data Store |
| Manage | Remote jobs | Configure OAuth2 login, job APIs, app IDs; submit jobs; view job status; manage jobs (CyVerse login required) |
| View GDB | GDB home page | GDB summary data; view genome region or search for sequence |
| View GDB | Genome context view | View all tracks by genome segment and region; zoom; jump up- or downstream; view nucleotide level alignments |
| View GDB feature tracks | Gene predictions (loci) | All annotated loci and metadata in tabular view; search/filter queries; yrGATE summaries for each locus; download as csv |
| View GDB feature tracks | Aligned proteins, aligned transcripts | All spliced alignments in tabular views; search/filter queries; download as csv |
| View GDB feature tracks | GAEVAL scores | Detailed gene quality scores for each Gene Prediction track; search/filter queries |
| View GDB tools (genome context view) | Download region | Download any sequence type from region as FASTA; download annotations from region in GFF3 or NCBI format |
| View GDB tools (genome context view) | Download data | Download individual input files, output files (all types), or GDB archive files to the local drive |
| View GDB tools (genome context view) | Search ID or keyword | Search and retrieve FASTA sequence or subsequence (introns, exons, up/downstream) for any feature displayed on GDB |
| View GDB tools (genome context view) | BLAST GDB | Match sequence within GDB |
| View GDB tools (genome context view) | BLAST all GDB | Match sequence across multiple GDB |
| View GDB tools (genome context view) | CpGAT annotate region | Regional gene predictions and quality scores |
| View GDB tools (genome context view) | Add custom track | Add custom track from local GFF3 file |
| View GDB tools (genome context view) | GenomeThreader region | Regional spliced alignment of proteins |
| View GDB tools (genome context view) | yrGATE | Tool for creating/submitting user-contributed annotations, with portals to NCBI ORF finder, NCBI BLAST, GENSCAN, GeneMark, CpGAT |
| View GDB tools (genome context view) | Community central | Searchable list of curated yrGATE (user-submitted) annotations; download annotations (FASTA, GFF3) |
| Annotate | My annotations | Manage user annotations (admin account and login required) |
| Annotate | My groups | View group annotations (admin account and login required) |
| Annotate | My admin | Curate user-submitted annotations (admin account and login required) |
| Help | Help pages | User instructions and video tutorials; also available as contextual help pop-ups |
| Help | xGDBvm wiki (external) | Documentation and instructions for users/admins/developers |
| Help | GitHub repository (external) | Source code, issue tracking, case studies |

Features as implemented on iPlant (CyVerse) Atmosphere cloud service.

The xGDBvm workflow next uses spliced alignment data as input for CpGAT, which assembles gene model predictions for the genome. CpGAT uses EVM (Evidence Modeler; http://evidencemodeler.github.io) (Haas et al., 2008) to evaluate GeneSeqer transcript alignments and/or GenomeThreader protein spliced alignments together with ab initio gene finder results from BGF (http://bgf.genomics.org.cn), GeneMark (http://exon.gatech.edu/GeneMark) (Borodovsky and Lomsadze, 2011), and Augustus (http://bioinf.uni-greifswald.de/augustus) (Stanke et al., 2006), and derives an optimal set of transcript models that are then BLASTed against a reference protein data set (if supplied by the user). In addition, some PASA (Haas et al., 2003) functions are used to aggregate splice variant models where indicated by evidence alignments. Optionally, the user can request repeat masking of the genome prior to ab initio gene prediction. The output from CpGAT is a set of BLAST-filtered or unfiltered gene model structures for each genome segment, complete with coordinates for start/stop codon and predicted untranslated regions where possible, in GFF3 format, which are loaded to the xGDBvm database. Several CpGAT parameters are user-configurable with the xGDBvm GUI, allowing the user to select species model or bypass ab initio gene finders, relax reference protein BLAST filtering, or request repeat masking, and the complete set of CpGAT parameters can be modified by editing the CpGAT configuration file.
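Both the spliced alignment and the ab initio prediction steps can work from a repeat-masked genome, and the user may supply an N-masked genome file directly. As a rough illustration of what N-masking means, the Python sketch below replaces given intervals with N. It does not reproduce Vmatch's behavior; the function names and the 0-based, end-exclusive interval convention are assumptions.

```python
def n_mask(sequence, intervals):
    """Return sequence with each (start, end) interval replaced by 'N'.

    Intervals are 0-based and end-exclusive (an assumption of this sketch);
    coordinates outside the sequence are clipped.
    """
    masked = list(sequence)
    for start, end in intervals:
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


def masked_fraction(sequence):
    """Fraction of positions that are N (0.0 for an empty sequence)."""
    return sequence.upper().count("N") / len(sequence) if sequence else 0.0
```

For example, `n_mask("ACGTACGT", [(2, 5)])` gives `"ACNNNCGT"`, and the masked fraction of that result is 0.375.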

As a final step, xGDBvm calculates the GAEVAL score for each gene model, consisting of a set of statistics representing the degree of congruence of the model with available alignment evidence (http://plantgdb.org/GAEVAL/docs/index.html). GAEVAL also reports alternative splicing evidence and classifies annotation errors into discrete types such as gene fusion, gene fission, etc. GAEVAL data summaries are displayed in xGDBvm as a flag associated with each track glyph (Schlueter et al., 2005).

Users can also upload precomputed genome annotations provided as GFF3 file(s), along with optional transcript and translation FASTA files. These data are displayed in the form of a separate annotation track, with GAEVAL scores calculated as described above. If gene descriptions are available in tabular form, these can also be uploaded to augment gene annotation tracks.
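To give a feel for scoring a gene model by its congruence with alignment evidence, here is a deliberately simplified Python sketch. It is not the GAEVAL algorithm, which reports a richer set of statistics; the single "fraction of introns confirmed" measure, the function names, and the coordinate convention are all assumptions for illustration only.

```python
def introns(exons):
    """Derive intron (start, end) pairs from exon (start, end) pairs.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def intron_support(model_exons, evidence_alignments):
    """Fraction of model introns matched exactly by any aligned sequence.

    Returns None for single-exon models, which have no introns to confirm.
    """
    model_introns = introns(model_exons)
    if not model_introns:
        return None
    confirmed = set()
    for alignment_exons in evidence_alignments:
        confirmed.update(introns(alignment_exons))
    hits = sum(1 for intron in model_introns if intron in confirmed)
    return hits / len(model_introns)
```

A model whose introns are only partly confirmed would score below 1.0 under this toy measure, which is the kind of signal that could flag a locus for manual reannotation.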

xGDBvm Setup, Configuration, and Data Processing

xGDBvm was designed to be easy to configure and run (Figure 1). As a supplement to online help and video tutorials (see below), beginning users can consult the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php), which includes step-by-step instructions and information about how to choose the correct VM size and storage capacity for their particular genome annotation needs.

After instance creation, the user accesses the shell via a terminal emulator or Atmosphere's built-in shell emulator and types a series of simple commands to configure and password-protect the VM environment. Subsequent steps are accomplished using a Web browser connecting to the VM via HTTPS or by connecting to the VM using a virtual network computing (VNC) client (Atmosphere offers a built-in VNC window as well). xGDBvm's hierarchical user interface is organized by task type, i.e., "Manage," "View," "Annotate," and "Help," with submenus under each section. Under "Manage" are "Admin" (manage site passwords, admin emails, and yrGATE users), "Configure/Create" (create or update a genome browser), and "Remote Jobs" (configure and manage remote HPC jobs; see next section). End-user-oriented sections include "View" (browse/analyze genomes) and "Annotate" (submit/manage user annotations). Each section and subsection includes a "Getting Started" page that outlines the suggested workflow along with key links, and one or more "Help" pages with detailed documentation, including video tutorials that can be viewed on the VM. Contextual pop-up help dialogs are also provided for each page/step.

Figure 3. xGDBvm Architecture.

An xGDBvm VM instance, as hosted on the CyVerse Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has separate file system partitions under root (containing the xGDBvm Web GUI, scripts, binaries, and other software) and home (which is configured with mount points for the user's Data Store home directory for data input and a block storage volume for data output). The Agave API hosted by the CyVerse Discovery Environment is used for authentication of the VM via OAuth2 and for management of HPC applications and job submission. A key feature of xGDBvm is the ability to attach and mount the output volume to a different VM and reconstitute the annotation outputs and display. See text for details.

Under Manage → Configure/Create, a user can check volume capacity of the VM, manage license keys for certain installed software, and then consult a decision tree to guide them to the correct data sources, a table of file name conventions, and a guide to CpGAT annotation. Once the data files are in place, the user clicks "Create New GDB," selects a file path pointing to the data input files, enters any nondefault parameters as well as genome metadata, and then saves the configuration setup, which is assigned "Development" status and an ID (GDB001, etc.) that will be associated with the output database (Figure 4A). The user can now click to validate file contents as described above. To initiate data processing, the user selects "Data Process Options" followed by "Create GDB," which changes status to "Locked," initiates the central data processing workflow, and displays a running report of progress together with any errors. The workflow can be aborted at any time by clicking the "Abort" button under "Data Process Options"; this removes all dynamically created directories and kills all associated processes, returning the configuration to "Development" status. On successful workflow completion, GDB status is changed to "Current" and the new genome is added to the "View" menu structure. Input data sets, annotation statistics, and output data sets can be viewed online. Output errors are logged and displayed to the user along with context-specific help dialogs (Supplemental Figure 3).
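The status changes described above can be summarized as a small state machine. The Python sketch below is illustrative only: the status names and the GDB ID format come from the text, while the action names, the class, and the transition table layout are assumptions rather than xGDBvm's actual implementation.

```python
# (status, action) -> new status, following the text: a saved configuration
# starts in "Development"; "Create GDB" locks it; "Abort" returns it to
# "Development"; successful completion yields "Current".
TRANSITIONS = {
    ("Development", "create"): "Locked",
    ("Locked", "abort"): "Development",
    ("Locked", "complete"): "Current",
}


def gdb_id(number):
    """Format a GDB identifier such as GDB001."""
    return f"GDB{number:03d}"


class GDBConfig:
    """Hypothetical record of one saved GDB configuration."""

    def __init__(self, number):
        self.gdb_id = gdb_id(number)
        self.status = "Development"  # status assigned when first saved

    def apply(self, action):
        """Apply an action, rejecting transitions the text does not describe."""
        key = (self.status, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} not allowed in status {self.status!r}")
        self.status = TRANSITIONS[key]
        return self.status
```

Keeping the allowed transitions in one table makes it easy to see, for instance, that a workflow cannot be started twice on a locked configuration.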

Any of several lightweight, preconfigured sample data sets (Supplemental Figure 4) can be loaded with a single button click from the "Create New" page and then saved and processed to a finished GDB in no more than a few minutes. Because these examples cover the complete range of processes and workflows in the xGDBvm code, they also serve as functional tests when first setting up an xGDBvm instance or modifying its code.

High-Performance Computing Option

On multiprocessor VMs, xGDBvm automatically invokes parallel processing where possible for certain computational steps (Supplemental Figure 1). This can speed up spliced alignment and genome annotation (CpGAT) jobs, in that more than one genome segment can be evaluated concurrently on separate processor threads. As an alternative for even more processing power, xGDBvm is capable of sending input data for spliced alignment jobs to high-performance computing facilities, either as a standalone job or as part of an annotation workflow. For this option, the user's input data must be on a VM-mounted iPlant Data Store directory and assigned to a GDB with "Development" status. GeneSeqer-MPI and GenomeThreader binaries, along with wrapper scripts for job submission to an HPC server, are installed in iPlant's Discovery Environment (https://de.iplantcollaborative.org/de) as executable apps. Client access to HPC resources and apps is managed via the Agave API (Dooley et al., 2012; http://agaveapi.co), which provides an open-source platform for interacting with computational resources that are managed under the XSEDE system (https://www.xsede.org). xGDBvm uses Agave's implementation of the OAuth2 (http://oauth.net) standard for authorization and subsequent authentication to use apps. Under Manage → Remote Jobs, users first submit their iPlant user name/password in return for OAuth2 credentials that are stored securely on the VM and allow access to remote applications (GeneSeqer-MPI and GenomeThreader). The user can then log in and obtain a temporary access token and refresh token for authentication. The VM-cached refresh token is also used by local scripts to reauthenticate API access during automated workflow processing. The user can select the app size (i.e., number of processors) for optimal efficiency given their genome size and complexity, and then return to the GDB Configuration page, select the "remote" option for spliced alignment, and initiate the automated workflow. The xGDBvm workflow script copies relevant input data (genome, transcript, and/or protein) to a temporary directory on the user's mounted Data Store directory and issues a job submission command via cURL (https://curl.haxx.se) to a custom wrapper script (Figure 3). The wrapper script accepts parameters, splits and indexes input files as appropriate for multiple processors, and then issues a command to launch GeneSeqer-MPI or GenomeThreader on the specified HPC server cluster. The xGDBvm workflow updates remote job status periodically using a callback URL to xGDBvm and/or email notification service. Output data are copied to a specified subdirectory on the user's Data Store directory, where xGDBvm's workflow can access them for further processing. Remote job details and status are tracked by xGDBvm, and users can access job lists, query remote job status, and kill a remote job using the Manage → Remote Jobs GUI (Figure 4C).

Remote GeneSeqer or GenomeThreader spliced alignment jobs can also be run as a standalone process via Manage → Remote Jobs. Output is archived on the user's Data Store directory, and xGDBvm can be directed to evaluate the output and copy output files to an input directory for inclusion in workflow processing.
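The wrapper script's step of splitting input files for multiple processors can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the actual wrapper: the round-robin strategy and the function names are hypothetical, and the indexing step is omitted.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (defline, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_round_robin(records, n_parts):
    """Distribute records over n_parts lists, one per processor."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return parts
```

Round-robin assignment keeps the number of sequences per part nearly equal; a real splitter might instead balance by total sequence length, since alignment time depends on more than record count.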

Logging/Troubleshooting

Each step in xGDBvm's computational workflow script (Supplemental Figure 2) is displayed dynamically during automated workflow operation and saved in a process log. Common errors (e.g., mismatch in data input/output, incorrect format, duplicate IDs) are flagged and logged in an error file along with user hints to remedy the problem (Supplemental Figure 3). A separate file is created for logging CpGAT progress.

Outputs and Data Analysis Tools

xGDBvm displays the output of workflow processing as schematized glyphs organized into color-coded tracks in a full-featured genome browser (Figure 5). Standard tracks include EST, cDNA, TSA, and protein spliced alignments, precomputed and CpGAT gene predictions, and regions that have been repeat masked or assigned as spacer regions (N-substituted). Additional user-generated tracks include yrGATE annotations and region-specific CpGAT annotations. Advanced users can create unlimited additional tracks by manually populating new data tables and modifying configuration files. The xGDBvm genome browser has track features similar to those currently available at http://plantgdb.org (zoom/scroll, show/hide or reorder tracks, change font size, view base pair level). The genome browser also includes a suite of analysis tools, including search and retrieve for sequence or subsequence regions (introns, exons, up/downstream regions), NCBI-BLAST for sequence queries within or across genomes, region-specific GenomeThreader and CpGAT tools, and the ability to add a custom track from a local GFF file. Complementing the Genome Context View are searchable tabular views for each Feature Track type, ordered by genome position. The Gene Models table displays annotated loci along with structural metadata, similarity descriptions, GAEVAL gene quality/coverage, and yrGATE annotation status (see below). The Aligned Proteins and Aligned Transcripts tables display splice-aligned sequences of each type, with filters for alignment quality/coverage and links to alignment details. A separate page for GAEVAL Scores displays comprehensive gene quality data based on comparison of gene predictions with alignment evidence and offers multiple search filters.

Figure 4. xGDBvm Data Management.

All inputs, outputs, and archives (see below) are stored hierarchically under /xGDBvm/data/GDBnnn/data/, and they are also available for download to local storage using the VM’s GUI (View → GDBnnn → Data Download). Using this download service, the user could, for example, retrieve GFF-formatted annotation outputs from CpGAT for use in further analysis or display on a different genome browser. Data files can also be copied to the Data Store, either manually or by creating and copying a GDB Archive (see below).
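As a small example of the kind of further analysis a downloaded GFF file allows, the sketch below tallies records by feature type. It assumes standard nine-column, tab-separated GFF3; the sample records, including the "CpGAT" source label in column 2, are invented for illustration and are not actual xGDBvm output.

```python
from collections import Counter
from typing import Iterable


def count_gff3_feature_types(lines: Iterable[str]) -> Counter:
    """Count GFF3 records by feature type (column 3), skipping comments."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:  # GFF3 records have nine tab-separated columns
            continue
        counts[fields[2]] += 1
    return counts


# Hypothetical records for a single two-exon gene model
sample_gff3 = [
    "##gff-version 3",
    "scaffold_1\tCpGAT\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold_1\tCpGAT\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "scaffold_1\tCpGAT\texon\t100\t400\t.\t+\t.\tParent=mrna1",
    "scaffold_1\tCpGAT\texon\t600\t900\t.\t+\t.\tParent=mrna1",
]
print(dict(count_gff3_feature_types(sample_gff3)))
```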

Updating or Adding Tracks

In cases where the user may wish to append or replace data, xGDBvm includes an “Update” branch to the data workflow, allowing any track to be appended or replaced. The user sets an “Update” flag on the configuration page, specifies a directory where update data resides, and selects the data type(s) and update action(s) desired. The user then clicks “Update,” which adds or replaces data inputs and reruns appropriate scripts to update the genome data tables, indices, and display. All update actions are logged in the same way as a new GDB, appended to the same process log.

The xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki) includes complete instructions for adding additional annotation or alignment tracks beyond the five standard tracks available. Users familiar with MySQL and the necessary computational steps can completely customize an instance of xGDBvm using precomputed data as inputs.

Managing xGDBvm Data Sets

Output data sets can be managed on the Manage → Config/Create → Archive/Delete page (Figure 4B). For archiving a GDB, the entire output directory tree is compressed as a tar archive and stored in an archive directory under /xGDBvm/data/ArchiveGDB/, and the archive can be copied to the user’s Data Store with a single click. If the corresponding GDB is later dropped (see below) or becomes corrupted, the archive can be readily restored using the “Restore from Archive” button. GDB archives also facilitate sharing data with other researchers, who can use the “Restore from Archive” function to load any archive to their own VM. In addition, all GDB can be archived together using the “Archive All” function. Any “Current” xGDBvm database can be discarded using the “Drop” button. This removes all GDB-associated directories and their output data but preserves the GDB ID and its stored configuration data, allowing users to build on the previous configuration or restore (see above) a GDB. Finally, the most recently added GDB can be deleted using “Delete,” or all GDB can be deleted using “Delete All.”
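The archive and restore cycle described above can be sketched with Python's standard `tarfile` module. This is an illustration of the idea only, not the xGDBvm implementation: the function names, the use of gzip compression, and the directory layout inside a temporary folder are all assumptions.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(source_dir: Path, archive_path: Path) -> Path:
    """Compress a directory tree into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name)
    return archive_path


def restore_archive(archive_path: Path, destination: Path) -> None:
    """Unpack an archive created by archive_directory into destination."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(destination)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical output tree standing in for one GDB
    gdb = root / "GDB001"
    (gdb / "data").mkdir(parents=True)
    (gdb / "data" / "example.gff3").write_text("##gff-version 3\n")

    archive = archive_directory(gdb, root / "GDB001.tar.gz")

    # Restore into a separate location, as another VM would
    restored_root = root / "restored"
    restored_root.mkdir()
    restore_archive(archive, restored_root)
    restored_file = restored_root / "GDB001" / "data" / "example.gff3"
    restored_ok = restored_file.read_text() == "##gff-version 3\n"
    print("restored:", restored_ok)
```

Because the whole tree travels as one file, the same archive serves both as a backup and as the unit of exchange between researchers.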

Reannotating with yrGATE

A key feature of xGDBvm is the ability to flag low-quality gene structures and improve them in place by manual reannotation. For each genome displayed on xGDBvm, the “Gene Models” page provides filters to select high-coverage/low-integrity models (based on GAEVAL quality score and coverage) that might be improved by manual inspection (Figure 6A). Users can create an annotation login account and correct, confirm, or disqualify any gene prediction using the yrGATE annotation tool (Wilkerson et al., 2006; Figure 6B). The yrGATE tool offers point-and-click simplicity for building a gene structure, enhanced by dynamic reporting of GAEVAL scores to guide the user to the best possible model based on evidence alignments. yrGATE includes curation tools for users who are assigned Administrator status, providing a quality check for submitted annotations prior to their display. All reannotation and curation steps are performed in a single browser window with portals to NCBI BLAST and other analysis tools, and users can manage their own annotations (save, submit for curation, delete) on the “Community Central” pages. Administrative features include the ability to assign users to annotation working groups, track annotation totals for each user, and configure one or more email addresses for administrative notification. Once curated, yrGATE annotations are displayed as a separate track in the xGDBvm genome browser with color-coding to indicate reannotation class (Figure 6B), and these can be downloaded in GFF3 or FASTA format.

Figure 4 (continued).

(A) Screenshot of the GDB Configuration page, set up for processing Example data. Each genome annotation is assigned a unique identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields for input data path, annotation parameters, and metadata, this page provides extensive color-coded information about all system settings (e.g., license keys, storage capacity, and login status, displayed in blue-green), input data validity (light green), and expected output (orange). The form includes buttons that launch modal windows to initiate computational workflow or edit configuration.
(B) Screenshot of Archive/Delete menu showing genome databases with “Current” (blue; computation complete) or “Development” (gray; not yet run) status. Genome annotations are identified as GDB001, GDB002, etc. Each table row displays information about a GDB, including time stamps, as well as action buttons that allow the user to drop, delete, archive, delete archive, or copy database (see text for details). Global action buttons (top right) allow the user to delete or archive all data on the VM.
(C) Screenshot of “List All Jobs” page with tools to monitor and manage remote HPC jobs. The page displays IDs, job metadata, time stamps, color-coded status indicators, and action buttons to manage output (Stop Job, Delete Job, View Logs, Copy Output) via the Agave API. See text for details.

Figure 5. Genome Context View.

Shown is a typical region from the C. rubella genome annotation described in Results. Genome span is shown in yellow, and genome features (tracks) are as labeled to the left and above each track. Drag-and-drop reorder and “hide track” features are implemented here. Top bar provides search and navigation controls; left bar contains links to tools and views as well as to configuration and help pages. Region submenu (orange) contains zoom/scroll, region-specific tools, and formatting controls. See Table 1 for details of xGDBvm tools and features.


Benchmarking xGDBvm

Whole-Genome Annotation

Capsella rubella is an Arabidopsis thaliana relative with a sequenced genome totaling 134.8 Mb (Slotte et al., 2013). We evaluated xGDBvm as a tool for new genome annotation using the C. rubella genome assembly (see Methods for sequence sources and parameters). We obtained both Arabidopsis cDNA sequences and Arabidopsis predicted proteins as input for evidence alignments. We first computed high-quality transcript and protein spliced alignments using the standalone HPC job submission tool in an xGDBvm instance at iPlant. The GeneSeqer-MPI job (8 processors with 64 threads) and GenomeThreader job (2 processors with 12 threads) finished in 7 h and 1 h, respectively. These outputs were used as input for an annotation workflow (with CpGAT option selected) in xGDBvm. The CpGAT reference data set was the entire set of UniRef90 Viridiplantae proteins (see Methods). In addition, the C. rubella annotation data set (in GFF3 format) was uploaded to xGDBvm for comparison. The annotation of 873 scaffolds was completed in 12 d on a single core processor VM with 4 GB RAM. The results are shown in Table 2. xGDBvm completed 49,947 cDNA spliced alignments and 28,595 protein spliced alignments. The CpGAT annotation generated 25,498 gene models, compared with 28,447 gene models from the published C. rubella annotation. A total of 4368 loci from the published annotation had no match in the CpGAT set (as determined by overlap), while 861 loci were unique to CpGAT. Comparison of 19,892 loci with gene models from both CpGAT and the published annotation using ParsEval (Standage and Brendel, 2012) revealed a high level of congruence between the two data sets. More than 60% of the gene models compared had identical coding sequences. At the level of individual exons, the sensitivity (true positive rate) was 69% and the specificity (true negative rate) was 68%, or 89% and 88%, respectively, if restricted to coding exons. At the level of individual nucleotides, the sensitivity and specificity were 97% and 96%, respectively. These data demonstrate the reliability of CpGAT as a workflow for producing a provisional genome annotation (our purpose is not to present a detailed comparison of these two annotations; the respective evidence alignment data sets and thresholds were likely not identical, making such detailed analysis complex).
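The phrase "as determined by overlap" can be made concrete with a short sketch. The code below is not ParsEval or any part of xGDBvm; it is a simplified illustration that treats a locus as matched when it shares at least one base with a locus of the other set on the same sequence. Ignoring strand, the tuple layout, and the example coordinates are all assumptions.

```python
from typing import List, Tuple

Locus = Tuple[str, int, int]  # (sequence ID, start, end), 1-based inclusive


def overlaps(a: Locus, b: Locus) -> bool:
    """True if two loci lie on the same sequence and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def unmatched(query: List[Locus], reference: List[Locus]) -> List[Locus]:
    """Return query loci that overlap no locus in the reference set."""
    return [q for q in query if not any(overlaps(q, r) for r in reference)]


# Hypothetical coordinates for two small annotation sets
published = [("scaffold_1", 100, 500), ("scaffold_1", 2000, 2600), ("scaffold_2", 50, 400)]
cpgat = [("scaffold_1", 450, 900), ("scaffold_2", 1000, 1500)]

print("published only:", unmatched(published, cpgat))
print("CpGAT only:", unmatched(cpgat, published))
```

The pairwise comparison is quadratic in the number of loci, which is adequate for an illustration; a genome-scale comparison would normally sort loci or use an interval index.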

Reannotation of Low-Quality Predictions

We evaluated GAEVAL gene quality for the C. rubella annotation data set on a locus basis by setting a locus table filter for average integrity <75% and coverage >75%. This filter resulted in 254 questionable loci with likely annotation errors for CpGAT models compared with 558 questionable models in the published annotation set (Table 2). This subset represents models for which reannotation has a high probability of improving gene prediction via the yrGATE tool. We chose an example of a locus from the published annotation that was flagged by GAEVAL as possibly erroneous, Carubv1011418mg (Figure 6). The CpGAT annotation for this region was split into two distinct, complete gene structures, identified as scaffold_1g5t1 and scaffold_1g6t1. Using the yrGATE tool, we confirmed the CpGAT models as more accurately representing the evidence alignments (dark and light-green tracks in Figure 6B).
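The locus filter used here (integrity below 75%, coverage above 75%) amounts to a two-condition selection. The sketch below shows that logic on invented records; the `LocusScore` class, its field names, the percent scale, and the locus IDs are hypothetical and do not reflect the xGDBvm database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocusScore:
    locus_id: str
    integrity: float  # percent, 0-100 (assumed scale)
    coverage: float   # percent, 0-100 (assumed scale)


def questionable(loci: List[LocusScore], max_integrity: float = 75.0,
                 min_coverage: float = 75.0) -> List[LocusScore]:
    """Select loci with integrity below and coverage above the thresholds."""
    return [
        locus for locus in loci
        if locus.integrity < max_integrity and locus.coverage > min_coverage
    ]


# Hypothetical scores
scores = [
    LocusScore("locus_A", integrity=40.0, coverage=90.0),   # flagged
    LocusScore("locus_B", integrity=95.0, coverage=98.0),   # well supported
    LocusScore("locus_C", integrity=60.0, coverage=50.0),   # too little evidence
    LocusScore("locus_D", integrity=74.9, coverage=75.1),   # flagged
]
flagged = questionable(scores)
print([locus.locus_id for locus in flagged])
```

Requiring high coverage restricts the list to models where enough alignment evidence exists for a low integrity score to be meaningful, which is why these loci are good candidates for manual reannotation.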

Genome Region

Another use for xGDBvm is to annotate a genome segment containing a specific gene or region of interest. This would typically be a rapid turnaround analysis compared with whole-genome analysis and thus could be performed using internal computing resources, possibly repeatedly under different parameter regimes. As an example, we used a Setaria italica predicted protein annotated as “stem-specific protein TSJT1-like” as a tBLASTn query against the Musa acuminata subsp malaccensis whole-genome sequence data in GenBank. We retrieved a contig (839) that contained a region of high similarity to this sequence (see Methods). We then configured xGDBvm inputs consisting of Musa genomic contig 839, the current M. acuminata EST data set from GenBank, and the predicted protein translations from the annotated genome of a related monocotyledonous plant species, Brachypodium distachyon (http://www.brachypodium.org). The workflow included gene prediction using CpGAT with UniRef90 proteins from Viridiplantae as a reference data set (see Methods). The CpGAT output included 4 evidence-based loci and 12 ab initio predicted genes, including a model fully supported by transcript alignment in the region with high similarity to XP_004977556 (Supplemental Figure 5).

xGDBvm Implementation

iPlant

xGDBvm has been deployed as a public image on iPlant’s Atmosphere Cloud Service (https://atmo.iplantcollaborative.org/application). Researchers can launch an xGDBvm instance and explore it once they have obtained an iPlant user account (https://user.iplantcollaborative.org/register) using an institutional email address. An iPlant account also grants the user a home page on iPlant’s Data Store. Step-by-step instructions for setting up xGDBvm, available at http://goblinx.soic.indiana.edu/wiki/doku.php?id=user_instructions, can be summarized as follows: (1) in the Atmosphere Control Panel, find the latest xGDBvm image, launch an instance, and attach an external block storage volume using drag-and-drop; (2) access the instance’s secure shell using iPlant credentials and type simple commands to update xGDBvm code, set a Web password, initialize IRODS/FUSE, mount external storage, and launch a configuration script; and (3) access the VM’s GUI via HTTPS or VNC and follow instructions there to configure/create a genome annotation.

Indiana University

xGDBvm has also been implemented on a “production” virtual server at Indiana University, serving as a host for PdomGDB (http://goblinx.soic.indiana.edu/PdomGDB), a genome database for Polistes dominula (European paper wasp), as well as the test data sets described here (see Data Access). PdomGDB provides a showcase for the xGDBvm platform, including the addition of extra nonstandard feature tracks created using methods outlined in the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php?id=configure_new_track). PdomGDB is actively being updated by the P. dominula research community using the yrGATE tool for contributing expert-curated gene annotations as described in this manuscript (accepted submissions are accessible at http://goblinx.soic.indiana.edu/yrGATE/GDB001/CommunityCentral.pl). This website also includes general information on the xGDBvm project on the project home page (http://goblinx.soic.indiana.edu/index.php).

Figure 6. Gene Model Improvement Using yrGATE.

(A) A published gene model from C. rubella (Carubv1011418mg) showing high coverage/low integrity in the Locus Table (upper table, highlighted columns).
(B) Corresponding gene model in genome context view (blue glyph). CpGAT annotated this region as two distinct loci (magenta glyph), backed up by both Arabidopsis protein (black) and cDNA (light blue). The region was then reannotated using yrGATE (dark and light green glyphs) to confirm the most probable genic structure of this region based on available evidence. yrGATE glyphs are color-coded according to the type assigned by the annotator, e.g., dark green (improved structure) and light green (new structure not previously annotated).

Table 2. Annotation of the C. rubella Genome

Genome Segments; Total Length (bp); Arabidopsis cDNA Spliced Alignments; Arabidopsis Protein Spliced Alignments; CpGAT Gene Predictions; Published Gene Predictions (a)

Total; Cognate (b); Transcripts; Loci; Questionable (c); Transcripts; Loci; Questionable (c)

853; 134,834,574; 49,947; 44,870; 34,629; 25,498; 22,698; 254; 28,447; 26,521; 558

See also http://goblinx.soic.indiana.edu/GDB002 for data display and download.
(a) Source: ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz
(b) The single location with the best alignment score for a given query sequence.
(c) Less than 75% integrity score and greater than 75% coverage based on GAEVAL analysis (see Methods).
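To put the "Questionable" counts in Table 2 in proportion, the short calculation below expresses them as a percentage of loci in each annotation set. It assumes the Loci columns of the table read 22,698 (CpGAT) and 26,521 (published); the dictionary layout and function name are illustrative only.

```python
# Values transcribed from Table 2 (loci and questionable loci per annotation set);
# the column assignment is an assumption based on the table layout.
table2 = {
    "CpGAT": {"loci": 22698, "questionable": 254},
    "Published": {"loci": 26521, "questionable": 558},
}


def questionable_percent(loci: int, questionable: int) -> float:
    """Percentage of loci flagged as questionable."""
    return 100.0 * questionable / loci


percentages = {name: questionable_percent(**row) for name, row in table2.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% of loci flagged")
```

Under that reading, roughly 1% of CpGAT loci and 2% of published loci meet the filter, so the candidates for manual review are a small fraction of either set.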

Public Repository

The xGDBvm project maintains a presence at http://brendelgroup.github.io/xGDBvm. The xGDBvm-specific software can be accessed and updated from https://github.com/BrendelGroup/xGDBvm, where developers can contribute via pull requests and users can screen pending issues and report new ones. xGDBvm is licensed under GNU General Public License version 3. The repository includes case studies that illustrate real-world projects implemented using xGDBvm (https://github.com/BrendelGroup/xGDBvm/tree/master/case-studies).

DISCUSSION

xGDBvmrsquos Utility

As an all-in-one solution to genome annotation and analysis, xGDBvm is unique among currently available packages. Configured as a virtual server with a complete GUI and HPC capabilities, xGDBvm removes barriers to entry imposed by extensive software installation, testing, and troubleshooting and by command-line operation. The xGDBvm GUI guides inexperienced users by presenting only actionable choices and instructions at each step, as well as providing preinstalled sample data sets, input data validation, error flagging, and extensive help pop-ups. Data management is handled entirely within the xGDBvm environment, allowing the user to focus on the overall annotation task rather than managing intermediate input/output files. The resulting website can be either public or password-protected, as desired, and the contents can be archived, shared, or exported for display using other genome display platforms. We expect that this combination of features will make xGDBvm attractive to research groups with a desire to annotate genome data but limited access to informatics support.

There are several use cases for xGDBvm in its current implementation at iPlant: (1) researchers with a newly assembled genome who can quickly align relevant transcript assembly and/or protein data to determine probable gene location and then perform gene structure computation on either a portion of the genome or the genome in its entirety, resulting in a “first pass” genome annotation; (2) researchers with a recently annotated genome who wish to share it and improve annotation quality via community annotation; (3) researchers who wish to create their own copy of a “finished” genome annotation in order to run gene quality analyses with up-to-date transcript data and/or carry out targeted or general reannotation; and (4) instructors desiring a hands-on environment for exploring the principles of genome annotation with real data and access to HPC resources.

In scope, xGDBvm provides an easy-to-use and versatile platform for annotating and analyzing genomes at various stages of completion. At one extreme, a finished genome can be loaded from data files available online, giving the user complete freedom to analyze and reannotate genes previously published. At the opposite extreme, a newly assembled genome can be loaded together with related-species data and/or short read assemblies, and CpGAT can be invoked to automatically build a credible draft genome annotation for further analysis. With any implementation, the powerful built-in tools for gene quality analysis and reannotation make xGDBvm a valuable asset for improving genome structure annotation as well.

Another advantage of xGDBvm is its flexibility, as it allows multiple genome views to be created in one instance and supports updates to any type of existing data. Finally, xGDBvm provides extensive documentation of the annotation and update process, important both for troubleshooting and for reporting results.

Comparison to Similar Tools

Other cloud-based annotation tools are available. Maker (http://www.yandell-lab.org/software/maker.html) is a eukaryotic genome annotation pipeline that can be installed in a variety of server environments (Cantarel et al., 2008), and a version of Maker (Maker-P) is installed at iPlant Atmosphere as a virtual machine with links to HPC (https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+at+iPlant). The Web-based genome analysis platform Galaxy (Goecks et al., 2010) offers cloud installation via Amazon’s Elastic Compute Cloud (EC2) service (https://aws.amazon.com/ec2). xGDBvm differs from these tools in that it offers a comprehensive package combining a structured environment for data inputs, automated data processing with sanity checks, and tools for genome display, search, and reannotation built in.

Limitations

As currently configured, xGDBvm is unable to map short read data onto a genome, so users will need to assemble short reads de novo prior to submitting data to xGDBvm as a TSA data set. xGDBvm’s computational workflow can currently accommodate only one track per spliced alignment data type (EST, cDNA, TSA, protein) and two tracks for gene model predictions. Users who require additional tracks must configure them manually. xGDBvm’s HPC processes are currently limited to spliced alignment computations, whereas gene structure annotation via CpGAT is limited by the processing power of the VM.

VM availability and usage at iPlant, as well as access to HPC resources, can be expected to be limited based on overall capacity and the amount of demand on the respective systems. Users wishing to increase their usage quotas may be required to justify their request.

Future Directions

xGDBvm is still being developed and improved. The road map includes additional features, such as modular data workflows allowing unlimited track numbers and additional options for gene annotation and evaluation. xGDBvm’s implementation of the Agave API should facilitate the addition of new standalone or pipeline-integrated computation tools that can take advantage of high-performance processing (e.g., Maker). We also envision integrating xGDBvm with other analysis platforms, including one that allows visualization of common introns (Wilkerson et al., 2006).


METHODS

xGDBvm Architecture and Software

The xGDBvm architecture is shown in Figure 3, and a more detailed description can be found in the wiki (http://goblinx.soic.indiana.edu/wiki). We currently maintain two parallel implementations of xGDBvm: one at Indiana University (xGDBvm-GoblinX) on a virtual server using Red Hat Enterprise Linux (http://www.redhat.com) and the other on the iPlant Atmosphere platform (xGDBvm-iPlant) using CentOS Linux (https://www.centos.org). Both implementations run Apache Web server (http://www.apache.org) with very similar configurations, but xGDBvm-iPlant also includes OpenSSL (https://www.openssl.org) and Apache’s mod_ssl for secure access over HTTPS. Additional software includes MySQL client/server software (https://www.mysql.com), Perl (http://www.perl.org), and PHP (http://php.net) to handle Web scripts and some server-side functions, with additional Perl modules for CGI and session management. Installed JavaScript libraries include jQuery and jQuery UI (https://jquery.com). BioPerl (http://www.bioperl.org/wiki/Main_Page) and EMBOSS (http://emboss.sourceforge.net) were installed to handle certain operations. Additional binaries, including NCBI-BLAST+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+), as well as the computation-related software described earlier, were installed under /usr/local/bin/ or /usr/local/src/ (see Supplemental Table 1 for a complete list of installed binaries).

The document root directory is /xGDBvm/ under the VM’s root partition. xGDB scripts (modified from Schlueter et al. [2006]), PHP scripts, and other assets (JavaScript files, CSS files, and images) were installed under /xGDBvm/XGDB/ and administrative scripts under /xGDBvm/admin/. Workflow-related shell scripts are found under /xGDBvm/scripts/, and custom yrGATE, GAEVAL, and CpGAT packages were installed under /xGDBvm/src/. The entire document root contents (excluding binaries) are maintained as a public repository at GitHub (https://github.com/BrendelGroup/xGDBvm).

The xGDBvm architecture is designed to segregate input data, dynamically generated output data, and static Web scripts that comprise the xGDBvm core (Figure 3). The user’s Data Store directory (for inputs, segregated under a common subdirectory, xgdbvm/) and block storage volume (for outputs) are mounted under /home/xgdb-input/ and /home/xgb-data/, respectively. These are symbolically linked to paths under the document root (/xGDBvm/input/ and /xGDBvm/data/), and all xGDBvm scripts reference these data paths for reading and writing data. Data destination directories are assigned ownership by group “xgdb” with read-write privileges, and the “apache” user is added to the “xgdb” group under /etc/group. Temporary data are saved to /xGDBvm/data/tmp/.
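The symbolic-link arrangement described above can be illustrated in miniature. The sketch below recreates the pattern inside a temporary directory: two stand-in "mount" directories are linked under a stand-in document root, and a file written through the document-root path ends up on the mounted volume. The directory names are placeholders chosen for the example, and nothing here touches the real xGDBvm paths or group permissions.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-ins for the mounted Data Store and block storage volume (hypothetical names)
    mounts = {
        "input": root / "home" / "input-mount",
        "data": root / "home" / "data-mount",
    }
    doc_root = root / "docroot"
    doc_root.mkdir()
    for name, target in mounts.items():
        target.mkdir(parents=True)
        # Link docroot/<name> to the mounted directory
        (doc_root / name).symlink_to(target, target_is_directory=True)

    # A script writes through the document-root path...
    (doc_root / "data" / "out.txt").write_text("result\n")
    # ...and the file lands on the mounted volume.
    written_to_mount = (mounts["data"] / "out.txt").read_text() == "result\n"
    print("written to mount:", written_to_mount)
```

Keeping scripts pointed at stable document-root paths means the underlying storage can be detached, replaced, or enlarged without changing any script.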

To provide secure transactions where passwords are being sent over the Web, xGDBvm-iPlant enforces HTTPS (with self-signed cert) on all pages. Website password protection via .htaccess is required upon initial configuration, so only users who have the password can view the website online. Password protection can also be modified using the xGDBvm “Admin” GUI to include just the “Manage” functions (Admin, Configure/Create, and Remote Jobs); in this configuration, the VM’s genome browsers and data download sections are public. The backend MySQL password can also be customized via the GUI for additional site security. Web access to the mounted storage directories is blocked by the Apache configuration, so the user’s mounted disks are not exposed on the Internet. Certain VM assets (OAuth2 credentials, MySQL password) are stored under /xGDBvm/admin/, which is protected via the Apache configuration.

Benchmarking xGDBvm

The hardmasked Capsella rubella assembly (Slotte et al., 2013) was downloaded from the Joint Genome Institute (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/assembly/Crubella_183_hardmasked.fa.gz; user account required). Arabidopsis thaliana cDNA FASTA sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/?term=("mrna+NOT+est"[Filter])+AND+Arabidopsis+thaliana[Organism]). Predicted protein translations were obtained from the Arabidopsis TAIR10 genome release (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release). UniRef90 proteins from Viridiplantae were retrieved in FASTA format from UniProt (http://www.uniprot.org/uniref/?query=uniprot:(taxonomy:viridiplantae)+identity:0.9) and the file renamed as UniRef90-Viridiplantae.fa. A genome annotation based on these input data was created on an xGDBvm instance at iPlant with two CPUs and 4 GB RAM (https://atmo.iplantcollaborative.org/application). xGDBvm’s GeneSeqer parameters were species model, Arabidopsis; alignment stringency, strict. CpGAT parameters were BGF, Arabidopsis; Augustus, Arabidopsis; GeneMark, a_thaliana; Skip Mask = T. For comparison, the current C. rubella annotation (GFF3) was downloaded (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz) and included as input in the genome workflow. Additional spliced alignment benchmarking and case studies used GeneSeqer-MPI and GenomeThreader running on high-performance computing systems at the Texas Advanced Computing Center (https://www.tacc.utexas.edu), accessed from xGDBvm as public apps via the Agave API.

For the second use case, we queried the NCBI whole-genome shotgun sequence (wgs) library for Musa acuminata subsp malaccensis (banana; http://www.ncbi.nlm.nih.gov/assembly/GCF_000313855.1) using tblastn (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=tblastn) with a Setaria italica predicted protein (XP_004977556.1). M. acuminata contig 839 (GenBank accession CAIC01023586.1) was retrieved from NCBI (http://www.ncbi.nlm.nih.gov/Traces/wgs/fdump.cgi?CAIC0123586); the resulting file was named Musa_contig_839gdna.fa, and the FASTA header was simplified to “>Musa_contig839.” M. acuminata EST sequences in FASTA format were retrieved from NCBI (http://www.ncbi.nlm.nih.gov/nucest/?term=Musa_acuminata[Organism]) and renamed as musa_est.fa. UniRef90 proteins from Viridiplantae were retrieved in FASTA format as described above. xGDBvm’s GeneSeqer parameters were species model, rice; alignment stringency, strict. CpGAT parameters were BGF, rice; Augustus, maize; GeneMark, o_sativa; Skip Mask = T.
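The header simplification mentioned above is a one-line change, but it is easy to script. The following sketch replaces the header of a single-record FASTA file with a short ID; the function name and the example header text are hypothetical, and only the target ID "Musa_contig839" comes from the text.

```python
from typing import Iterable, Iterator


def rename_single_fasta_header(lines: Iterable[str], new_id: str) -> Iterator[str]:
    """Yield FASTA lines with the header replaced by '>' + new_id.

    Raises ValueError if more than one record is present, because reusing a
    single replacement ID would create duplicate IDs.
    """
    seen_header = False
    for line in lines:
        if line.startswith(">"):
            if seen_header:
                raise ValueError("expected a single-record FASTA file")
            seen_header = True
            yield ">" + new_id
        else:
            yield line.rstrip("\n")


# Hypothetical long header as it might arrive from a sequence database
original = [
    ">hypothetical_accession Musa acuminata contig, whole genome shotgun sequence",
    "ACGTTGCAACGT",
    "GGCCAATT",
]
simplified = list(rename_single_fasta_header(original, "Musa_contig839"))
print(simplified[0])
```

Short, unique IDs without spaces tend to survive intact through alignment and annotation tools, which is one reason to simplify headers before loading data.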

Data Access

Data sets described under Benchmarking can be viewed and downloaded from the xGDBvm project pages at http://goblinx.soic.indiana.edu/GDB002 (C. rubella genome) and http://goblinx.soic.indiana.edu/GDB003 (M. acuminata contig 839). A list of all Web resources referenced in this manuscript is found in Supplemental Table 2.

Supplemental Data

Supplemental Figure 1. Input data validation.

Supplemental Figure 2. The xGDBvm automated workflow.

Supplemental Figure 3. Output data validation.

Supplemental Figure 4. Preconfigured example data sets.

Supplemental Figure 5. Annotation of a single genomic contig.

Supplemental Table 1. xGDBvm installed software.

Supplemental Table 2. Hyperlinks referenced in the manuscript.

ACKNOWLEDGMENTS

We thank Ann Fu for help with initial development of the automated workflow, Shannon Schlueter for advice in adapting his XGDB core code for the virtual environment, James Denton for extensive debugging and yrGATE feature development, Jianqing Guan for code to calculate dynamic GAEVAL scores, and Bruce Shei for system support at Indiana University. We especially thank collaborators and colleagues at the iPlant Collaborative (CyVerse) and Texas Advanced Computing Center (TACC) for their assistance in integrating xGDBvm into the Atmosphere cloud environment and the Agave API: Roger Barthelson and Shabari Subramaniam, who wrote and tested HPC wrapper scripts for GeneSeqer-MPI and GenomeThreader, respectively; Andre Mercer, who provided prototype PHP scripts for the API; and Edwin Skidmore, Rion Dooley, and Matthew Vaughn, who provided system troubleshooting and advice. This work was supported by National Science Foundation Award 1221984 to V.P.B.

AUTHOR CONTRIBUTIONS

V.P.B. conceived the project and provided overall guidance. J.D. carried out the project and managed collaborations. D.S.S. tested xGDBvm functionality with actual data sets, configured and extended a production xGDBvm server, ran ParsEval comparisons, and contributed some parsing scripts. N.M. provided guidance for xGDBvm’s implementation at iPlant and created the prototype HPC wrapper scripts.

Received November 2, 2015; revised February 29, 2016; accepted March 25, 2016; published March 28, 2016.

REFERENCES

Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E. (2002). The enhanced suffix array and its applications to genome analysis. In Second Workshop on Algorithms in Bioinformatics, R. Guigo and D. Gusfield, eds (Rome: Springer-Verlag), pp. 449–463.

Borodovsky, M., and Lomsadze, A. (2011). Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr. Protoc. Bioinformatics 4: 4.6.1–4.6.10.

Cantarel, B.L., Korf, I., Robb, S.M., Parra, G., Ross, E., Moore, B., Holt, C., Sánchez Alvarado, A., and Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18: 188–196.

Dooley, R., Vaughn, M., Stanzione, D., Terry, S., and Skidmore, E. (2012). Software-as-a-Service: The iPlant Foundation API. In 5th IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) (IEEE).

Foissac, S., Gouzy, J.P., Rombauts, S., Mathé, C., Amselem, J., Sterck, L., Van de Peer, Y., Rouzé, P., and Schiex, T. (2008). Genome annotation in plants and fungi: EuGene as a model platform. Curr. Bioinform. 3: 87–97.

Goecks, J., Nekrutenko, A., and Taylor, J.; Galaxy Team (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11: R86.

Goff, S.A., et al. (2011). The iPlant Collaborative: Cyberinfrastructure for plant biology. Front. Plant Sci. 2: 34.

Gremme, G., Brendel, V., Sparks, M.E., and Kurtz, S. (2005). Engineering a software tool for gene structure prediction in higher organisms. Inf. Softw. Technol. 47: 965–978.

Grigoriev, I.V., et al. (2012). The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res. 40: D26–D32.

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., Salzberg, S.L., and White, O. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654–5666.

Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R., and Wortman, J.R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9: R7.

Hammesfahr, B., Odronitz, F., Hellkamp, M., and Kollmar, M. (2011). diArk 2.0 provides detailed analyses of the ever increasing eukaryotic genome sequencing data. BMC Res. Notes 4: 338.

Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M., and Stanke, M. (2015). BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32: 767–769.

Holt, C., and Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491.

Leroy, P., et al. (2012). TriAnnot: A versatile and high performance pipeline for the automated annotation of plant genomes. Front. Plant Sci. 3: 5.

Mungall, C.J., et al. (2002). An integrated computational pipeline and database to support whole-genome sequence annotation. Genome Biol. 3: 0081.1–0081.11.

Nocq, J., Celton, M., Gendron, P., Lemieux, S., and Wilhelm, B.T. (2013). Harnessing virtual machines to simplify next-generation DNA sequencing analysis. Bioinformatics 29: 2075–2083.

Potter, S.C., Clarke, L., Curwen, V., Keenan, S., Mongin, E., Searle, S.M., Stabenau, A., Storey, R., and Clamp, M. (2004). The Ensembl analysis pipeline. Genome Res. 14: 934–941.

Reddy, T.B.K., Thomas, A.D., Stamatis, D., Bertsch, J., Isbandi, M., Jansson, J., Mallajosyula, J., Pagani, I., Lobos, E.A., and Kyrpides, N.C. (2015). The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 43: D1099–D1106.

Schlueter, S.D., Wilkerson, M.D., Dong, Q., and Brendel, V. (2006). xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features. Genome Biol. 7: R111.

Schlueter, S.D., Wilkerson, M.D., Huala, E., Rhee, S.Y., and Brendel, V. (2005). Community-based gene structure annotation. Trends Plant Sci. 10: 9–14.

Slotte, T., et al. (2013). The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet. 45: 831–835.

Specht, M., Stanke, M., Terashima, M., Naumann-Busch, B., Janssen, I., Höhner, R., Hom, E.F., Liang, C., and Hippler, M. (2011). Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the Chlamydomonas reinhardtii genome. Proteomics 11: 1814–1823.

Standage, D.S., and Brendel, V.P. (2012). ParsEval: parallel comparison and analysis of gene structure annotations. BMC Bioinformatics 13: 187.

Stanke M Keller O Gunduz I Hayes A Waack S andMorgenstern B (2006) AUGUSTUS ab initio prediction of alter-native transcripts Nucleic Acids Res 34 W435ndashW439

Thibaud-Nissen F Souvorov A Murphy T DiCuccio M and Kitts P(2013) Eukaryotic Genome Annotation Pipeline In The NCBI Handbook2nd ed (Bethesda MD National Center for Biotechnology Information)httpwwwncbinlmnihgovbooksNBK169439

Uberbacher, E.C., Hyatt, D., and Shah, M. (2004). GrailEXP and Genome Analysis Pipeline for genome annotation. Curr. Protoc. Hum. Genet. 39: 6.5.1–6.5.15.

Usuka, J., Zhu, W., and Brendel, V. (2000). Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16: 203–211.

Wang, B.B., O’Toole, M., Brendel, V., and Young, N.D. (2008). Cross-species EST alignments reveal novel and conserved alternative splicing events in legumes. BMC Plant Biol. 8: 17.

Wilkerson, M.D., Schlueter, S.D., and Brendel, V. (2006). yrGATE: A web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes. Genome Biol. 7: R58.

Yandell, M., and Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13: 329–342.

854 The Plant Cell

DOI 10.1105/tpc.15.00933; originally published online March 28, 2016; Plant Cell 2016;28;840-854

Jon Duvick, Daniel S. Standage, Nirav Merchant, and Volker P. Brendel
xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud


server operating system, the application software components, along with all requisite software dependencies and configuration settings, all of which are stored (“imaged”) in such a way that they can be copied and launched by means of commonly available virtualization tools and made available to anyone with access to virtual server software such as KVM (http://www.linux-kvm.org) or VirtualBox (https://www.virtualbox.org). VMs have a number of advantages for complex informatics analysis (Nocq et al., 2013), of which the preinstallation of all required software for complex tasks, as well as temporary access to all the computer resources needed for completion of the task, are of most practical value for a typical biologist user. Cloud computing platforms such as OpenStack (https://www.openstack.org) and Docker (https://www.docker.com) offer VM and container-based technologies that can be managed, accessed remotely, and readily deployed on commercial cloud-based services such as Amazon Web Services (https://aws.amazon.com). Government-funded consortia such as the iPlant Collaborative (now CyVerse) (Goff et al., 2011) make such virtual platforms readily accessible to individual users via the internet.

Although most genome researchers are familiar with a wide range of online tools to evaluate sequence data, they will not necessarily know how to put them together and configure them appropriately. Ideally, an annotation platform should have a cohesive graphical user interface (GUI) that guides the user through setup, configuration, parameter setting, and status reporting. Importantly, all setup and processing steps should be managed with data sanity checks (for completeness and format), context-dependent menus, error logging and reporting, and help documentation/tutorials.
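As an illustration of the kind of data sanity check described above, the following minimal Python sketch tests a FASTA input for completeness and format before a workflow step would accept it. It is not taken from xGDBvm; the function name `check_fasta`, the accepted alphabets, and the wording of the messages are assumptions made for this example.

```python
import io

# Assumed alphabets for this sketch: IUPAC nucleotide codes, and amino acid letters plus '*'.
NUCLEOTIDE = set("ACGTUNRYKMSWBDHV")
PROTEIN = set("ABCDEFGHIKLMNPQRSTUVWXYZ*")


def check_fasta(handle, alphabet=NUCLEOTIDE):
    """Return a list of human-readable problems found in a FASTA stream."""
    problems = []
    seen_ids = set()
    current_id = None   # identifier of the record being read (None before the first header)
    current_len = 0     # residues seen so far for that record
    for lineno, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record; it must have had sequence.
            if current_id is not None and current_len == 0:
                problems.append(f"record '{current_id}' has no sequence")
            fields = line[1:].split()
            if not fields:
                problems.append(f"line {lineno}: header has no identifier")
                current_id = ""
            else:
                current_id = fields[0]
                if current_id in seen_ids:
                    problems.append(f"line {lineno}: duplicate identifier '{current_id}'")
                seen_ids.add(current_id)
            current_len = 0
        elif current_id is None:
            problems.append(f"line {lineno}: sequence data before first header")
        else:
            bad = set(line.upper()) - alphabet
            if bad:
                problems.append(f"line {lineno}: unexpected characters {sorted(bad)}")
            current_len += len(line)
    if current_id is None:
        problems.append("no FASTA records found")
    elif current_len == 0:
        problems.append(f"record '{current_id}' has no sequence")
    return problems


if __name__ == "__main__":
    # Hypothetical input with a duplicate identifier and an invalid character.
    sample = io.StringIO(">chr1 example\nACGTNN\n>chr1\nACGX\n")
    for problem in check_fasta(sample):
        print(problem)
```

An empty list means the input passed every check in this sketch; a GUI could show the returned messages next to the offending file instead of letting a later alignment step fail.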

The ability to edit and improve automated annotation should be built in. This means both the ability to add additional data once the workflow has completed and the ability to modify individual annotations, so that the most critical regions of the genome are well annotated.

With variable parameters and source data sets, automated documentation and simple archiving are essential for ensuring repeatability of the genome annotation process.
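One way to picture the automated documentation called for here is a small manifest written alongside each run, recording the parameters and a checksum of every input so that a run can be repeated and its inputs verified later. The sketch below is a hypothetical illustration in Python; the function names, the manifest layout, and the example parameter are assumptions and do not describe how xGDBvm itself logs or archives data.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, parameters):
    """Collect parameters and per-file checksums for one annotation run."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(parameters),
        "inputs": [
            {
                "file": os.path.basename(path),
                "bytes": os.path.getsize(path),
                "sha256": sha256_of(path),
            }
            for path in sorted(input_paths)
        ],
    }


def write_manifest(manifest, path):
    """Write the manifest as indented, key-sorted JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)


def inputs_unchanged(manifest, directory):
    """Return True if every recorded input still has its recorded checksum."""
    return all(
        sha256_of(os.path.join(directory, entry["file"])) == entry["sha256"]
        for entry in manifest["inputs"]
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        genome = os.path.join(workdir, "genome.fa")
        with open(genome, "w", encoding="utf-8") as fh:
            fh.write(">chr1\nACGT\n")
        # "example_parameter" is a placeholder, not a real xGDBvm setting.
        manifest = build_manifest([genome], {"example_parameter": "value"})
        write_manifest(manifest, os.path.join(workdir, "manifest.json"))
        print(inputs_unchanged(manifest, workdir))
```

Storing the manifest with the archived outputs lets a later reader confirm that the same inputs and settings are being used before rerunning an annotation.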

With large genomes and large transcript data sets, computations such as spliced alignment can take days or weeks on a typical lab computer, whereas with access to high-performance computing (HPC) resources the process can be completed in a few hours. Many research facilities have such resources, but their use is complex and not necessarily available to any researcher who might be interested.

Once computation is complete, the annotated genome and its input/output files should be available online, either to a select community (with password access) or to the research community as a whole, thus placing output data and/or community annotation tools in the hands of the target audience in a timely manner.

With the above attributes in mind, we created a self-contained genome annotation platform, xGDBvm, for use by the research community. We report below our initial release of xGDBvm in the iPlant (CyVerse) Atmosphere cloud infrastructure (http://www.iplantcollaborative.org/ci/atmosphere) as an on-demand virtual server for genome annotation that can be adapted for a wide range of research needs.

RESULTS

Overview of xGDBvm

xGDBvm is a Linux-based platform that accepts genomic and transcript and/or protein sequence inputs and creates a genome annotation that can be displayed in the included full-featured genome browser, with separate tracks for genome segments, transcript and protein alignments, gene predictions, and repeat-masked regions (Figure 1). xGDBvm uses a modified and extended version of the xGDB (Extensible Genome Data Broker) Web platform (Schlueter et al., 2006), written in Perl and PHP, along with a Web server, workflow automation scripts, and executables, packaged together as a virtual server and configured for access over HTTP or HTTPS via a GUI. xGDBvm is compact in size, occupying 13 gigabytes (GB) of a typical 20-GB VM root partition. Data inputs/outputs are preferably stored on external volumes mounted to the VM, thus alleviating constraints on VM size.

Computational processes in xGDBvm (Figure 2) are managed by automated, user-configurable workflows, with a built-in option for calls to HPC resources. Optional masking of genome segments is performed using Vmatch (Abouelhoda et al., 2002) based on user-provided masking libraries. Spliced alignments of transcripts and proteins to the genome are computed using GeneSeqer (Usuka et al., 2000) and GenomeThreader (Gremme et al., 2005), respectively. xGDBvm optionally creates gene model predictions using CpGAT (Comprehensive Gene Annotation Tool; http://plantgdb.org/AtGDB/cgi-bin/WebCpGAT.pl), a set of scripts and binaries that integrates spliced alignment data and ab initio gene predictions, along with BLAST similarity filters and alternative structures, to derive a high-quality gene prediction data set. The xGDBvm workflow can also upload precomputed gene predictions from a user-provided GFF3-formatted file. All steps are logged and displayed dynamically during workflow operation. Once the workflow is complete, each feature is displayed as a separate track in a fully featured genome browser, complete with search/download tools and tabular feature views. A quality score assigned to each annotated locus facilitates the identification of low-quality models, which can then be reannotated and curated using the built-in yrGATE annotation tool (Wilkerson et al., 2006). Additional genomes can be configured and created with the same VM, and the user can archive and retrieve single or global data sets. Any data type can be appended or replaced using an “update” feature. The outcome is a rich, editable environment for genome exploration and annotation, accessible locally or remotely on the Web (for feature overview, see Table 1).
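To make the GFF3 input and the idea of flagging low-quality models more concrete, the following Python sketch parses GFF3 feature lines and lists the features whose score falls below a cutoff. It is only an illustration under stated assumptions: it uses the GFF3 score column (column 6) as a stand-in for a quality score, and the function names, the 0.5 threshold, and the choice of mRNA features are hypothetical. It does not reproduce how xGDBvm computes or stores its own quality scores.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start coordinate is greater than end coordinate")
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 attribute values are percent-encoded where needed.
            attributes[key.strip()] = unquote(value.strip())
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": start,
        "end": end,
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


def flag_low_scoring(lines, feature_type="mRNA", threshold=0.5):
    """Return IDs of features of the given type whose score is below the threshold."""
    flagged = []
    for line in lines:
        feature = parse_gff3_line(line)
        if feature is None or feature["type"] != feature_type:
            continue
        # Features without a score ('.') are skipped rather than flagged.
        if feature["score"] is not None and threshold > feature["score"]:
            flagged.append(feature["attributes"].get("ID", "(no ID)"))
    return flagged
```

In a curation setting, the returned identifiers would be the candidates to open in an annotation editor first.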

xGDBvm-iPlant

We implemented xGDBvm as a VM image on iPlant’s Atmosphere cloud platform (https://atmo.iplantcollaborative.org/application), available to registered life sciences researchers (http://www.iplantcollaborative.org/content/acceptable-use-policy). We further customized the VM, taking advantage of iPlant’s data and job execution application programming interfaces (APIs), making xGDBvm a one-stop destination for genome annotation and display. Registered iPlant users can create and configure an xGDBvm instance via the Atmosphere control panel and then access the xGDBvm instance via a Web browser to perform all




Benchmarking xGDBvm






Genome Region



iPlant


Indiana University








Genome Segments

Total Length (bp)





853 134834574 49947 44870 34629 25498 22698 254 28447 26521 558




Public Repository


DISCUSSION









Limitations



Future Directions



METHODS






Benchmarking xGDBvm





Data Access


Supplemental Data








ACKNOWLEDGMENTS







REFERENCES































