journal club aug04 2015 genemarket

Journal Club GeneMark-ET

Hiroya Morimoto Aug4, 2015

BRAKER1 �  BRAKER1

�  Mario Stanke らが新規構築した遺伝子予測pipeline �  AUGUSTUS, GeneMark-ET が重要な役割を持つ．

�  Notable feature �  Exploit RNA-Seq data

w/o its assembly

BRAKER1 Pipeline

Today’s paper �  GeneMark-ET を発表した論文

Published online 02 July 2014 Nucleic Acids Research, 2014, Vol. 42, No. 15 e119doi: 10.1093/nar/gku557

Integration of mapped RNA-Seq reads into automatictraining of eukaryotic gene finding algorithmAlexandre Lomsadze1, Paul D. Burns1 and Mark Borodovsky1,2,3,*

1Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA, USA 30332,2School of Computational Science and Engineering, Georgia Tech, Atlanta, GA, USA 30332 and 3Department ofBioinformatics, Moscow Institute of Physics and Technology, Moscow, Russia 141700

Received September 22, 2013; Revised June 4, 2014; Accepted June 10, 2014

ABSTRACT

We present a new approach to automatic trainingof a eukaryotic ab initio gene finding algorithm.With the advent of Next-Generation Sequencing, au-tomatic training has become paramount, allowinggenome annotation pipelines to keep pace with thespeed of genome sequencing. Earlier we developedGeneMark-ES, currently the only gene finding algo-rithm for eukaryotic genomes that performs auto-matic training in unsupervised ab initio mode. Thenew algorithm, GeneMark-ET augments GeneMark-ES with a novel method that integrates RNA-Seqread alignments into the self-training procedure. Useof ‘assembled’ RNA-Seq transcripts is far from triv-ial; significant error rate of assembly was revealedin recent assessments. We demonstrated in compu-tational experiments that the proposed method ofincorporation of ‘unassembled’ RNA-Seq reads im-proves the accuracy of gene prediction; particularly,for the 1.3 GB genome of Aedes aegypti the meanvalue of prediction Sensitivity and Specificity at thegene level increased over GeneMark-ES by 24.5%. Inthe current surge of genomic data when the need foraccurate sequence annotation is higher than ever,GeneMark-ET will be a valuable addition to the nar-row arsenal of automatic gene prediction tools.

INTRODUCTION

Accurate ab initio algorithms are indispensable tools forannotation of eukaryotic genomes since they can iden-tify genes not supported by reliable external evidence (e.g.(1–7)). The predictive power of ab initio algorithms is afunction of rational algorithm design as well as optimalassignment of species specific parameters. Current Next-Generation Sequencing (NGS) technologies require geneprediction methods that keep pace with the speed of se-quencing; ab initio algorithms with automatic training offer

this advantage. In this paper, we introduce a new approachto automatic training of a eukaryotic ab initio gene findingalgorithm.

Conventional supervised training techniques are centeredon preparation of expert validated training sets; this te-dious step now takes more time than genome sequencingitself. Earlier on, the genes compiled into training sets werevalidated by alignments of the Expressed Sequence Tags(EST) and full length cDNAs sequences. The advent ofNGS raised expectations that full length transcripts couldbe rapidly and confidently produced by assembling RNA-Seq reads. However, assembly of RNA-Seq reads turnedout to be non-trivial. The RNA-seq Genome AnnotationAssessment Project (RGASP) Consortium (8) comprehen-sively assessed the accuracy of transcript reconstructionfrom RNA-Seq data by several algorithms. The results con-firmed an informal consensus among developers, that as-sembly of transcripts from RNA-Seq reads is prone to fre-quent errors. Thus, mapping assembled transcripts as a fastand accurate approach to gene finding or even to buildingtraining sets for ab initio gene finders has serious feasibil-ity issues. Consequently, expert control is still necessary inthe creation of training sets for the algorithms that employsupervised training.

A supervised gene collection naturally starts from morereadily validated genes and is thus likely to be biased to-wards highly expressed genes, genes with rather long exons,evolutionary conserved ‘core’ genes (9), etc. Unsupervisedtraining (self-training) eliminates expert controlled super-vised training.

The major target of unsupervised training procedure isthe accumulation (over iterations) of a larger and largerset of correctly predicted genes. Finally, the resulting set ofgenes serves as the training set for the final estimation of thealgorithm parameters. Ideally, this method provides greater‘ease’ in generating large training sets in comparison withexpert controlled accumulation of validated genes. Devia-tion to an incorrect convergence point is a major risk factorin unsupervised training. It could result in a training set thatincludes erroneously defined genes or overrepresented sub-sets of genes of certain type; this kind of issue, however, ex-

*To whom correspondence should be addressed. Tel: +1 404 894 8432; Fax: +1 404 894 0519; Email: [email protected]

C⃝ The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), whichpermits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

at (D) Tokyo K

ogyo Daigaku Library on July 31, 2015

http://nar.oxfordjournals.org/D

ownloaded from

GeneMark-ET �  GeneMark-ES

�  What tool is it? �  An ab initio gene predictor

�  Feature �  Unsupervised leaning �  Input されたgenome seq. から，自動的にパラメータを学習して，設定して遺伝子予測を実行してくれる．

�  GeneMark-ET �  Features

�  ‘unassembled RNA-Seq data’ を学習に利用可能 � 基本的にはGeneMark-ES で，unsupervised leaning.

Unsupervised training

Supervised gene collection

Give ‘correct’ direction Validated な遺伝子を利用するため正確

(train set 内の遺伝子は正しい)

Bias の存在 Validated な遺伝子を利用するため，遺伝子長が長く発現量の高い遺伝子を

利用してしまう傾向あり

Huge costs Expert 人材が必要

Train set の構築に時間と費用のコスト増

Accuracy Train set によるbias, error と無縁

Low cost Train set 構築の手間・時間が不要

Data 量が多いほど，その強みが増す．

Unsupervised training �  Unsupervised trainingはbias free かつ高速な手法

Deviation to an incorrect convergence point 共に遺伝子情報に含有されるerror や重複した情報が誤った学習を生む可能性あり．

Error prone Assembly によってと大量のerror 発生

Utilize RNA-Seq data �  Unsupervised training

�  より長いgenome (>300Mb) においては結果が悪化． � より長いintron やIR, そしてrepetitive regions が原因 � 大きなgenome 程,data がfragmented であることも要因．

�  Unsupervised training + unassembled RNA-Seq data �  以下の要素に対して，高いrobust 性を示した．

� 多様な長さで存在するintrons, repeat content, 断片的なgenome assembly

�  GeneMark-ES に対しても高い性能を示した．

Genome size 増加とcontent �  Non-coding regions のgenome 中に占める割合の増加に従い，non-coding element (intron, IR) の平均長は増加するが，coding element (CDS) は依存関係にない �  Genome 中でnon-coding regions の占める割合はgenome

size に依存して変化するため，genome size とnon-coding element の平均長は相関し，coding element はしない，と考えられる．

e119 Nucleic Acids Research, 2014, Vol. 42, No. 15 PAGE 2 OF 8

Figure 1. The dot plot graph depicting average lengths of exons, intronsand intergenic regions against the value of percentage of non-coding DNAin a given genome was made for the five insect genomes used in theGeneMark-ET tests as well as for several other eukaryotic species. Theaverage lengths of intron and intergenic regions correlate with the genomelength while the average length of protein-coding exons (CDS) does notshow dependence on the genome size.

ists for supervised training as well. A chosen strategy of un-supervised training can be assessed on test sets; notably, thetest sets are needed for algorithms using supervised train-ing. Compilation of the test set (relatively small in compar-ison with the training set) does not present an extra effortspecific for unsupervised training.

Identifying the best strategy for unsupervised training isan interesting subject. Earlier we demonstrated feasibility ofeffective unsupervised training for fungal genomes as wellas for compact (<300 Mb) genomes of plants and animals(5,6). Still, we observed that the performance of unsuper-vised training may degrade for large eukaryotic genomes(>300 Mb in size) that have greater average length of intronand intergenic regions as well as large repeat populations.High genome assembly fragmentation may also present dif-ficulty for unsupervised training.

In what follows we show how an unsupervised train-ing procedure can use spliced alignments of ‘unassembled’RNA-Seq reads (rather than assembled transcripts) to im-prove accuracy of parameter estimation and gene predic-tion. The key point in combining two independent methods,ab initio gene prediction and RNA-Seq read mapping, is in-troduction of the notion of ‘anchor splice sites’: sites sup-ported by both ab initio gene prediction and by RNA-Seqread alignment. Contrary to existing training methods thatrely on training sets consisting of complete or almost com-plete gene structures, the new algorithm uses sets of geneelements, exons and introns, supported by anchor splicesites, to form reliable training sets in the iterative cycles ofthe model re-training. The new algorithm was tested ongenomic and transcriptomic data of several insect species,Aedes aegypti, Anopheles gambiae, Anopheles stephensi,Culex quinquefasciatus and Drosophila melanogaster, whichvary significantly by average size of introns and inter-genic regions (Figure 1). We have demonstrated that RNA-Seq support in training boosts GeneMark-ET performance

in gene prediction to a higher level in comparison withGeneMark-ES that uses purely unsupervised training. Theparameter estimation procedure did show robust perfor-mance with respect to variations in the size of the setof mapped introns, repeat content and fragmentation ofgenome assembly. The new method, used stand-alone or aspart of a pipeline, should streamline and accelerate the an-notation process in large genomes while improving the ac-curacy of gene identification.

MATERIALS AND METHODS

Data sets

We downloaded from VectorBase (10) sequences and an-notations of the genomes of four mosquito species A. ae-gypti (11), A. gambiae (12), A. stephensi (GenBank: PR-JNA167914, submitted by Virginia Tech) and C. quinque-fasciatus (13). The annotated euchromatin portion of the D.melanogaster genome (14) was downloaded from FlyBase(15). Chromosomes 4 and X of D. melanogaster and A. gam-biae were excluded. RNA-Seq datasets for the five specieswere obtained from the GenBank repository of short reads(16). Characteristics of genome sequences and RNA-seqdata sets used in the computational experiments are shownin Table 1.

To mask repetitive sequence in the mosquito genomes, weused repeat annotations provided by VectorBase (10). Tomask repeats in D. melanogaster we used RepeatMasker (17)and the D. melanogaster genome specific library of repeatsfrom RepBase (18).

The test sets (Table 2) were prepared as follows. In the fiveinsect genomes we selected genes with annotated CDS and3′ and 5′ UTRs. We excluded particular genes if (i) anno-tated exons overlapped with masked repeats; (ii) annotatedexons (including exons in 3′ and 5′ UTRs) overlapped withother annotated genes; (iii) annotated exons or introns werevery short (<6 nt or <20 nt, respectively); (iv) annotated in-trons were very long (>10 000 nt); (v) annotated exons hadread through stop codons, or splice sites had non-canonical(non GT-AG) dinucleotides; (vi) annotated genes had alter-native isoforms; (vii) protein products had BLAST hits tothe transposable element (TE) protein database; (viii) thebest BLAST hit of the protein product to NR database didnot have E-value better than 0.001. In the tests, the geneprediction programs were run on full genomes, and predic-tions were compared with annotations of the selected set ofgenes.

GeneMark-ET algorithm outline

The input data include assembled genomic sequences andRNA-Seq reads as shown in the diagram of GeneMark-ETalgorithm (Figure 2). Effectively, the use of mapped RNA-Seq reads, the external (extrinsic) evidence, changes the un-supervised training algorithm GeneMark-ES into an algo-rithm with semi-supervised training, GeneMark-ET.

At the first step, RNA-Seq reads are aligned to the ge-nomic sequence using a short read alignment algorithm; inour experiments we used several tools, TopHat (19), True-Sight (20) and UnSplicer (21). Only spliced alignments,those that identify splice junctions in RNA-Seq reads, the

at (D) Tokyo K



ownloaded from

Fig. 1

Materials and Methods 1. Data sets & test sets

2. GeneMark-ET algorithm outline 3. Algorithms in detail

Data sets �  5 insect genomes

�  Fruit fly, D. melanogaster �  4 mosquito sp.:

�  A. aegypti(ネッタイシマカ), A. gambiae(ガンビエハマダラカ), A. stephensi (ステフェンスハマダラカ), C. quinquefasciatus (ネッタイイエカ) PAGE 3 OF 8 Nucleic Acids Research, 2014, Vol. 42, No. 15 e119

Table 1. Characteristics of the five insect genomes and RNA-seq datasets

Species Genome sequence RNA-seq

Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)

Number ofgaps Source Read type

Readcount,(mil-lions)

Aedes aegypti AaegL1 1384 74 871 439 36 200 SRR388682 83 nt, single 27.9Anopheles gambiae AgamP3 273 21 45 207 16 818 SRR520428 85 nt, paired 36.9Anopheles stephensi AsteV1 208 49 11 148 33 018 SRR643416 84 nt, paired 18.0Culexquinquefasciatus

CpipJ1 528 36 288 204 44 351 SRR364516 50 nt, paired 37.2

Drosophilamelanogaster

R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6

Assembly length includes ‘N’ letters. The ‘atcg’ sequence is the genomic sequence left after masking of repeats. The number of gaps includes gaps withknown and unknown length.

Table 2. Numbers of protein coding genes in the genome annotation and in the test set

Species Annotation version Number of genes in Introns in test set

Annotation Test set

Aedes aegypti AaegL1.3 15 998 216 374Anopheles gambiae AgamP3.6 12 810 420 1061Anopheles stephensi AsteV1.0 21 785 317 939Culex quinquefasciatus CpipJ1.3 18 955 360 460Drosophila melanogaster r5.48 13 842 494 790

Spliced alignment

Mapped splice sites

Parameters of - Donor and

acceptor states - Intron state

Heuristic parameters

Genome sequence

RNA-seq reads

Initial HMM parameters

eurist Splice

Mappeial HM

Parameter re-estimation

GeneMark.hmm gene prediction

t

Gene

Figure 2. Diagram of the iterative semi-supervised training ofGeneMark-ET.

two nucleotides that appear together in mRNA after intronsplicing, are of importance for the training algorithm.

The general training logic of GeneMark-ET is similarto that of GeneMark-ES (5). First, using an initially de-fined set of parameters of the hidden semi-Markov model(HSMM) the algorithm predicts protein-coding regions inthe chosen genomic sequence. Second, a subset of the newlypredicted genes and non-coding regions is selected andused for the HSMM parameter re-estimation. Next, theprediction and re-estimation steps are repeated to conver-gence. GeneMark-ET differs from GeneMark-ES in themethod of selection of the more reliably predicted codingand non-coding regions used for parameter re-estimation.In GeneMark-ET, inclusion of a likely protein-coding exonin the training set requires the predicted exon to have at leastone ‘anchor splice site’ (Figure 3). ‘Anchor splice sites’ arethose predicted independently by both methods, by the abinitio one and by RNA-Seq read alignment.

As an exception to the rule, predicted exons >800 nt (evenun-anchored) are selected for estimation of emission prob-abilities of the HSMM protein-coding states. To accountfor possible misplacement of a predicted exon boundary, se-quences of the ‘long’ exons are trimmed (by 60 nt) on bothends.

Parameters of models of translation initiation and termi-nation site are estimated by using ‘anchored’ initial and ter-minal exons, respectively.

Parameters of the non-coding region model are derivedfrom sequences of introns having both splice sites anchoredas well as from predicted intergenic sequences situated be-tween two anchored border (initial or terminal) exons ofadjacent genes.

All the HSMM parameters mentioned above are emis-sion probability parameters. Probabilities of transition be-

at (D) Tokyo K



ownloaded from

Data sets �  Data sources

�  Genome seq. & annotations �  D. mel -> from FlyBase �  Just annotated euchromatin portion of them �  Chr. 4 and X were excluded. �  Repeat masking: use RepeatMasker and D.mel specific library of repeats (RepBase)

�  4 mosquito sp. -> from VectorBase �  Chr. 4 and X were excluded. (A. gambiae) �  Repeat masking: use repeat annotations (VectorBase)

�  RNA-Seq data �  All sp. -> from SRA

Test sets �  Preparation

�  Selected genes with; �  Annotated CDS and both UTRs

�  Excluded particular genes if; �  Exon の一部がrepeat masking された場合． �  Exon (UTR, coding region 問わず) が，他の遺伝子とoverlap する場合．

�  Exon/intron 長が，それぞれ<6nt, <20nt であった場合． �  Intron 長が>10,000nt であった場合． �  Exon でreadthrough が発生している場合． (stop codon を無視してtranslate されている) �  Splice sites がGT-AG 則を守っていない場合． �  遺伝子がalternative isoforms を持つ場合． �  その遺伝子のprotein product が，TE protein DB にBLAST hits を持つ場合．

�  その遺伝子のprotein product が，NRDB に対してe-value < 0.001 のBLAST hits を持たない場合．

Test sets �  Evaluation methods

�  遺伝子予測は，用意したgenome 全体に対して行う． �  評価自体は，test sets に存在する遺伝子についてのみ行い，annotation と比較する．

�  注意: 評価結果は見かけ上よくなる可能性がある．

PAGE 3 OF 8 Nucleic Acids Research, 2014, Vol. 42, No. 15 e119



Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

GeneMark-ET algorithm outline �  Input data

�  Genome seq. �  RNA-Seq reads

�  Spliced alignment tools �  TopHat / TrueSight / UnSplicer

�  HSMM




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

C.intestinalis, C.reinhardtii and T.gondii the above statedrules of admission to a test set did not produce many recordswith multiple genes. Therefore, these test sets containedmostly one validated gene per sequence. The sizes (interms of number of genes) of the test sets are as follows:A.gambiae—144, A.thaliana—1026, C.elegans—183,C.intestinalis—314, C.reinhardtii—43, D.melanogaster—361 and T.gondii—65.

GeneMark.hmm E-3.0 for eukaryotic genomes,supervised model parameterization

The initial GeneMark.hmm algorithm was developed for genefinding in prokaryotic genomes (31) and later was extended foruse of eukaryotic gene models GeneMark.hmm E-1.0(A. Lukashin and M. Borodovsky, unpublished data). Thenext program version GeneMark.hmm E-2.0 (G. Tarasenkoand M. Borodovsky, unpublished data) has been used forannotation of several plant genomes including A.thaliana(30,32) and Oryza sativa (33). Here we describe the latestprogram version, GeneMark.hmm E-3.0.

The statistical model of genomic sequence organizationemployed in the GeneMark.hmm algorithm is a HMM withduration (34) or a hidden semi-Markov model (HSMM). TheHSMM architecture consists of hidden states for initial,internal and terminal exons, introns, intergenic regions andsingle exon genes (Figure 1). It also includes hidden statesfor start site (initiation site), stop site (termination site), anddonor and acceptor splice sites. In what follows, we refer tosuch hidden states as site states.

The site states emit nucleotide sequences of fixed lengthmodeled by positional (inhomogeneous) Markov chains(35,36). The length and parameters of these models are sitetype-dependent and determined from the sets of sequences ofverified sites of a given type. Note that the models forsequences emitted by splice site states are also intronphase-dependent.

The protein-coding states (initial, internal, terminal exonsand single exon gene) emit nucleotide sequences modeled bythe three-periodic inhomogeneous Markov chains (1,37,38).

Parameters of these models are chosen to be tied and areestimated from the sets of annotated protein-coding sequences(Datasets section). Orders of the Markov chains, up to the 5thorder, are chosen depending on the total length of the trainingsequence.

The non-coding states (intron and intergenic region) emitsequences modeled by homogeneous Markov chains(1,37,38). Importantly, the parameters of the intron and inter-genic region models may not be the same. Parameters of theintron models are estimated from the set of annotated intronsequences. Since a set of reliably annotated intergenic regionsis not readily available, parameters of the models of intergenicregions are estimated from the set of direct and reverse com-plement of intron sequences.

Hidden state duration distributions are derived as approx-imations of observed in the training set length distributionsof the sequences associated with a particular hidden state.For exon sequences this approximation is derived in twosteps: (i) averaging the length frequencies over a period ofthree to eliminate the three-periodic component (this step isnot needed for the of intron state duration approximation); and(ii) applying a smoothing algorithm, such as the nearest neigh-bor method (39) to get the final approximation. For adequatederivation of the distribution of duration of the sequence emit-ted by the state corresponding to the single exon gene we haveto overcome a difficulty caused by the small sample effects,such as overfitting, as the set of single exon genes is commonlya rather small fraction of the supervised training set. It turnedout that a reasonable approximation to the single exon genelength distribution is provided by the length distribution ofannotated CDSs of all genes in the training set. Finally, in theabsence of the reliable set of intergenic regions the uniformprobability distribution is used for the duration of intergenicstate.

Duration distributions are characterized by minimum andmaximum values. The maximum duration of a sequence emit-ted from an exon state is set to the maximum ORF lengthobserved in the given genome, while the minimum durationis 3 nt. The minimum and maximum durations of intron andintergenic sequences are set to 20 and 10 000 nt, respectively.

Initiation and termination of the trajectory of the hiddenstates of HSMM is allowed in either intron or intergenicstate. Distribution of the length (duration) of the sequenceemitted by the initial and terminal states differs from theduration distribution defined for the regular (internal) state.This distribution is determined as a convolution of a regulardistribution of the state-specific duration with the uniformdistribution. Note that the initialization and termination hiddenstates are not shown in the HSMM diagram (Figure 1).

Outline of the unsupervised gene findingalgorithm GeneMark.hmm ES-3.0

The algorithm of parallel unsupervised (automatic) trainingand gene prediction (Figure 2) consists of the following steps:(i) all parameters of the HSMM model with reduced architec-ture are initialized (as described below); (ii) GeneMark.hmmE-3.0 is run to determine a genomic sequence parse into ‘cod-ing’ and ‘non-coding’ regions and the input genomic sequenceis labeled with respect to this parse; and (iii) the subsets of theuniformly labeled fragments (selected as described below in

intergenic region

introns

startsite

stopsite

initial exon

single exon gene

donorsites

acceptorsites

internal exons

terminal exon

Figure 1. Diagram of hidden states of the HSMM employed in the eukaryoticGeneMark.hmm (E-3.0); only states emitting sequence of the direct DNAstrand are shown, while the states generating sequence of the complementarystrand (the mirror symmetrical part of the diagram with reversed arrows andhorizontal symmetry line crossing ‘intergenic region’ state) are omitted.

6496 Nucleic Acids Research, 2005, Vol. 33, No. 20

at (D) Tokyo K



ownloaded from

Algorithms in detail �  Spliced alignment

�  Intron を跨いでmRNA がalign された 2 塩基をsplice junction としてidentify する．

�  ‘Anchor splice sites’ �  ab initio (基本的にはGeneMark-ES) 手法と，RNA-Seq

read alignment の両方において共通して予測されたsplice sites のこと．




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from


Figure 3. Selection of elements of training set in GeneMark-ET for the next iteration. The new training set of protein-coding regions is comprised fromexons with at least one ‘anchored splice site’ as well as long exons predicted ab initio (>800 nt).

tween hidden states depend less on the anchored sites. Par-ticularly, probability of transition from state of intergenicregion to state of the border exon of multiple exon gene isestimated from distribution of the predicted structures ofcomplete genes; this data include structures with both an-chored and un-anchored exons.

The first step of training process, assignment of initial al-gorithm parameters, required special attention. Some pa-rameters could be determined better from the start due toavailability of RNA-Seq reads. Among introns mapped byRNA-Seq reads by TrueSight, UnSplicer or TopHat we at-tempted to identify a subset with likely higher confidence.This subset is used to estimate parameters of the initialmodels of donor and acceptor sites as well as the intronlength distribution.

TrueSight or UnSplicer algorithms assign to a mappedintron a numerical score, S, 0 < S < 1, with the meaningsimilar to posterior probability of the predicted splice junc-tion. Introns with scores S > 0.5 were selected into the highconfidence set. In case of UnSplicer score S > 0.5 corre-sponds to likelihood P < 0.05 that intron mapping is er-roneous. In TopHat, the higher is the coverage number (acount of mapped RNA-Seq reads overlapping a predictedsplice junction) the higher, in general, is confidence in thepredicted splice junction and corresponding intron. WhenTopHat is the read aligner, the high confidence set includesintrons supported by more than three aligned reads.

In the first run of the Viterbi algorithm (Figure 2), we usesplice site models and intron length distributions definedfrom the high confidence set of introns, as well as heuristi-cally derived models of protein-coding and non-coding re-gions (22); other HSMM parameters are initialized as un-informed priors (5). After the first iteration, the algorithmuses for training the full set of mapped introns, regardlessof the probabilistic score (or coverage value).

Similar to the training procedure introduced forGeneMark-ES (5) structures of some sub-models areexpanded through iterations, thus, the set of parametersbecomes larger. Particularly, each splice site model changesfrom single frameless models into the set of three phasedsub-models related to the three phases of introns usedin training. Note that if the predicted phase of an intron

changes between iterations, a splice site associated with thisintron will move from the set of sites with the old phase tothe set of sites with the new phase. Also, parameters of thebranch point site model, especially important for fungalgenomes, are not defined until one of the later iterations(6).

The GeneMark-ET training algorithm stops upon con-vergence to the set of predicted genes that does not changein further iterations, or upon completing a pre-defined num-ber of iterations.

Accuracy assessment of GeneMark-ET predictions wasdone on the test sets described above. Sensitivity and Speci-ficity were defined at the levels of splice sites, translation ini-tiation and termination sites, exons and introns, nucleotideand whole gene level. We also added the level of ‘partialgenes’, with the meaning of the ‘5′ end incomplete genes’.This level was introduced because a reliable annotation fortranslation starts is especially difficult to obtain; even in theprokaryotic domain, sets of genes with translation initia-tion sites validated by N-terminal protein sequencing arescarce. Therefore, computation of ’whole gene’ level accu-racy that requires exact match of predicted and annotatedtranslation starts may suffer from uncertainty in the trans-lation start position. To calculate Specificity measures, wecompared annotation of each gene from the training setto predictions in genomic intervals that include the anno-tated genes extended by 300 nt to both sides (see ‘Data sets’section).

RESULTS

GeneMark-ET was trained on genome sequences and setsof RNA-Seq reads available for the fruit fly D. melanogasterand the four mosquito species A. aegypti, A. gambiae, A.stephensi and C. quinquefasciatus (see ‘Data sets’ section,Tables 1 and 3).

The 208 Mb A. stephensi genomic sequence with 33 018gaps (four orders of magnitude larger than number of gapsin D. melanogaster genome) after masking and filteringshort contigs was reduced to 97 Mb (Table 3). The largereduction of A. stephensi sequence data was due to the factthat the fragmented assembly contained many short contigs.

at (D) Tokyo K



ownloaded from

: indicates an ‘anchor’

Algorithms in detail �  Parameter estimation

�  基本的な流れ 1.  初期parameters にてprotein-coding を予測． 2.  新たに予測されたprotein-coding/non-coding regions を利用し，HSMM parameters を更新し，予測．

3.  以下の終了条件(OR)を満たすまで，1 -> 2 を繰り返す． �  遺伝子予測結果が変更されなくなる． �  指定のiteration 回数に到達．




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from


�  Emission probabilities �  基本ルール最小1つのanchor splice site を持つexon のみをlikely protein-coding exon として扱い，estimation に利用．

�  例外 800bp 超のab initio 予測exons は，anchored/un-anchored 問わず，HSMM protein-coding states のemission prob. estimation に利用．




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

60nt 60nt

Predicted coding exon

Region used for emission prob. estimation


�  Emission probabilities �  of TIS and translation termination sites �  ’Anchored’ initial/terminal exon を用いてparameter 推定．

�  of non-coding regions �  Intron -> 2 anchored splice sites に挟まれた領域を利用． �  IR -> 2 anchored border exons に挟まれた領域を利用．

: indicates an ‘anchor’




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

Anchored border exons

Anchored border exons

Candidate for IR

Candidate for an intron Candidate for an intron


�  Transition probabilities �  Transition probabilities はanchored sites とは，それほど関係なく算出される． �  例えば，IR -> exon の遷移を考えるとすると，complete

gene として予測された遺伝子全てを，anchored/un-anchored 問わずに考慮に入れて，その分布を考えるので，anchored exon だけに影響されることはない．




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

OF 8 Nucleic Acids Research, 2014, Vol. 42, No. 15 e119



Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

1st iteration


�  How to exploit RNA-Seq data at each process

High confidence set of introns Spliced alignment にて予測されたsplice sites の内，高信頼性なもののみを利用．

TrueSight, Unsplicer 予測されたsplice sites に与えられたスコアS > 0.5 (0 < S < 1) を利用． TopHat 3 本以上のcoverage によって支持されたsplice sites やintron を利用．

Heuristically derived models of genetic regions

Genome seq. からcodon 出現頻度を推定 <UNDERMENTIONED>

その他のHSMM param. は一様な事前確率を使用

2nd iteration and later

New predictions from the previous iteration

Full set of mapped introns スコアやcov. に関わらず全alignments を使用．

OF 8 Nucleic Acids Research, 2014, Vol. 42, No. 15 e119



Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

appendix 1 (out of 1) �  Heuristically derived models of genetic regions

�  Besemer,J. and Borodovsky,M. (1999) Heuristic approach to deriving models for gene finding より．

Nucleic Acids Research, 1999, Vol. 27, No. 19 3913

Figure 1. (A) Frequency of nucleotide T in three codon positions observed in 17 bacterial genomes shown as a function of global nucleotide T frequency in a given

at (D) Tokyo K

ogyo Daigaku Library on A

ugust 2, 2015http://nar.oxfordjournals.org/

Dow

nloaded from

Genome seq.

protein-‐coding regions 中の positional nucl. freq.

Codon freq. 上記positional nucl. freq. の積より算出．

Modified codon freq.

GC content と特定のa.a 出現頻度に相関有り．

Genome seq. 全体の塩基出現頻度とcodon 中のpositional nucl. freq.

に相関有り

Algorithms in detail �  HSMM の各state に対して存在するmodel は

iteration の進行に従って拡張される． �  Splice site model

�  Branch point site model

A single frame-less model

Set of 3 phased sub-models

No info about BP BP site model

BP info is obtained from the prediction created through iterations

Updated after each iteration after params for this model are defined

Updated after each iteration And allowed to move among 3 sub-models

Algorithms in detail �  Accuracy assessment

�  Metrics �  Sp and Sn at the levels of �  Splice sites, translation initiation/termination sites, exons and

introns ( = elements of gene structures) �  Nucleotide �  Whole gene (described as ‘gene’ in tables) �  ‘partial genes’: 5’ end incomplete genes

�  TIS の予測が非常に難しいため，この指標を導入． �  Annotation の方も，N-term. prot. seq. で全ての遺伝子をvalidate できているわけではない．

�  Notes �  RNA-Seq data はprediction step には用いておらず，

training step に寄与するのみである．




Version

Assemblylength(Mb)

Unknownletters(Mb)

Masked seq(Mb)

‘atcg’ seq(Mb)






R5 120 0.1 9 111 8 SRR042297 75 nt, paired 13.6




Annotation Test set


Spliced alignment

Mapped splice sites




Genome sequence

RNA-seq reads


eurist Splice

Mappeial HM



t

Gene








at (D) Tokyo K



ownloaded from

Results

Training set prep. �  Training sets preparation

�  Size reduction は以下の要素によって発生した． �  Repeat masking �  Filtering of short contigs

�  Training set に用いたdata は元より大幅に減少した． PAGE 5 OF 8 Nucleic Acids Research, 2014, Vol. 42, No. 15 e119

Table 3. Lengths of initial genomic sequence and sequence selected into training process after data pre-processing steps (repeat masking and subsequentfiltering of short contigs); sizes of the initial set of introns mapped by RNA-Seq read aligner (UnSplicer) to the full genome and the set of introns mappedto the reduced genome

SpeciesGenome size

(Mb)Sequence in

training (Mb)Introns mapped

to genomeIntrons intraining % of introns

Aedes aegypti 1384 415 57 684 55 702 96.6Anopheles gambiae 273 201 68 827 59 698 86.7Anopheles stephensi 208 97 28 869 20 418 70.7Culex quinquefasciatus 528 195 57 579 56 621 98.3Drosophila melanogaster 120 97 70 077 56 678 80.9

The A. aegypti and Culex q. genomic sequences with highrepeat content were reduced to 415 and 195 Mb, respectively(Table 3). The A. gambiae and D. melanogaster genomic se-quences after reduction were 201 and 97 Mb respectively.

The predictions made by GeneMark-ET on the whole setof genomic sequences were compared with annotation inthe locations of the genes included in the test sets (see ‘Datasets’ section). The accuracy of predictions was comparedwith the accuracy of predictions made by GeneMark-ES.Note that mapping of RNA-Seq reads was not used directlyto modify the gene predictions. Thus, improvements in ac-curacy of GeneMark-ET over GeneMark-ES should be at-tributed to a more accurate training method delivering bet-ter estimates of algorithm parameters.

In our experiments we observed that for all five species,GeneMark-ET outperformed GeneMark-ES. In particular,GeneMark-ET prediction accuracy measured by Sensitivity(Sn or Recall) and Specificity (Sp or Precision) was higherfor nearly all elements of gene structure, for whole genes,and for ‘partial genes’.

At the internal exon level GeneMark-ET performed uni-formly better than GeneMark-ES. However, the increase inaccuracy varied from rather small 0.5% in Sp and 6.0% inSp for D. melanogaster to 22.4% in Sn and 15.4% in Sp forA. aegypti (Table 4).

A similar trend was observed for both whole gene leveland ‘partial gene’ levels (Table 4). For D. melanogaster, theGeneMark-ET accuracy increased 7.3% in Sn and 5.2% inSp over GeneMark-ES. For the largest mosquito genomeA. aegypti the GeneMark-ET predictions improved at the‘partial gene’ level by 27.8% in Sn and by 22.9% in Sp. Forthe next largest genome, C. quinquefasciatus, the increasenumbers were 18.0 and 16.7% respectively, while for thetwo smaller Anopheles genomes the increases at the ‘par-tial gene’ level were below 10%. The absolute Sn and Sp fig-ures at gene and ‘partial gene’ levels are the lowest for eachspecies (Table 4). These two measures are the most stringentaccuracy definitions that require exact prediction of all oralmost all the elements of exon–intron structure.

In all the five insect genomes considered here, trans-lation termination sites were predicted more accuratelythan translation initiation sites by both GeneMark-ES andGeneMark-ET. Difficulty in translation initiation start pre-diction is related to the presence of alternative in-frame andout-of-frame ATG codons. Also, initial exons are on aver-age shorter than terminal exons and are more difficult toidentify. Finally, the annotation of translation starts is likelyto be the least reliable element in the annotated exon–intronstructure. For instance, most of the translation starts in D.

melanogaster genome were annotated by ‘the longest ORFin transcript’ rule (http://flybase.org); only a few gene startswere verified by protein N-terminal sequencing.

The accuracy figures for GeneMark-ET shown in Table4 are the results of algorithm runs utilizing alignments ofRNA-Seq reads made by UnSplicer (21). UnSplicer is oneof the few methods of choice for RNA-Seq read alignmentand splice junction detection. We previously showed (21)that for several species UnSplicer produced a higher accu-racy of prediction of splice junctions with respect to genomeannotation compared to other methods. This higher accu-racy of mapping introns is likely related to reduction in thenumber of false positives caused by splicing noise and otherfactors (23,24). However, the results of the GeneMark-ETruns on D. melanogaster and A. gambiae genomic datawith RNA-Seq read mapping done by TopHat or True-Sight (Supplementary Table S1) show that performance ofGeneMark-ET is robust with respect to replacing UnSplicerby TopHat or TrueSight. This observation indicates that useof ‘anchored splice sites’ by GeneMark-ET helps filter outinstances of false positive splice junction predictions.

To assess dependence of the GeneMark-ET performanceon the size of the set of mapped introns we made runs of thenew tool on D. melanogaster sequence data with mappedintron sets comprising 75, 50 and 25% of the initial set of70,077 mapped introns (Table 3). Interestingly, the accu-racy of gene prediction did not change noticeably in theseexperiments (data not shown), thus, the algorithm is ro-bust with respect to the changing the number of mappedintrons within a wide range. Obviously, if the initial num-ber of mapped introns is too low, GeneMark-ET should beautomatically switched to GeneMark-ES mode (current de-fault threshold for the switch is 1000 introns).

We analyzed the dependence of mean values of inter-nal exon Sn and Sp on iteration index for D. melanogasterand A. aegypti genomes for both GeneMark-ES andGeneMark-ET (Figure 4). The GeneMark-ET initialparameterization integrating information from mappedRNA-Seq reads improved accuracy of predictions in thefirst iteration by 55–60% in comparison with GeneMark-ES. For D. melanogaster, further iterations reduced the largeinitial gap in accuracy down to 4%. In contrast, for thelarge A. aegypti genome, although the gap was reducedwith iterations, the accuracy of GeneMark-ET at conver-gence remained almost 20% higher than one of GeneMark-ES. Also, GeneMark-ET reached convergence 2–3 itera-tions earlier (Figure 4). The reduction in number of itera-tions was observed for the other three genomes as well (datanot shown).

at (D) Tokyo K



ownloaded from

Due to … A. Stephensi -‐> fragmented assembly contining many short contigs A. aegypti, Culex q. -‐> high repeat content

�  Assessment of prediction accuracy �  GeneMark-ET はGeneMark-ES に対して，全sp. において圧倒した． � ほとんどの項目において，上回っている． e119 Nucleic Acids Research, 2014, Vol. 42, No. 15 PAGE 6 OF 8

Table 4. Assessment of gene prediction accuracy of GeneMark-ES (ES) and GeneMark-ET (ET) gene finders using unsupervised (genomic based) andsemi-supervised (genomic and transcriptomic based) training, respectively

D. melanogaster A. aegypti A. gambiae A. stephensi Culex q.

ES ET ES ET ES ET ES ET ES ET

Internal exon Sn 86.7 87.2 69.3 91.7 77.6 80.4 82.7 85.1 77.4 81.8Sp 76.9 82.9 60.7 75.9 70.3 78.6 76.5 77.0 54.7 65.7

Intron Sn 82.6 84.8 67.9 89.6 77.6 81.0 85.2 88.1 70.2 81.1Sp 75.3 79.2 64.6 80.3 73.4 80.5 79.4 81.7 59.8 72.7

Donor site Sn 85.3 87.0 74.6 92.8 81.9 84.1 88.2 90.4 74.3 83.5Sp 84.5 86.5 76.2 86.8 82.9 88.1 87.3 88.1 74.3 80.7

Acceptor site Sn 86.2 88.2 74.3 94.1 83.0 86.0 90.7 92.8 83.9 88.7Sp 85.5 87.0 79.0 89.6 83.6 88.9 87.7 89.2 78.0 84.6

Initiation site Sn 71.0 75.1 62.5 79.6 63.8 68.1 65.0 66.9 60.8 76.7Sp 83.1 81.5 77.1 83.9 80.0 79.9 73.6 76.3 77.4 85.7

Termination site Sn 77.3 84.2 68.1 88.0 72.9 81.0 83.0 84.9 78.9 82.8Sp 90.7 90.0 91.3 96.0 89.7 91.6 86.5 92.4 89.3 90.9

Nucleotide Sn 91.5 92.1 87.0 98.1 91.4 92.9 97.0 97.3 93.9 94.4Sp 98.3 97.4 95.2 96.2 98.6 98.8 98.5 98.7 92.0 93.0

Gene Sn 57.9 63.6 40.3 66.7 43.8 53.1 43.2 48.6 46.1 65.0Sp 57.3 61.0 42.6 64.3 44.0 53.0 39.9 47.0 44.3 62.6

Partial gene Sn 59.9 67.2 41.2 69.0 46.2 56.0 48.6 54.3 48.1 66.1Sp 59.3 64.5 43.6 66.5 46.4 55.8 44.9 52.4 46.1 63.6

Bold font highlights the higher accuracy value in a given category and given species. Partial gene level accuracy is computed without taking into accounta difference in annotation and prediction of translation starts.Spliced alignments for GeneMark-ET were produced by UnSplicer.

Figure 4. Observed dynamics of change in iterations of the mean of Sn and Sp internal exon prediction values for the GeneMark-ET and GeneMark-ESalgorithms in cases of Drosophila melanogaster (A) and Anopheles aegypti (B) genomes.

The precision of GeneMark-ET in selecting ‘anchored in-trons’ is illustrated by the following statistics. Only 34 of an-chored introns out of 12,554 identified by GeneMark-ET inthe first iteration of training do not match D. melanogasterannotation version 48 (Supplementary Figure S1).

Information on mapped RNA-Seq reads, besides use intraining, could be used directly in prediction steps to makethe parse Viterbi algorithm fitting the mapped introns ineach iteration. We made an extended version of GeneMark-ET where this approach was implemented in all iterationsbut the last one. Interestingly, in computational experimentson the D. melanogaster genome and the RNA-Seq set thismodification did not produce noticeable change in accuracyof predictions on the test set (data not shown).

DISCUSSION

Annotators of novel eukaryotic genomes frequently use apipeline, MAKER2 (25) that includes the three gene pre-diction tools, Augustus, GeneMark-ES and SNAP that in-dependenly generate gene predictions. Training sets compi-lation for SNAP is aided by mapping of assembled tran-scripts as well as proteins from protein databases. Alterna-tively, GeneMark-ES could provide the initial gene models,the lynchpins for the training of SNAP and Augustus. While‘unsupervised’ training is critically important in this kind ofpipeline, many large genomes present difficulties for unsu-pervised training (e.g. genomes with inhomogeneous com-position, or genomes populated with repeats and transpos-able elements). Assembly of short NGS reads into contigs is

at (D) Tokyo K



ownloaded from

�  Assessment of prediction accuracy �  Internal exon level では，全sp. においてET が勝っていたが，評価値向上度にはvariation が見られた．

�  同様の傾向がwhole gene level, partial gene level でも観察された．

�  Mosquito の中で，genome size の大きいsp. は大幅な向上を，小さいsp. は10%未満の向上が見られた．

　　 D. mel A. aegypti

Internal exon Sn +0.4 +22.4 Sp +6.0 +15.4

Gene Sn +5.7 +26.4 Sp +3.7 +21.7

Partial gene Sn +7.3 +27.8 Sp +5.2 +22.9

D. mel A. aegypti Culex q. the others Genome size (seq in training) 1384 (415) 273 (201) 528 (195) relatively small

Partial gene Sn +7.3 +27.8 +18.0 +(<10) Sp +5.2 +22.9 +17.5

�  Translation initiation & termination sites �  後者の方がより正確に予測されていた． (ET, ES 双方において)

�  In-frame and out-of-frame ATG codons の存在に依る． �  Initial exon はterminal exon より短く，予測しがたい． �  DB のTIS が’longest ORF in transcript’ rule に基づいて決定されたものがほとんどであり，正答しづらい． �  ほとんどがprotien N-terinal sequencing によってvalidate されていないという状況．

�  ‘Anchored splice sites’ はFP splice site prediction の予測結果への影響減少に有効である． �  Spliced alignment の段階では，FP が少ないために，

UnSplicer が，TopHat とTrueSight より正解セットに合致するsplice site 予測結果を出していた．

�  しかし，それぞれの結果を’anchored splice sites’ として利用し，GeneMark-ET で予測すると，後者2つもUnSplicer のものと遜色ない結果を示した．(mentioned in the next page)

�  よって，’anchored splice sites’ の利用は，FP の影響を除外する効果があると判明した．

� どのsplice aligner を使用した場合においても，ほぼ優劣のない予測結果が得られた．

with UnSplicer with TrueSight with TopHat with UnSplicer with TrueSight with TopHatInternal exon Sn 87.2 86.5 87.2 80.4 80.3 80.7

Sp 82.9 83.0 81.6 78.6 78.3 79.2Intron Sn 84.8 83.8 84.9 81.0 81.1 81.5

Sp 79.2 79.0 78.2 80.5 80.4 81.2Donor site Sn 87.0 86.3 86.9 84.1 84.2 84.4

Sp 86.5 86.9 85.9 88.1 88.1 88.5Acceptor site Sn 88.2 87.2 88.4 86.0 86.1 86.4

Sp 87.0 86.9 86.4 88.9 88.8 89.6Initiation site Sn 75.1 74.6 74.6 68.1 68.1 68.3

Sp 81.5 81.6 82.0 79.9 79.9 79.5Termination site Sn 84.2 83.6 83.0 81.0 81.0 81.9

Sp 90.0 89.8 89.3 91.6 91.4 92.5Nucleotide Sn 92.1 92.0 92.1 92.9 93.0 93.0

Sp 97.4 97.4 97.3 98.8 98.8 98.8Gene Sn 63.6 62.6 62.6 53.1 52.9 54.3

Sp 61.0 60.2 59.7 53.0 52.6 54.0

D. melanogaster A. gambiae

�  Dependence on the size of the set of mapped introns �  Data sets

�  D. mel data set の75, 50, 25% 分の数のmapped intron を抽出したdata set を作成し，予測精度を測定

�  Results � どのdata set においても予測精度に特筆すべき変化のない，高いrobust 性を示した．

�  Note �  Intron 数が1,000 を下回ると，GeneMark-ES としてrun.

%mapped introns 100 75 50 25 #mapped introns 70,077 52,558 35,039 17,519

� 予測精度とRNA-Seq data の関係 �  GeneMark-ES 比で，initial iteration において55-60% の予測精度向上を確認．

�  (Internal exon Sn + Sp )/2 値がiteration に従って向上． �  ET-ES 間のgap がiteration 進行に従って縮まるが，

convergence 段階においてもET の方が精度が良い．(ただ，sp. によって最終的なgap の開きが異なった．)

�  Convergence に達するのが，ET は早い．(この現象は全sp. において確認された．)






Intron Sn 82.6 84.8 67.9 89.6 77.6 81.0 85.2 88.1 70.2 81.1Sp 75.3 79.2 64.6 80.3 73.4 80.5 79.4 81.7 59.8 72.7

Donor site Sn 85.3 87.0 74.6 92.8 81.9 84.1 88.2 90.4 74.3 83.5Sp 84.5 86.5 76.2 86.8 82.9 88.1 87.3 88.1 74.3 80.7




Nucleotide Sn 91.5 92.1 87.0 98.1 91.4 92.9 97.0 97.3 93.9 94.4Sp 98.3 97.4 95.2 96.2 98.6 98.8 98.5 98.7 92.0 93.0

Gene Sn 57.9 63.6 40.3 66.7 43.8 53.1 43.2 48.6 46.1 65.0Sp 57.3 61.0 42.6 64.3 44.0 53.0 39.9 47.0 44.3 62.6






DISCUSSION


at (D) Tokyo K



ownloaded from

Gap of 4% Gap of almost 20%

�  ‘Anchored introns’ 抽出基準が適切であり，ab initio 予測精度に貢献している． �  FlyBase (正解セット), 1st iteration 時点でのGeneMark-ET 予測結果, UnSplicer 使用でのmapped introns 間でのintron 比較．

99.7% of ‘anchored introns’ was ‘correct’

*予測対象ではない，UTR 中のintrons を除く． *さらに，alternatively spliced isoforms も考慮しない．

�  Prediction steps におけるRNA-Seq data 利用 �  RNA-Seq data が，training step で予測精度に寄与していることが明らかになった．

�  RNA-Seq data を直接prediction step で用いて，Viterbi を誘導すれば，より精度を向上させる可能性がある．

�  しかし，我々の実験では，目を見張る向上が見られなかった． �  Last iteration を除く，全iteration のprediction steps にて

RNA-Seq data を利用．

Discussion

�  ‘Unsupervised’ training は他pipeline においても重要． �  新規生物種遺伝子予測に用いるpipeline では，

unsupervised training を行うことの出来る ab initio 予測が重要なロールを果たす． � 例えば，MAKER2 ではGeneMark-ES が用いられ，

Augustus, SNAP にtraining set を供給している．

Augustus SNAP

Initial GS

GeneMark-ES

Genome seq.

Input

Output

Provided for the training

�  GeneMark-ET は予測困難とされるlarge genome への対応に，RNA-Seq data を新たな方法で用いることで成功．

Old-fashioned GeneMark-ET

Collection of gene structure elements (e.g. anchored exons)

Transcriptome RNA-Seq mapping information

Transcriptome によってcover された遺伝子領域だけでtraining を行い，遺伝子予測を効率的に実行するためにRNA-Seq data を利用．

RNA-Seq reads のassembly は行わず，遺伝子を構成するelement 単位の情報を利用．

Large genomes 多量のrepeat content とassembly の不完全さによって，train に必要な完全/完全に近い遺伝子構造を取得し難い．

完全な遺伝子構造に拘らないので RNA-Seq data から得られる情報量が増大し， training に好影響．

High repeat content rate やfragmentation of genome assembly に負けないrobust な遺伝子予測が可能に． (A. stephensi test set)

正しくassembly されたtranscript のmapping info を用いる．(ただ，ほとんどのtranscript にerror が発生する)

得られる情報量が少量

Genome character and prediction �  Sample genomes

�  Insect genome はmammal やplant genomes より遺伝子密度が低い．(よって難しい?)

�  Genome size and prediction accuracy �  Compact genome (D. mel)

� 十分に高いレベルの予測精度に達した． �  Large genomes (A. argypti, Culex q.)

�  Purely unsupervised であるES では劇的に予測精度低下 �  ET では，特にgene, partial gene level において，Sn, Sp で共に，18-25% の向上を見た．






Intron Sn 82.6 84.8 67.9 89.6 77.6 81.0 85.2 88.1 70.2 81.1Sp 75.3 79.2 64.6 80.3 73.4 80.5 79.4 81.7 59.8 72.7

Donor site Sn 85.3 87.0 74.6 92.8 81.9 84.1 88.2 90.4 74.3 83.5Sp 84.5 86.5 76.2 86.8 82.9 88.1 87.3 88.1 74.3 80.7




Nucleotide Sn 91.5 92.1 87.0 98.1 91.4 92.9 97.0 97.3 93.9 94.4Sp 98.3 97.4 95.2 96.2 98.6 98.8 98.5 98.7 92.0 93.0

Gene Sn 57.9 63.6 40.3 66.7 43.8 53.1 43.2 48.6 46.1 65.0Sp 57.3 61.0 42.6 64.3 44.0 53.0 39.9 47.0 44.3 62.6






DISCUSSION


at (D) Tokyo K



ownloaded from

�  Repeat masking はよりfine-tuned なものが必要． �  Repeat masking は，TE のような’host’ のものでない遺伝子によって遺伝子予測がbias されることを防ぐ．

�  かなりrare event だが，TE またはその一部が，’host’ の遺伝子中に挿入されていることがある．

�  Repeat masking によって，予測精度が向上した． �  そして，上昇率はgenome size が大きいほど顕著．

�  Ex.1) A. gambiae �  GeneMark-ES training にて, internal exon 認識率が5% 上昇．

�  Ex.2) A. aegypti (as a larger genome) �  同認識率が15% 程度上昇した．

�  GeneMark-ET においては，training 前にrepeat masking を行うことは，それほど重要ではない． �  Training に関係しているparameters はrepeat masking に依存しない． �  ‘Anchor’ するには，RNA-Seq のsupport が必要

-> RNA-Seq には，’host’ transcriptome 情報のみが含まれるため，masked/unmasked genome seq. に依存なし．

�  Ex.) Intron related states でのemission prob. など． �  初期段階で情報のない領域に関するparameters は，

repeat masking に影響される． �  Non-coding regions での予測モデルに関するものなど．

�  ‘Anchored splice sites’ を用いることは，危険なtranscript assembly を回避する事ができる． �  Assembled transcripts でtraining set を作製するならば，多くのtranscript assembly にerror が存在することは無視できない． � 最高のtool の一つであるCufflinks ですら，Human では約60%，D. mel では約50% の，C. elegans では約40% のtranscript assembly がerror を含むと報告有り．

�  その上，そのerror がset 自体の深刻なレベルの崩壊を招く危険性すら存在する．

�  ‘Anchored splice sites’ は， �  unbiased なparameters を導くことが可能．

� どのようなsource であっても平等に扱うことが可能． �  RNA-Seq data のdata 量に関わらない，robustness を示した． �  Mapping 段階でのFP を除去することが出来る． �  UTRs 内のintron を区別することが出来る．

�  Running time がES に比較し，ET は減少した． �  早期のconvergence 到達によって，iteration 回数が減少したため．

�  GeneMark-ET の実行時間 �  3 hrs (= 5x Viterbi algorithm)

�  Single 3GHz CPU �  数十分

� マルチコア? �  その他にRNA-Seq alignment に時間が必要ではある．

�  Unsupervised training は，validated genes が少数であるほど，supervised training に対して，速度と十分な予測精度によって，優位性を発揮する．

� 当然，validated genes を何ヶ月も何年もかけて増加させれば，supervised training の方が正確さを増すであろうことは言うまでもない．

Conclusion

Conclusion �  Semi-supervised training をHSMM parameter 推定に用いるGeneMark-ET を開発した．

�  GeneMark-ET の特徴は，unassembled RNA-Seq reads を用いること．Alignment から， ’anchored splice sites’ 情報を抽出してtraining に利用する．

� 結果，GeneMark-ET は，高速かつ十分な予測精度を持ち，様々な問題を持つlarge genomes にも対応可能となった．