VectorBase BRC4 2006

The evolving VectorBase gene build: mixing automated and manual approaches when annotating vector genomes

Daniel Lawson

VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK



Points to cover

• VectorBase species
• Generic GeneBuild (new genomes)
• VectorBase GeneBuild (new developments)
• Influence of manual annotation
• Progress in manual annotation
• Partial GeneBuilds


Aedes aegypti

Anopheles gambiae PEST Annotated

Ixodes scapularis Sequencing

Culex pipiens quinquefasciatus Assembly

Anopheles gambiae M & S form

Pediculus humanus

Glossina morsitans morsitans

Lutzomyia longipalpis

Phlebotomus papatasi

Rhodnius prolixus



Annotation of new genomes

Assembled genome

VectorBase gene predictions

Sequencing centre gene predictions

Merge into canonical set

Protein analysis

Display on genome browser

Release to GenBank/EMBL/DDBJ


VectorBase gene prediction pipeline

Blessed predictions

Community submissions

Manual annotations

Species-specific predictions

Similarity predictions

Transcript based predictions

Ab initio gene predictions

Canonical predictions

(Genewise) (Genewise)

(SNAP) (Exonerate)

(Apollo) (Genewise, Exonerate, Apollo)

Protein family HMMs (Genewise)

ncRNA predictions (Rfam)


VectorBase curation database pipeline for manual/community annotation

Curation warehouse db

Manual annotation (Harvard)



Community annotation (Community representatives)


Chado-XML Chad




Gene build db

Community annotation (in collaboration with Harvard)


Manual annotation progress

                     Protein-coding gene No.

Anopheles gambiae
  AgamP3.3           13,277          261 (2.0%)       667 (5.0%)
  current                           2474 (18.6%)      667* (5.0%)

Aedes aegypti
  AaegL1.1           15,419            0 (0.0%)         0 (0.0%)
  current                              0 (0.0%)       341 (2.2%)
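
The percentages in the progress figures above are consistent with each count being divided by the protein-coding gene number of the corresponding release. A minimal sketch of that arithmetic follows; the function name is hypothetical, and applying the release total to the "current" rows is an assumption that happens to reproduce the figures shown.

```python
def percent_of_gene_set(count: int, gene_total: int) -> float:
    """Share of a gene set covered by `count` genes, as a percentage to one decimal."""
    return round(100.0 * count / gene_total, 1)


# Gene totals and counts are taken from the slide.
AGAMP3_3_GENES = 13277
AAEGL1_1_GENES = 15419

print(percent_of_gene_set(261, AGAMP3_3_GENES))
print(percent_of_gene_set(2474, AGAMP3_3_GENES))
print(percent_of_gene_set(341, AAEGL1_1_GENES))
```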


Manual annotation visualisation


Overview of proposed re-annotation system

Blessed genes

Current gene set


Species-specific gene prediction

New gene build


Updated gene set

Full gene build

Partial Gene build


Comparing new gene builds with the old one

• Use of manual annotation for validation of automated gene build improvements

• Simple statistics (CDS length, intron size, CDS matching TEs)

• BRC annotation metrics
  – Supporting evidence for a gene prediction (citation, expression, orthology)
  – Attachment of Standard Operating Procedures
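
The simple statistics named above can be sketched as follows. This is an illustrative sketch only, not the VectorBase implementation: the function names are hypothetical, coordinates are assumed to be 1-based inclusive on a single sequence, and the transposable element (TE) features are assumed not to overlap one another.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


def cds_length(cds_exons: List[Interval]) -> int:
    """Total number of coding bases across all CDS exons."""
    return sum(end - start + 1 for start, end in cds_exons)


def intron_sizes(cds_exons: List[Interval]) -> List[int]:
    """Sizes of the gaps between consecutive CDS exons, in genomic order."""
    ordered = sorted(cds_exons)
    return [nxt_start - prev_end - 1
            for (_, prev_end), (nxt_start, _) in zip(ordered, ordered[1:])]


def cds_bases_in_te(cds_exons: List[Interval], te_features: List[Interval]) -> int:
    """Number of CDS bases overlapped by TE features (TEs assumed non-overlapping)."""
    total = 0
    for start, end in cds_exons:
        for te_start, te_end in te_features:
            total += max(0, min(end, te_end) - max(start, te_start) + 1)
    return total


# Made-up example gene: two CDS exons and one TE feature.
exons = [(100, 199), (300, 349)]
print(cds_length(exons), intron_sizes(exons), cds_bases_in_te(exons, [(180, 310)]))
```

Comparing the distributions of these values between an old and a new gene build gives a quick check that a new build has not, for example, shortened CDSs or added many TE-derived predictions.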



VectorBase gene prediction pipeline (SOP)

Blessed predictions

Community submissions

Manual annotations

Species-specific predictions

Similarity predictions

Transcript based predictions

Ab initio gene predictions

Canonical Gene set

VB:SOP0001 VB:SOP0002 & SOP0003

VB:SOP0005 VB:SOP004

Protein family HMMs VB:SOP0009

ncRNA predictions VB:SOP0008



Gene build schedules

Full gene build: 4 months

Partial gene build: 1 month

Triggers for re-annotation

• Temporal

• Data

• New data for species

• New genomes

• Re-annotated genomes


3 wise annotators


Merging gene sets

Reduce to single predictions per locus

Compare exon/intron structures

Gene set #1

Gene set #2

Identical structures

Compatible structures

Different structures

Merge/Split structures

Complex

No Map

Add isoform predictions based on EST/Peptide data

Canonical gene set
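
The exon/intron comparison step could be sketched roughly as below. The slide does not define the categories, so the rules here are assumptions made for illustration: "identical" means the same exon coordinates, "compatible" means the same introns but different terminal exon ends, and "no map" means the two predictions do not overlap at all. Merge/split and complex cases involve more than two predictions per locus and are not covered. All names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, same sequence and strand assumed


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Intron intervals implied by an exon list, in genomic order."""
    ordered = sorted(exons)
    return [(prev_end + 1, nxt_start - 1)
            for (_, prev_end), (nxt_start, _) in zip(ordered, ordered[1:])]


def spans_overlap(a: List[Exon], b: List[Exon]) -> bool:
    """True if the overall genomic spans of the two predictions overlap."""
    a_start, a_end = min(s for s, _ in a), max(e for _, e in a)
    b_start, b_end = min(s for s, _ in b), max(e for _, e in b)
    return a_start <= b_end and b_start <= a_end


def classify(a: List[Exon], b: List[Exon]) -> str:
    """Classify a pair of predictions under the assumed rules described above."""
    if not spans_overlap(a, b):
        return "no map"
    if sorted(a) == sorted(b):
        return "identical"
    if introns(a) == introns(b):
        return "compatible"
    return "different"


# Made-up example: a prediction from gene set #1 against three from gene set #2.
set1 = [(100, 200), (300, 400)]
print(classify(set1, [(100, 200), (300, 400)]))
print(classify(set1, [(150, 200), (300, 450)]))
print(classify(set1, [(100, 220), (300, 400)]))
```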