The BARCODE Data Standard: CBOL’s Partnership with the International Nucleotide Sequence Database Collaboration (INSDC) David E. Schindel, Executive Secretary National Museum of Natural History Smithsonian Institution [email protected] ; http://www.barcoding.si.edu 202/633-0812; fax 202/633-2938

Upload: patrick-wilkins

Post on 18-Jan-2016


Page 1:

The BARCODE Data Standard: CBOL’s Partnership with the International Nucleotide Sequence Database Collaboration (INSDC)

David E. Schindel, Executive Secretary

National Museum of Natural History

Smithsonian Institution

[email protected]; http://www.barcoding.si.edu

202/633-0812; fax 202/633-2938

Page 2:

Infrastructure of Taxonomy: Fragmented, Disconnected

• Collections and databases of specimens

• Seedbanks, culture/cell line collections

• Compilations of taxonomic names

• Floristic and faunistic surveys/inventories

• Monographs, Taxonomic revisions

• Data repositories (gene sequences, characters, images, trees)

• The (undigitized) Taxonomic Literature

Page 3:

Linking Logical Categories (1): Specimens, Names, Opinions

Journal Publication

Species Name

Voucher Specimen

??

Page 4:

Linking Logical Categories (2): Naming and defining species

Journal Publication

Species Name

Voucher Specimen

Holotype specimens

Page 5:

Linking Logical Categories (3): Establishing species boundaries

Journal Publication

Species Name

Voucher Specimen

??

Species concept beyond holotype

- Paratype series

- Typological versus population thinking

- Genetic lineages

- BSC (Biological Species Concept; hard to apply)

Page 6:

Linking Logical Categories (4): Interpreting species boundaries

Journal Publication

Species Name

Voucher Specimen

??

Other assigned specimens:

• Species philosophy of original author

• Interpretation of user

Page 7:

Databases of Names, Specimens, Species Distributions

Journal Publication

Species Name

Voucher Specimen

Authority files of taxonomic names

Museum databases of associated data

Databases of species occurrences and distribution (OBIS)

Page 8:

DNA Barcodes: A Key Variable for Biodiversity Informatics

Journal Publication

Species Name

Voucher Specimen

Barcode Sequence

Authority files of taxonomic names

Museum databases of associated data

Databases of species occurrences and distribution (OBIS)

Page 9:

CBOL’s Working Groups

• Database: Designing/constructing the Barcode Section of GenBank

• DNA: Protocols for formalin-fixed and old museum specimens; Producing LIMS for dissemination

• Data Analysis: Beyond phenetic methods; population genetics perspective

• (Plants: Initiated discussions of plant barcode gene region(s))

Page 10:

BARCODE Data Standards

• Consultations with GenBank, ITIS, museum database developers, GBIF, ISIS, from 2004

• Consensus results of Front Royal meeting
– GBIF, ITIS, GRIN
– NBII, Species2000, IPNI
– ICZN, ZooRecord, OBIS

• GenBank proposed the standard to the International Nucleotide Sequence Database Collaboration (EMBL, DDBJ)

• Approved by CBOL and INSDC mid-2005

Page 11:

Reserved Keyword “BARCODE”

• GenBank reviews records against standard

• Adds keyword “BARCODE” in annotation field

• Can be removed by CBOL

Page 12:

Requirements

• Species name selected from authority

• Sequence from COI or other barcode region approved by CBOL

• Structured link to voucher specimen

• Online access to metadata

• Trace files and quality scores

• Primer sequences and names

• Minimum sequence length (500bp for COI)

• Geographic locality
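
The requirements listed on this slide can be read as a checklist applied to each candidate record. The following is a minimal sketch of such a check, not GenBank's actual review procedure. The `BarcodeRecord` class, its field names, and the `missing_requirements` helper are hypothetical. Only the 500 bp minimum for COI comes from the slide, and the "online access to metadata" requirement is not modeled.

```python
from dataclasses import dataclass, field

# Minimum lengths per approved region. Only the COI value is given on the
# slide; any other region would have to be added once approved by CBOL.
MIN_LENGTH_BP = {"COI": 500}


@dataclass
class BarcodeRecord:
    """Hypothetical container for the fields named in the requirements."""
    species_name: str
    gene_region: str
    sequence: str
    voucher_id: str
    locality: str
    primers: list = field(default_factory=list)  # (name, sequence) pairs
    has_trace_files: bool = False


def missing_requirements(rec):
    """Return the unmet requirements; an empty list means every check passed."""
    problems = []
    if not rec.species_name.strip():
        problems.append("species name")
    min_len = MIN_LENGTH_BP.get(rec.gene_region)
    if min_len is None:
        problems.append("approved barcode region")
    elif len(rec.sequence) < min_len:
        problems.append("minimum sequence length")
    if not rec.voucher_id.strip():
        problems.append("structured voucher link")
    if not rec.has_trace_files:
        problems.append("trace files and quality scores")
    if not rec.primers:
        problems.append("primer sequences and names")
    if not rec.locality.strip():
        problems.append("geographic locality")
    return problems
```

Returning a list of problems rather than a single pass/fail value lets a submitter see everything that needs fixing in one pass.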

Page 13:

Recommended fields, added to INSDC at CBOL’s request

• Latitude and longitude

• Name of the identifier

• Name of the collector

• Date of collection
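
As an illustration of how these four fields might appear in a sequence record, the sketch below renders them as source-feature qualifiers. The qualifier names (`/lat_lon`, `/identified_by`, `/collected_by`, `/collection_date`) and the value formats are not stated on the slide; they are assumed here from the INSDC feature table and should be checked against the current definition. The function names are hypothetical.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_lat_lon(lat, lon):
    """Render signed decimal degrees as 'd.dd N|S d.dd E|W'."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.2f} {ns} {abs(lon):.2f} {ew}"


def format_collection_date(d):
    """Render a date as DD-Mon-YYYY with an English month abbreviation."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"


def source_qualifiers(lat, lon, identified_by, collected_by, collected_on):
    """Return the four recommended fields as qualifier lines."""
    return [
        f'/lat_lon="{format_lat_lon(lat, lon)}"',
        f'/identified_by="{identified_by}"',
        f'/collected_by="{collected_by}"',
        f'/collection_date="{format_collection_date(collected_on)}"',
    ]
```

The month table is spelled out instead of using `strftime("%b")` so that the output does not depend on the locale of the machine running the code.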

Page 14:

New Data Fields

Latitude/Longitude

Collection date

Collector’s name

Identifier’s name

Page 15:

BARCODE Keyword in GenBank

Page 16:

BARCODE Records in INSDC

[Diagram. Core elements: Barcode Sequence; Voucher Specimen; Species Name; Specimen Metadata; Literature (link to content or citation)]

[Linked resources shown: Indices (Catalogue of Life, GBIF/ECAT); Nomenclators (Zoo Record, IPNI, NameBank); Publication links (New species); Databases (Provisional sp.); Georeference; Habitat; Character sets; Images; Behavior; Other genes; Trace files; Primers; Other Databases (Phylogenetic, Pop’n Genetics, Ecological)]

Page 17:

Structured link to Vouchers

[Diagram. Core elements: Barcode Sequence; Voucher Specimen; Species Name; Specimen Metadata; Literature (link to content or citation)]

[Linked resources shown: Indices (Catalogue of Life, GBIF/ECAT); Nomenclators (Zoo Record, IPNI, NameBank); Publication links (New species); Databases (Provisional sp.); Georeference; Habitat; Character sets; Images; Behavior; Other genes; Trace files; Primers; Other Databases (Phylogenetic, Pop’n Genetics, Ecological)]

Page 18:

What constitutes a voucher?

• Long-term reference tied to BARCODE

• Corroborates the species identification

• Provides additional tissue

• CBOL relies on community decisions:
– Full specimen?
– Parts for morphologic features (e.g., feather?)
– Frozen tissue?
– E-Vouchers for large specimens, destructive samples, catch-and-release?

Page 19:

Where’s the voucher?

Page 20:

Linking to Vouchers

Structured Voucher IDs

Page 21:

Voucher Specimen ID

• Based on Darwin Core

• Eventually will be replaced by GUID

• Triplet:

Institution Acronym : Collection : Specimen #

NMNH : FISH : 123456

• CBOL, GBIF and NCBI discussing global registry of:
– Institutional acronyms
– Collection codes
– “Pre-accession” specimen IDs
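
The triplet format on this slide lends itself to a small parser. The sketch below is illustrative only: the `VoucherID` type and both function names are hypothetical, and it assumes that a colon never occurs inside any of the three parts, which the slide does not state.

```python
from typing import NamedTuple


class VoucherID(NamedTuple):
    """The three parts of a structured voucher ID."""
    institution: str
    collection: str
    specimen: str


def parse_voucher_id(text):
    """Split 'Institution : Collection : Specimen #' into its three parts."""
    parts = [p.strip() for p in text.split(":")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected institution:collection:specimen, got {text!r}")
    return VoucherID(*parts)


def format_voucher_id(voucher):
    """Join the parts back into a compact colon-separated string."""
    return ":".join(voucher)


voucher = parse_voucher_id("NMNH : FISH : 123456")
print(voucher.institution, voucher.collection, voucher.specimen)
```

Rejecting anything other than exactly three non-empty parts keeps free-text voucher notes from being mistaken for structured IDs, which is the point of a global registry of acronyms and collection codes.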

Page 22:

Link to Species Names

[Diagram. Core elements: Barcode Sequence; Voucher Specimen; Species Name; Specimen Metadata; Literature (link to content or citation)]

[Linked resources shown: Georeference; Habitat; Character sets; Images; Behavior; Other genes; Trace files; Primers; Other Databases (Phylogenetic, Pop’n Genetics, Ecological); Databases (Provisional sp.); Indices (Catalogue of Life, GBIF/ECAT); Nomenclators (Zoo Record, IPNI, NameBank); Publication links (New species)]

Page 23:

Species names in INSDC

Page 24:

NCBI Taxonomy Browser: The good, the bad, and the ugly

• Species names provided by submitters

• Checked against compilations

• Linkout to Catalogue of Life, other sources

• Names not found added to Taxonomy Browser

• Submitters informed of errors but not forced to make corrections

Page 25:

NCBI Taxonomy Browser

Page 26:

NCBI Taxonomy Browser: Some names have no other source

Page 27:

Other names linked to GBIF and Catalogue of Life…

Page 28:

…and primary data source

Page 29:

Authoritative Species Lists

• Catalogue of Life

• Species lists compiled by barcoding projects
– FISH-BOL from FishBase, CoF
– MBI mosquito catalog

• Nomenclators

• NameBank

• New names in publications

• Eventually, central registries (e.g., ZooBank)

Page 30:

Provisional Species ID

• Uncertain identifications

• Species complexes

• Newly discovered variants

• Ecogenomic samples

• Need general guidelines to ensure:
– Globally unique
– Stable, retrievable
– Can’t be confused with valid species name
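
One way to picture the three guideline goals is a labeling scheme that can be checked mechanically. The sketch below is purely hypothetical, since the slide says the guidelines still need to be written. The `Genus sp. ISSUER-serial` pattern, the issuer code, the function names, and the simple binomial test are all assumptions made for illustration.

```python
import re

# Very rough test for "looks like a Latin binomial": capitalized genus
# followed by a lower-case epithet. Real nomenclature has more cases.
BINOMIAL = re.compile(r"[A-Z][a-z]+ [a-z-]+")


def make_provisional_id(genus, issuer, serial):
    """Build a label such as 'Aus sp. LAB1-000042' (hypothetical scheme)."""
    return f"{genus} sp. {issuer}-{serial:06d}"


def looks_like_valid_name(label):
    """True if the whole label could be mistaken for a formal species name."""
    return BINOMIAL.fullmatch(label) is not None


def register(label, registry):
    """Add a label to the registry, enforcing uniqueness and non-confusability."""
    if looks_like_valid_name(label):
        raise ValueError(f"{label!r} could be confused with a valid species name")
    if label in registry:
        raise ValueError(f"{label!r} is already registered")
    registry.add(label)
    return label
```

Here uniqueness is enforced against a single in-memory set; a real scheme would need a shared registry, and stability would depend on never reusing or reassigning a serial.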

Page 31:

BARCODE Records in INSDC

[Diagram. Core elements: Barcode Sequence; Voucher Specimen; Species Name; Specimen Metadata; Literature (link to content or citation)]

[Linked resources shown: Indices (Catalogue of Life, GBIF/ECAT); Nomenclators (Zoo Record, IPNI, NameBank); Publication links (New species); Databases (Provisional sp.); Georeference; Habitat; Character sets; Images; Behavior; Other genes; Trace files; Primers; Other Databases (Phylogenetic, Pop’n Genetics, Ecological)]

Page 32:

Improving links to taxonomic journals

Connecting taxonomic articles

Page 33:

Links to Taxonomic Literature

• Library-Laboratory meeting in London, 2005, on electronic access to taxonomic literature

• Led to formation of Biodiversity Heritage Library initiative

• Proactive steps with PubMed to add taxonomic journals to online abstracts

• Aggressive negotiation with publishers of barcoding papers

• Involvement in Encyclopedia of Life

Page 34:

Long-term data curation of BARCODE records

[Flowchart. Boxes, in extraction order: Data records assembled; IDs consistent with other records?; Compliant with BARCODE standards?; Data records released on INSDC; Data records published in BOLD; Community feedback; Update records (audit trail of species names retained); CBOL control of BARCODE flag; GenBank adds BARCODE flag]
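
Two ideas in this flowchart, a species-name audit trail and a BARCODE flag that can be set or withdrawn, can be sketched as a small data structure. This is an assumption-laden illustration, not a description of how GenBank or BOLD store records. The `CuratedRecord` class, its fields, and the placeholder accession and names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CuratedRecord:
    """Hypothetical record that keeps every earlier species name."""
    accession: str
    species_name: str
    barcode_flag: bool = False
    name_history: list = field(default_factory=list)  # (old name, reason)

    def rename(self, new_name, reason):
        """Update the species name while retaining the previous one."""
        self.name_history.append((self.species_name, reason))
        self.species_name = new_name

    def set_barcode_flag(self, compliant):
        """Set the flag when the record meets the standard, clear it otherwise."""
        self.barcode_flag = bool(compliant)


record = CuratedRecord("ACC0001", "Aus bus")
record.set_barcode_flag(True)
record.rename("Aus cus", "community feedback")
print(record.species_name, record.name_history)
```

Appending to a history list instead of overwriting the name means a later reader can see which identification a published analysis relied on at the time.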

Page 35:

Acknowledgements

Robert Hanner, University of Guelph, Chair of CBOL’s Database Working Group

Scott Federhen, NCBI Taxonomy Browser

Donald Hobern, Head of Informatics, GBIF