centralized model organism database (biocuration 2014 poster)

1
A Centralized Model Organism Database (CMOD) for the Long Tail of Genomes ABSTRACT Andrew I. Su, Benjamin M. Good, Chinmay Naik and Adriel Carolino The Scripps Research Institute, La Jolla, California, USA Background How Gene Wiki? We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924). CONTACT Benjamin Good: [email protected] , @bgood Andrew Su: [email protected] , @andrewsu How Gene Wiki? The CMOD vision GENE WIKI EXAMPLE ABSTRACT Progress and status CONCLUSION One: structure from text mining The Dark Matter of genome annotation We need more hands on deck! We have multiple positions open for postdocs and programmers interested in crowdsourcing and bioinformatics projects (like CMOD)! 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 2017 2019 2021 2023 2025 1 10 100 1000 10000 100000 1000000 Bacteria Eukaryotes Archaea Model organism databases (MODs) are fantastic resources for organizing genomic information for commonly-studied organisms. To facilitate the creation and maintenance of MODs, the Generic Model Organism Database (GMOD) Project provides “a set of interoperable open-source software components for visualizing, annotating, and managing biological data.” Provide a database of the world’s knowledge that anyone can edit. - Denny Vrandečić Despite the obvious success and value of GMOD, the number of sequenced genomes is growing exponentially. Does this model scale with the rate of genome sequencing? Figure courtesy Scott Cain Wikidata (http://wikidata.org ) is an innovative and important new tool for community-based knowledge management. Wikidata is supported by the Wikimedia Foundation, which also operates Wikipedia. In short, Wikidata is to structured data what Wikipedia is to free text. Model organism databases are fantastic resources for genomics researchers. But relatively few model organisms have stable funding for their database, and the number of sequenced genomes is increasing exponentially. It seems impractical to create and fund a model organism database for each of them. Here, we describe our efforts to build a Centralized Model Organism Database (CMOD), a single online resource to support all genomes and organisms. To scale to the Long Tail of Genomes, CMOD employs an open editing model in which the entire research community is empowered to edit and maintain genomic data. We describe our efforts to systematically populate CMOD with two core data types across all organisms – genome annotations and Gene Ontology annotations. We propose to build a Centralized Model Organism Database (CMOD), which would house gene and genome annotations for all genomes. This database would be based on Wikidata, enabling it to be community-curated , continuously-updated , and computer-readable . CMOD Gene and genome annotations CMOD data can be accessed using a number of mechanisms. The Wikidata web interface offers convenient access using a web browser. The Wikidata application programming interface (API) and associated programming libraries allow programmers and bioinformaticians computational access to the data. Wikidata export to RDF offers compatibility with the Semantic Web and Linked Data. We also envision that many popular GMOD tools , including Gbrowse, Jbrowse, and and WebApollo, can be modified to use CMOD as the back-end data warehouse. Wikidata Wikidata currently catalogs over 14 million entities, and describes those entities in the form of 27 million statements. This knowledgebase is the product of over 50 million edits. Of those edits, ~90% are contributed by bots that predominantly import data from structured resources, and 10% are contributed by human editors. This seminal paper identified 517 operons and 103 small regulatory RNAs in Listeria monocytogenes, an important human pathogen. Unfortunately, these annotations cannot be downloaded from the Broad’s “Listeria monocytogenes Database”, nor NCBI Genome, nor UCSC’s Microbial Genome Browser, nor EnsemblBacteria, nor any GMOD instance. The only place they are available is from the Supplementary information on the Nature website in PDF format . We have loaded gene and genome annotation data for ~1000 human genes, the human proteins they encode, and their mouse orthologs according to the data model shown above. The code repository for managing these data is available at https:// bitbucket.org/sulab/wikidatagenebot . The Skeptic’s Corner Will CMOD scale with the exponential growth in sequenced genomes? Yes, because there is no gatekeeper to adding new content. Anyone is empowered to directly contribute. Even though the technical infrastructure is centralized, the data management is highly distributed. Who will contribute to CMOD? We envision a wide spectrum of contributors, from large biocuration/annotation centers adding large data sets, to individual bioinformaticians who deposit structured versions of previously unstructured data, to individual scientists contributing individual annotations. Will CMOD content be trustworthy? Like Wikipedia, we expect that Wikidata overall will asymptotically approach perfect accuracy and completeness. Moreover, Managing genomic information and knowledge is a critical challenge for biomedical research. Community infrastructure that allows individuals to collaboratively and collectively organize knowledge has the potential to be an enabling technology in biological research. Here, we propose CMOD as one such application that is particularly focused on the Long Tail of sequenced genomes. Cumulative number of sequenced genomes

Upload: andrew-su

Post on 10-May-2015

585 views

Category:

Science


0 download

DESCRIPTION

A Centralized Model Organism Database (CMOD) for the Long Tail of Genomes Presented at Biocuration 2014 in Toronto http://biocuration2014.events.oicr.on.ca/ See related slides at http://www.slideshare.net/andrewsu/20140116-gmod-short

TRANSCRIPT

Page 1: Centralized Model Organism Database (Biocuration 2014 poster)

A Centralized Model Organism Database (CMOD) for the Long Tail of Genomes

ABSTRACT

Andrew I. Su, Benjamin M. Good, Chinmay Naik and Adriel Carolino

The Scripps Research Institute, La Jolla, California, USA

Background

How Gene Wiki?

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924).

CONTACT

Benjamin Good: [email protected], @bgood

Andrew Su: [email protected], @andrewsu

How Gene Wiki? The CMOD visionGENE WIKI EXAMPLEABSTRACT

FUNDING

Progress and status

CONCLUSION

One: structure from text miningThe Dark Matter of genome annotation

We need more hands on deck! We have multiple positions open for postdocs and programmers interested in crowdsourcing and bioinformatics projects (like CMOD)!

1997

1999

2001

2003

2005

2007

2009

2011

2013

2015

2017

2019

2021

2023

2025

1

10

100

1000

10000

100000

1000000

Bacteria

Eukaryotes

Archaea

Model organism databases (MODs) are fantastic resources for organizing genomic information for commonly-studied organisms. To facilitate the creation and maintenance of MODs, the Generic Model Organism Database (GMOD) Project provides “a set of interoperable open-source software components for visualizing, annotating, and managing biological data.”

Provide a database of the world’s knowledge that

anyone can edit.- Denny Vrandečić

Despite the obvious success and value of GMOD, the number of sequenced genomes is growing exponentially. Does this model scale with the rate of genome sequencing?

Figure courtesy Scott Cain

Wikidata (http://wikidata.org) is an innovative and important new tool for community-based knowledge management. Wikidata is supported by the Wikimedia Foundation, which also operates Wikipedia. In short, Wikidata is to structured data what Wikipedia is to free text.

Model organism databases are fantastic resources for genomics researchers. But relatively few model organisms have stable funding for their database, and the number of sequenced genomes is increasing exponentially. It seems impractical to create and fund a model organism database for each of them. Here, we describe our efforts to build a Centralized Model Organism Database (CMOD), a single online resource to support all genomes and organisms. To scale to the Long Tail of Genomes, CMOD employs an open editing model in which the entire research community is empowered to edit and maintain genomic data. We describe our efforts to systematically populate CMOD with two core data types across all organisms – genome annotations and Gene Ontology annotations.

We propose to build a Centralized Model Organism Database (CMOD), which would house gene and genome annotations for all genomes. This database would be based on Wikidata, enabling it to be community-curated, continuously-updated, and computer-readable.

CMOD

Gene and genome annotations

CMOD data can be accessed using a number of mechanisms. The Wikidata web interface offers convenient access using a web browser. The Wikidata application programming interface (API) and associated programming libraries allow programmers and bioinformaticians computational access to the data. Wikidata export to RDF offers compatibility with the Semantic Web and Linked Data. We also envision that many popular GMOD tools, including Gbrowse, Jbrowse, and and WebApollo, can be modified to use CMOD as the back-end data warehouse.

Wikidata

Wikidata currently catalogs over 14 million entities, and describes those entities in the form of 27 million statements. This knowledgebase is the product of over 50 million edits. Of those edits, ~90% are contributed by bots that predominantly import data from structured resources, and 10% are contributed by human editors.

This seminal paper identified 517 operons and 103 small regulatory RNAs in Listeria monocytogenes, an important human pathogen. Unfortunately, these annotations cannot be downloaded from the Broad’s “Listeria monocytogenes Database”, nor NCBI Genome, nor UCSC’s Microbial Genome Browser, nor EnsemblBacteria, nor any GMOD instance. The only place they are available is from the Supplementary information on the Nature website in PDF format.

We have loaded gene and genome annotation data for ~1000 human genes, the human proteins they encode, and their mouse orthologs according to the data model shown above. The code repository for managing these data is available at https://bitbucket.org/sulab/wikidatagenebot.

The Skeptic’s Corner

Will CMOD scale with the exponential growth in sequenced genomes? Yes, because there is no gatekeeper to adding new content. Anyone is empowered to directly contribute. Even though the technical infrastructure is centralized, the data management is highly distributed.

Who will contribute to CMOD? We envision a wide spectrum of contributors, from large biocuration/annotation centers adding large data sets, to individual bioinformaticians who deposit structured versions of previously unstructured data, to individual scientists contributing individual annotations.

Will CMOD content be trustworthy? Like Wikipedia, we expect that Wikidata overall will asymptotically approach perfect accuracy and completeness. Moreover, because provenance is a core part of the data model, the presence/absence/type of the reference can be used to systematically filter the knowledgebase according to each user’s needs.

Managing genomic information and knowledge is a critical challenge for biomedical research. Community infrastructure that allows individuals to collaboratively and collectively organize knowledge has the potential to be an enabling technology in biological research. Here, we propose CMOD as one such application that is particularly focused on the Long Tail of sequenced genomes.

Cumulative number of sequenced genomes