What's True For E. coli… Enlisting The Community In Ongoing Genome Annotation Jim Hu EcoliHub/EcoliWiki Texas A&M University

Post on 31-Dec-2015

Why more E. coli websites?

• The number of E. coli databases is large
• Extensive coverage exists for many aspects of E. coli biology
• Journals contain half a century of E. coli data
• Don't we already know everything?

• Points 1-3: the problem isn't the amount of information, it's finding it

• Point 4: No

The diversity of information on different genomes, proteins, phenotypes and so on makes it difficult to keep track of all details.

Molecular Systems Biology 3:128 (2007)

Why more E. coli websites?

• Part of what we don't know yet is how the things we do know fit together

• Most of us need help mining what's out there

1-2:30 today: Session 173/K, Poster K-133, Board 0542. EcoliHub: Development of the Information Resource

Problems and approaches

• Finding data from different resources
  – EcoliHub - information from collaborating biological electronic data resources
• Making data curation faster, cheaper, and better
  – EcoliWiki - community annotation for E. coli K-12
• Community functional curation for cross-species comparison
  – GONUTS - a community Gene Ontology resource

Integrating information from multiple sites

• EcoliHub is based on web services

• A user query to EcoliHub is passed on to participating sites

• EcoliHub gathers the responses and assembles output for the user

http://ecolihub.org or

http://ecolicommunity.org
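
The fan-out and gather pattern described on this slide can be sketched roughly as follows. This is a minimal illustration, not EcoliHub's actual implementation: the two site functions, their names, and the canned records they return are hypothetical stand-ins for real member-site web services, which would be reached over the network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical stand-ins for member-site web services. Real sites would be
# queried over the network; these return canned records for illustration.
def query_site_a(gene: str) -> List[str]:
    return [f"{gene}: pathway record"] if gene == "lacZ" else []

def query_site_b(gene: str) -> List[str]:
    return [f"{gene}: regulon record"] if gene in ("lacZ", "araC") else []

SITES: Dict[str, Callable[[str], List[str]]] = {
    "site_a": query_site_a,
    "site_b": query_site_b,
}

def hub_query(gene: str,
              sites: Dict[str, Callable[[str], List[str]]] = SITES
              ) -> Dict[str, List[str]]:
    """Pass one user query to every participating site and gather the responses."""
    with ThreadPoolExecutor() as pool:
        # Submit the same query to all sites so they are asked concurrently.
        futures = {name: pool.submit(fn, gene) for name, fn in sites.items()}
        results: Dict[str, List[str]] = {}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=5)
            except Exception:
                # A slow or failing site should not break the combined answer.
                results[name] = []
        return results

if __name__ == "__main__":
    print(hub_query("lacZ"))
```

In this sketch a site that fails or times out simply contributes an empty list, so the user still sees whatever the other sites returned.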

Integrating information from multiple sites

• But the users won't have to start at the EcoliHub site

• EcoliHub will provide the infrastructure to help member sites do peer-to-peer queries

[Figure: "Who has info?" / "Try EcoCyc and RegulonDB"]

• The users don't need to know or care about the EcoliHub
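
One way to picture the infrastructure behind peer-to-peer queries is a directory that member sites consult to learn which other sites hold a given kind of data. The sketch below is an assumption about how such a lookup could work, not a description of EcoliHub's design: the `HubRegistry` class, its methods, and the data-type tags attached to each site are all hypothetical.

```python
from typing import Dict, List, Set

class HubRegistry:
    """Hypothetical directory of which member sites hold which kinds of data."""

    def __init__(self) -> None:
        self._holdings: Dict[str, Set[str]] = {}

    def register(self, site: str, data_types: List[str]) -> None:
        # A site may register more than once; its data types accumulate.
        self._holdings.setdefault(site, set()).update(data_types)

    def who_has(self, data_type: str) -> List[str]:
        # Answer "who has info?" with an alphabetical list of site names.
        return sorted(site for site, kinds in self._holdings.items()
                      if data_type in kinds)

# Illustrative tags only; the real sites' holdings are not specified here.
registry = HubRegistry()
registry.register("EcoCyc", ["regulation", "pathways"])
registry.register("RegulonDB", ["regulation", "operons"])
registry.register("GenoBase", ["functional genomics"])

print(registry.who_has("regulation"))  # ['EcoCyc', 'RegulonDB']
```

A member site would then query the returned sites directly, so its users never visit the hub themselves.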

What kinds of nodes are connected to EcoliHub?

• So far:
  – EcoCyc
    • everything E. coli; professionally curated
  – EcoGene*
    • everything E. coli; professionally curated
  – GenoBase
    • functional genomics and resources
  – EcoliPredict
    • protein structure models
  – OU GenExpDB
    • transcriptomes, experimental data
  – RegulonDB*
    • operons and regulons
  – EcoliWiki
    • everything E. coli; community curated
  – GONUTS
    • Community curation of the Gene Ontology; not just E. coli
• More coming…

The need for annotation is growing

“What is true of Escherichia coli is true of the elephant” - Jacques Monod

“Thanks to annotation creep, what’s false for E. coli is false for the elephant too” - Jim Hu

http://www.pasteur.fr/infosci/archives/mon/im_ele.html

People are limiting for annotation

• Major model organism databases (MODs: EcoCyc, SGD, WormBase, FlyBase, MGI, ZFIN, TAIR, etc.) employ large numbers of PhD-level curators
• This model is problematic for the future of biocuration, and not just for E. coli
  – Curators are expensive
    • NIH and NSF cannot afford to staff every organism at this level
  – Broad expertise across all areas is hard
    • Curators have to read papers in areas they were not trained in
    • Curators may not recognize the significance of papers in areas they were not trained in
• Can we make it:
  – cheaper?
  – faster?
  – better?

The Wikipedia approach

• Get your user community to work for free!
• Many groups have tried community annotation, with mixed success (at best)
• Wikipedia has added more than a million articles in English since I made the first version of this slide!

EcoliWiki

http://ecoliwiki.org or .net or .com

or come from EcoliHub

EcoliWiki philosophy

• Any registered user can edit
• Any registered user can register new users
• Any registered user can create new pages
• It's easier to revise than to create new content
  – Seed content from other places, mostly EcoCyc

But won't that invite chaos?

GenBank's managers are dead set against letting users into GenBank's files, however. They say there already are procedures to deal with errors in the database, and researchers themselves have created secondary databases that improve on what GenBank has to offer. "That we would wholesale start changing people's records goes against our idea of an archive," says David Lipman, director of the National Center for Biotechnology Information (NCBI), GenBank's home in Bethesda, Maryland. "It would be chaos."

Correct compared to what?

[Figure: screenshots labeled "NCBI RefSeq" and "Wikipedia"]

This is how biology achieves fidelity

A collage of books I haven’t read

Biology Wikis are proliferating

Participation is the major challenge

• Anyone can edit ≠ Anyone will edit
• Wikipedia: a tiny fraction of the users edit anything
  – A tiny fraction of those do major editing
  – Really big denominator
• Outreach to increase our user base
• Tools to make it easier to edit
• Biggest difference from other systems:
  – Partial annotations are wanted
  – It doesn't matter if you don't know the wiki markup
  – It doesn't matter if what you're adding isn't fully worked out
    • Someone else can fix it
    • And you can fix what others write

Community annotation for everyone

• What if I don't work on E. coli?
• Community annotation of gene function via the Gene Ontology
• Gene Ontology Normal Usage Tracking System (GONUTS)
• http://gowiki.tamu.edu
• Annotation pages based on UniProt IDs
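
To make the idea of annotation pages keyed on UniProt IDs concrete, here is a small sketch of a possible data model. It also illustrates the point that partial annotations are welcome and can be completed later by someone else. The class names, fields, and `annotate` helper are hypothetical and do not describe the actual GONUTS schema; the accession, GO term, and evidence code are example values only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class GoAnnotation:
    go_id: str
    evidence: Optional[str] = None    # evidence code; may be added later
    reference: Optional[str] = None   # supporting paper; may be added later

    def is_complete(self) -> bool:
        return self.evidence is not None and self.reference is not None

@dataclass
class AnnotationPage:
    uniprot_id: str
    annotations: List[GoAnnotation] = field(default_factory=list)

# One page per protein, looked up by its UniProt accession.
pages: Dict[str, AnnotationPage] = {}

def annotate(uniprot_id: str, annotation: GoAnnotation) -> AnnotationPage:
    """Add an annotation, creating the protein's page if it does not exist yet."""
    page = pages.setdefault(uniprot_id, AnnotationPage(uniprot_id))
    page.annotations.append(annotation)
    return page

# A first contributor adds only the GO term (a partial annotation).
page = annotate("P00722", GoAnnotation("GO:0004565"))

# A second contributor later fills in the evidence code.
page.annotations[0].evidence = "IDA"
```

Keying pages on a protein accession rather than on an organism-specific gene name is what would let the same structure serve communities that do not work on E. coli.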

The future of EcoliHub and EcoliWiki

• Making the resource more useful to the community
  – incorporating more resources
  – providing integration workflows
  – teaching users how to use them
  – adding content people want
• Making the approach available to other biology communities
  – reusable open source tools
  – public web services

E. COLI

2008

don't forget the acknowledgements!

Thanks to

• EcoliWiki/GONUTS Team
  – Chris Elsik
  – Gwen Knapp
  – Debby Siegele
  – Daniel Renfro
  – Jerry Tsai
  – Xiaotao Qu
  – Rosemarie Swanson
  – Anand Venkatraman
  – Adrienne Zweifel
• Sabbatical hosts
  – SGD/Stanford
  – Stein Lab/CSHL
• GO consortium
• EcoliHub Team Leaders
  – Barry Wanner, PI, Purdue
  – Walid Aref, co-PI, Purdue
  – Tyrell Conway, co-PI, Oklahoma
  – Mike Gribskov, co-PI, Purdue
  – Peter Karp, co-PI, SRI
  – Daisuke Kihara, co-PI, Purdue
• Funding: NIH U24-GM077905

URLs: http://ecolihub.org

http://ecoliwiki.org

http://gowiki.tamu.edu