Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation SAB 2008 WormBase Literature Curators Textpresso

Upload: caitlin-smith

Post on 02-Jan-2016


Page 1: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation

SAB 2008

WormBase Literature Curators Textpresso

Page 2: How does data get into WormBase?

User submission (email, web forms)

First-pass curation

Institution: Sanger Institute

SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/

COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......

Page 3: Current first-pass curation pipeline

Publication

Flagging/Triage

Curation

Page 4: User submissions: first-pass flagging/triage

Growing desire among biocurators for user submissions

The first people to know what data are in a paper are the authors

TAIR – partnered with Plant Physiology on a web interface for data submission (February 2008); voluntary, link included in acceptance letter

Submitter

email

Paper identifier

Locus name

Term/descriptor, method
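The fields listed on this slide suggest a simple record for an author-submitted flag. The following is a minimal sketch only: the class name `FlagSubmission`, the `problems` helper, the validation rules, and the sample values are all hypothetical and are not taken from the TAIR or WormBase forms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlagSubmission:
    """One author-submitted first-pass flag (hypothetical record layout)."""
    submitter: str
    email: str
    paper_id: str
    locus_name: str
    term: str          # term/descriptor chosen by the author
    method: str = ""   # optional: how the result was obtained

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list if none)."""
        issues = []
        # Assumed rule: these four fields must not be blank.
        for name in ("submitter", "paper_id", "locus_name", "term"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        # Deliberately crude check; real forms would validate more carefully.
        if "@" not in self.email:
            issues.append("email looks malformed")
        return issues


# Placeholder values, invented purely for illustration.
ok = FlagSubmission("A. Author", "author@example.org", "paper-0001",
                    "abc-1", "expression pattern", "reporter fusion")
bad = FlagSubmission("", "not-an-address", "paper-0001", "abc-1", "phenotype")
```

A curator-side script could call `problems()` on each incoming record and route only clean submissions into the triage queue.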

Page 5: User-submitted first-pass flags - WormBase

Page 6: User data-submission forms: Expression Pattern

Page 7: Data extraction: Textpresso

Full-text searching

Keywords and/or categories

Müller, Kenny, and Sternberg. PLoS Biology, November 2004.

Page 8: Textpresso: What data types?

Paper – entity association: pattern matching

Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12

Fact extraction: specialized categories

Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.

GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
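The paper – entity association by pattern matching described on this slide can be sketched with a regular expression. This is an illustration, not the pattern Textpresso or WormBase actually uses. It assumes transgene names consist of a one- to three-letter lowercase laboratory prefix, then `Is` or `Ex`, then a number. The function name `paper_transgene_links` and the sample sentence are invented; the paper ID and transgene names are the ones shown on the slide.

```python
import re

# Assumed naming pattern: lab prefix + Is/Ex + number, e.g. gqIs3, oxIs12.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def paper_transgene_links(papers):
    """Map each paper ID to the sorted, de-duplicated transgene-like names
    found anywhere in its text."""
    return {
        paper_id: sorted(set(TRANSGENE_RE.findall(text)))
        for paper_id, text in papers.items()
    }


# The sentence below is made up for illustration only.
papers = {
    "WBPaper00031242": (
        "Strains carrying gqIs3, gqIs35 and oxIs12 were examined; "
        "gqIs3 was outcrossed before use."
    ),
}
links = paper_transgene_links(papers)
```

Each resulting (paper, name) pair is a candidate connection; on the slides, connections found this way were checked manually before entering the database.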

Page 9: Textpresso-mediated CC curation: from sentences to annotations

Page 10: Textpresso: How much data?

Transgenes: 1,100 new paper-transgene connections; 250 new transgenes

Checked manually – 95% accuracy; ultimately, connections will go directly into the database

Genetic Interactions: 1,875 (1/2007 – 5/2008); ~5,600 total interactions; keeping current with new papers

GO Cellular Component Annotations: 515 (1/2007 – 5/2008); 2-3X the rate prior to categories; nearly complete; keeping up with new data (1-2 hours/week)

Page 11: Textpresso: Other data types

How else can we use Textpresso?

Other data types: Molecular Function Assays, Gene Product Interactions

Pilot: GO molecular function annotations for protein kinase activity

keyword: phosphorylate

category: C. elegans proteins

13 new GO annotations/hour

Extension of this: protein modifications – not yet captured in WB

Pilot: Gene product interactions for WB and BIND

keywords: physically interact

category: C. elegans proteins

310 matches in 237 documents

22 physical interactions – top 15 papers
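The pilots on this slide combine a keyword with a category. A minimal sketch of that idea follows, assuming a category is simply a set of known terms and that a hit is a sentence containing both the keyword stem and at least one category term. The helper names, the naive sentence splitter, the protein names `ABC-1` and `XYZ-2`, and the sample text are all hypothetical. Textpresso's own category matching is more elaborate and is not reproduced here.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keyword_category_hits(text, keyword_stem, category_terms):
    """Return (sentence, matched_terms) for every sentence that contains the
    keyword stem (case-insensitive) and at least one term from the category."""
    stem = keyword_stem.lower()
    hits = []
    for sentence in split_sentences(text):
        if stem not in sentence.lower():
            continue
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        matched = sorted(tokens & category_terms)
        if matched:
            hits.append((sentence, matched))
    return hits


# Invented protein names and text, for illustration only.
proteins = {"ABC-1", "XYZ-2"}
text = ("ABC-1 phosphorylates XYZ-2 in vitro. "
        "ABC-1 is expressed in neurons. "
        "Phosphorylation was not detected in controls.")
hits = keyword_category_hits(text, "phosphorylat", proteins)
```

Only the first sentence qualifies: the second lacks the keyword and the third lacks a category term. A curator would then review each returned sentence before making an annotation.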

Page 12: Textpresso for triage: Classifying text based on content

Multiple strategies (using existing first-pass papers as training set):

Organismal triage – C. elegans, Drosophila

Identify, prioritize information-rich papers

Flag for specific data types

Multiple levels:

Machine learning – SVM (Support Vector Machine)

Word frequency analysis

Hand-crafted categories

Combine SVM and categories

Supplement with word weighting, contextual analyses
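Of the strategies listed, word frequency analysis is the simplest to illustrate. The sketch below is one possible reading, not the method used by the curators: it learns a smoothed log-odds weight per word from papers already flagged (positive) or not flagged (negative) for a data type, then scores a new text by summing the weights. The function names and the tiny training texts are invented, and the SVM approach is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def train_log_odds(positive_docs, negative_docs):
    """Per-word log-odds of appearing in positive vs. negative documents,
    with add-one smoothing so unseen counts do not produce log(0)."""
    pos = Counter(w for doc in positive_docs for w in tokenize(doc))
    neg = Counter(w for doc in negative_docs for w in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum of word weights; words never seen in training contribute 0."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


# Toy training set standing in for existing first-pass flags (invented text).
flagged = ["rnai knockdown caused a lethal phenotype",
           "mutant phenotype was embryonic lethal"]
unflagged = ["we review the history of the field",
             "a commentary on funding policy"]
weights = train_log_odds(flagged, unflagged)
```

A positive score suggests the text resembles previously flagged papers; the cut-off for flagging would need to be tuned against curated data.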

Page 13: Keeping better track of curation statistics.....

Page 14: .....and making curation statistics more transparent to users.

Users could search for curation status of any paper

Users could search for curation status of a given data type

Each database release would report newly curated papers

Each database release would document increases in data-type curation

Page 15: WormBase Literature Curation

Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger

Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech

Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech

First Pass, Genetic Interactions: Andrei Petcherski, Caltech

Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech

Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech

Anatomy Ontology, Cell Function: Raymond Lee, Caltech

Microarrays, SAGE: Igor Antoshechkin, Caltech

Sequence, Gene Structures: Sanger, Wash U

Authors, Papers: Cecilia Nakamura, Daniel Wang

Curation Tools, Database: Juancarlos Chan, Caltech