![Page 1: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/1.jpg)
Crowdsourcing to structure biological knowledge
Andrew Su, Ph.D.
Department of Molecular and Experimental Medicine
The Scripps Research Institute
ISI, USC
August 16, 2012
![Page 2: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/2.jpg)
Human genetics underlies human health
~3 billion bases
~23,000 genes
Molecular diagnostics & therapeutics
Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …
“Gene annotation”
![Page 3: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/3.jpg)
Structured gene annotations enable computation
Structured annotations
![Page 4: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/4.jpg)
Few genes are well annotated
[Figure: annotation counts per gene for 23,278 protein-coding genes, sorted by decreasing counts; series: Gene Ontology (GO) and PubMed; labeled values: 38%, 59%]
Most-annotated genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE
Data: NCBI gene2pubmed, August 2010
![Page 5: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/5.jpg)
Biocuration is a key annotation bottleneck
[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000]
![Page 6: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/6.jpg)
311,696 articles (1.5% of PubMed) have been cited by GO annotations
![Page 7: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/7.jpg)
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
![Page 8: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/8.jpg)
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), showing a Short Head and a Long Tail]
| Domain | Short Head | Long Tail |
|---|---|---|
| News | Newspapers | Blogs |
| Video | TV/Hollywood | YouTube |
| Product reviews | Consumer reports | Amazon reviews |
| Food reviews | Food critics | Yelp |
| Talent judging | Olympics | American Idol |
![Page 9: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/9.jpg)
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
![Page 10: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/10.jpg)
From crowdsourcing to structured data
The Gene Wiki
Biological Games
![Page 11: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/11.jpg)
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Utility
Users
Contributors
![Page 12: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/12.jpg)
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
![Page 13: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/13.jpg)
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure: series Editors and Edits; axes: editor count, edit count]
![Page 14: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/14.jpg)
A review article for every gene is powerful
Hyperlinks to related concepts
References to the literature
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
![Page 15: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/15.jpg)
Filtering, extracting, and summarizing PubMed
Documents
Concepts
![Page 16: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/16.jpg)
Document- and concept-centric text mining
Subject Object
Predicate
![Page 17: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/17.jpg)
Simple text mining for gene annotations
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel Gene Ontology annotations
2147 novel Disease Ontology annotations
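The pipeline on this slide (wikilink, GO exact match, Gene Wiki mapping, candidate assertion) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup tables, the page title, the sample sentence, and the function name `candidate_assertions` are assumptions made for illustration, and only the identifiers 334 and GO:0006897 come from the slide.

```python
import re

# Toy lookup tables (assumptions). Real tables would be built from the Gene
# Ontology term list and from the Gene Wiki page-to-gene mapping.
GO_TERMS = {"endocytosis": "GO:0006897"}        # GO term name -> GO id
PAGE_TO_ENTREZ = {"Example gene page": 334}     # hypothetical page title -> Entrez Gene id

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(page_title, wikitext):
    """Return (entrez_gene_id, go_id) pairs for wikilinks that exactly match a GO term name."""
    gene_id = PAGE_TO_ENTREZ.get(page_title)
    if gene_id is None:
        return []
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = GO_TERMS.get(target)
        if go_id and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


# Invented sample text for a gene page.
text = "The protein is involved in [[endocytosis]] and binds [[heparin|heparin sulfate]]."
assertions = candidate_assertions("Example gene page", text)
print(assertions)
```

Exact matching keeps the sketch simple; a link whose target is not a GO term name (here, heparin) produces no assertion.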
![Page 18: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/18.jpg)
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
![Page 19: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/19.jpg)
Dynamic queries across genes, diseases, SNPs
![Page 20: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/20.jpg)
![Page 21: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/21.jpg)
TOP 100 GENES
![Page 22: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/22.jpg)
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}
…
OMIM
PharmGKB
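The `#ask` query on this slide selects human proteins that are associated with breast cancer and that also have a SNP associated with breast cancer. The Python sketch below mimics that nested logic over small in-memory tables. All page names and facts are invented placeholders, and the category lookup is simplified to a direct disease label, so this only approximates how Semantic MediaWiki evaluates the query.

```python
# Hypothetical facts standing in for Gene Wiki+ semantic properties.
human_proteins = {"GeneA", "GeneB", "GeneC"}

# page -> disease categories it is associated with (invented data)
is_associated_with = {
    "GeneA": {"Breast_cancer"},
    "GeneB": {"Breast_cancer"},
    "GeneC": {"Asthma"},
    "snp1": {"Breast_cancer"},
    "snp2": {"Asthma"},
}

# gene page -> SNP pages linked through the HasSNP property (invented data)
has_snp = {"GeneA": {"snp1"}, "GeneB": {"snp2"}, "GeneC": {"snp2"}}


def associated(page, disease):
    """True if the page carries an is_associated_with link to the disease."""
    return disease in is_associated_with.get(page, set())


def ask(disease):
    """Proteins linked to the disease that also have a SNP linked to the same disease."""
    return sorted(
        gene
        for gene in human_proteins
        if associated(gene, disease)
        and any(associated(snp, disease) for snp in has_snp.get(gene, set()))
    )


print(ask("Breast_cancer"))
```

GeneB is excluded because its only SNP is linked to a different disease, which is the effect of the second, nested `<q>` condition in the query.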
![Page 23: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/23.jpg)
Gene Wiki+ for integrative queries
OMIM
PharmGKB
http://genewikiplus.org
mwsync
![Page 24: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/24.jpg)
From crowdsourcing to structured data
The Gene Wiki
Biological Games
![Page 25: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/25.jpg)
Not just the biomedical literature…
![Page 26: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/26.jpg)
BioGPS aggregates gene-centric information
http://biogps.org
![Page 27: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/27.jpg)
The plugin interface is simple and universal
KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
URL template
Gene entity
Rendered URL
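The step from URL template plus gene entity to rendered URL can be sketched in a few lines of Python. This is an illustration of the idea, not BioGPS code: the function name `render_plugin_url` and the dictionary layout of the gene entity are assumptions, and only the KEGG template is used because the other two templates are elided on the slide.

```python
import re
from urllib.parse import quote

# A placeholder looks like {{EntrezGene}} or {{Symbol}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Replace each {{Key}} in the template with the gene's identifier of that type."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no identifier for {key!r}")
        # Percent-encode so identifiers with special characters stay URL-safe.
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(substitute, template)


kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
gene = {"Symbol": "TP53", "EntrezGene": 7157}   # assumed shape of a gene entity

rendered = render_plugin_url(kegg_template, gene)
print(rendered)
```

Because a plugin is just a template string, registering a new database requires no code on the database's side, which is what makes the interface universal.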
![Page 28: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/28.jpg)
The plugin interface is simple and universal
![Page 29: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/29.jpg)
The plugin interface is simple and universal
![Page 30: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/30.jpg)
The plugin interface is simple and universal
![Page 31: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/31.jpg)
The plugin interface is simple and universal
![Page 32: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/32.jpg)
The plugin interface is simple and universal
Total of 389 gene-centric online databases registered as BioGPS plugins
![Page 33: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/33.jpg)
BioGPS has a critical mass of users
• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: daily pageviews]
![Page 34: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/34.jpg)
All resources should provide RDF…
![Page 35: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/35.jpg)
Mining structured content from HTML
![Page 36: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/36.jpg)
Defining a data extraction template
…
TP53 TNF APOE IL6 VEGF …EGFR TGFB1
![Page 38: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/38.jpg)
All resources should provide flat files…
![Page 39: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/39.jpg)
From crowdsourcing to structured data
The Gene Wiki
Biological Games
![Page 40: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/40.jpg)
http://www.flickr.com/photos/archana3k1/4124330493/
Seven million human hours
![Page 41: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/41.jpg)
Twenty million human hours
http://www.flickr.com/photos/ableman/2171326385/
![Page 42: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/42.jpg)
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
![Page 43: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/43.jpg)
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
![Page 46: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/46.jpg)
Using games to annotate gene-disease links
http://genegames.org
Click the related disease (hurry!)
If it's ‘right’, you get points, then on to the next question
![Page 47: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/47.jpg)
Dizeez players seem pretty smart…
In total:
• 207 unique gamers
• 1045 games played
• 8525 guesses
| # Occurrences | Gene | Disease |
|---|---|---|
| 7 | GAST | gastrinoma |
| 7 | RBP3 | retinoblastoma |
| 7 | SSX1 | synovial sarcoma |
| 6 | TG | Graves' disease |
| 6 | CRYGC | Cataract |
| 6 | SOX8 | mental retardation |
| 6 | WRN | Werner syndrome |
| 6 | ABL1 | leukemia |
| 6 | MLL3 | leukemia |
| 6 | SNAI2 | breast carcinoma |
[Table columns without surviving values: Pubmed, OMIM, PharmGKB, Gene Wiki]
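A table like the one on this slide can be produced by counting how often players pick the same gene-disease pair and then checking each pair against existing resources. The Python sketch below shows one way to do that. The guess log, the source contents, the placeholder names, and the `min_count` threshold are all invented for illustration and are not Dizeez data.

```python
from collections import Counter

# Invented game log: one (gene, disease) tuple per player guess.
guesses = (
    [("GENE_A", "disease_x")] * 3
    + [("GENE_B", "disease_y")] * 2
    + [("GENE_A", "disease_z")]
)

# Invented contents of existing resources, keyed by resource name.
sources = {
    "OMIM": {("GENE_A", "disease_x")},
    "Gene Wiki": {("GENE_A", "disease_x")},
}


def summarize(guesses, sources, min_count=2):
    """Rows of (occurrences, gene, disease, resources already containing the link)."""
    rows = []
    for (gene, disease), n in Counter(guesses).most_common():
        if n < min_count:
            continue  # ignore pairs guessed too rarely to suggest consensus
        found_in = sorted(
            name for name, links in sources.items() if (gene, disease) in links
        )
        rows.append((n, gene, disease, found_in))
    return rows


rows = summarize(guesses, sources)
for row in rows:
    print(row)
```

Pairs with many occurrences and an empty resource list are the interesting ones: repeated agreement among players on a link that no resource records yet.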
![Page 48: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/48.jpg)
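Dizeez players seem pretty smart…
| # Occurrences | Gene | Disease |
|---|---|---|
| 5 | MECOM | sarcoma |
| 4 | ATF7 | cancer |
| 3 | ABCB5 | acute myeloid leukemia |
| 3 | SART1 | glioblastoma |
| 3 | NCK1 | leukemia |
| 3 | NEK1 | cancer |
[Table columns without surviving values: Pubmed, OMIM, PharmGKB, Gene Wiki]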
![Page 49: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/49.jpg)
GenESP: Two-player annotation games
![Page 50: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/50.jpg)
COMBO: Genomic predictors for disease
[Figure: samples labeled cancer and normal; find patterns, then make predictions on new samples (cancer or normal)]
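The two steps on this slide, finding patterns in labeled samples and then predicting labels for new samples, correspond to training and applying a classifier. The slides do not say which method COMBO uses, so the Python sketch below substitutes a nearest-centroid classifier as a simple stand-in. The two-gene expression values and the function names are invented for illustration.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit(samples, labels):
    """Find patterns: one mean expression profile per class label."""
    by_label = {}
    for features, label in zip(samples, labels):
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, sample):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(sample, center))
    return min(model, key=lambda label: dist2(model[label]))


# Invented training data: expression of two genes per sample.
train = [[5.0, 1.0], [4.5, 1.2], [1.0, 4.0], [1.2, 4.4]]
labels = ["cancer", "cancer", "normal", "normal"]

model = fit(train, labels)
print(predict(model, [4.8, 0.9]), predict(model, [0.9, 4.1]))
```

Which genes go into the feature vector determines how well the predictor separates the classes, and selecting that gene set is the kind of decision a game can hand to players.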
![Page 51: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/51.jpg)
COMBO: Genomic predictors for disease
![Page 52: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/52.jpg)
COMBO: Genomic predictors for disease
![Page 53: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/53.jpg)
COMBO: Genomic predictors for disease
![Page 54: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/54.jpg)
COMBO: Genomic predictors for disease
![Page 55: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/55.jpg)
COMBO: Genomic predictors for disease
![Page 56: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/56.jpg)
COMBO: Genomic predictors for disease
![Page 57: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/57.jpg)
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
![Page 58: Crowdsourcing to structure biological knowledge (USC/ISI)](https://reader033.vdocuments.site/reader033/viewer/2022052505/554e8076b4c905f66a8b5487/html5/thumbnails/58.jpg)
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Erik Clarke
Ben Good
Salvatore Loguercio
Ian Macleod
Chunlei Wu
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
Summer internships for students!
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/