grandrounds2004.ppt
Mining Medical Mountains: How Bioinformatics Can Help
Medical Science
David Wishart
University of Alberta
The Library of Congress
• 120 million items in storage
• 54 million manuscripts
• 18 million books
• 12 million photographs
• 4.5 million maps
• 4.4 million technical reports
• 1.1 million PhD dissertations
• ~20 Terabytes of data
Some Numbers…
• 3 scientific journals in 1750
• 120,000 scientific journals today
• 500,000 medical articles/year
• 4,000,000 scientific articles/year
• 14,000,000 abstracts in PubMed derived from 4600 journals
• 3,307,998,701 web pages on Google
• 500,000,000,000,000 bytes on the Web
Some Numbers…
• A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer.
• Baasiri, R.A., Glasser, S.R., Steffen, D.L. & Wheeler, D.A. Oncogene 18, 7958-7965 (1999)
Some Graphs:
Multiplexed CE with Fluorescent detection
ABI 3700 96x700 bases
Genomes
• 5 vertebrates (human, mouse, rat, fugu)
• 2 plants (Arabidopsis, rice)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (Plasmodium, Guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 140 bacteria and archaebacteria
• 1000+ viruses
The Human Genome
• 3.2 billion bases on 24 chromosomes
• 3,201,762,515 bases sequenced (99%)
• 23,531 - 31,609 genes (predicted)
• 50,000+ named genes (synonyms)
• 4000+ human diseases
• 850-1039 disease-causing genes (identified)
A Tidal Wave of Data
Made worse by….
The Language of Biology
• The EGF receptor binds epidermal growth factor which triggers the phosphorylation of PLC-gamma followed by the binding and subsequent phosphorylation of Grb2 and SOS which leads to the formation of a Raf1-MEK complex which, in turn, leads to a p21ras auto-phosphorylation cascade. The complex then phosphorylates a MAP kinase which is transported to the nucleus via a nuclear transport signal which triggers the transcription of c-Fos, c-Myc and c-Jun which upon release in the rough ER are transported to…
How To Make Sense of This?
• How to acquire biological or medical knowledge from English text?
• How to build facts and relationships from scientific/medical articles?
• How to put 100+ years of useful data into readily accessible electronic repositories (the back fill problem)?
Some Solutions
• Text Mining…
• Create electronic repositories of abstracts and articles (PubMed/Entrez)
• Create glossaries & thesauri of terms
• Employ machine learning methods to parse electronic text to extract or interpret key pieces of “atomic” information (SVM, Naïve Bayes, Reference Point Logistics, etc.)
PubMed
http://www.ncbi.nlm.nih.gov/PubMed/
PubMed
• Allows users to search by journal, keywords, titles, etc.
• Uses MeSH (Medical Subject Headings) to allow automated search of synonyms (renal transplant = kidney transplantation)
• API available to query PubMed automatically and remotely
• Few users know how to use PubMed properly or to its full extent
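As a rough illustration of querying PubMed automatically and remotely, the sketch below only builds a search URL and does not contact the network. It assumes NCBI's E-utilities ESearch endpoint, which the slide does not name; the function name and the default `retmax` value are illustrative choices.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not named on the slide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, retmax=20):
    """Return an ESearch URL asking PubMed for record IDs that match `term`."""
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # the same query syntax used in the web interface
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return ESEARCH + "?" + urlencode(params)


url = build_pubmed_search_url('"ouellette bf"[au] AND yeast')
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return the matching PubMed IDs; that step is left out so the sketch runs offline.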
"ouellette bf"[au] AND yeast
Details
MeSH: Medical Subject Heading
("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR "saccharomyces cerevisiae"[MeSH Terms]) OR yeast[Text Word]))
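The expansion shown in the Details view can be imitated with a small lookup table. This is a toy sketch, not PubMed's actual translation logic: the synonym dictionary holds only the one mapping visible on the slide, and the function names are hypothetical.

```python
# Toy synonym table containing only the mapping shown on the slide.
# Real MeSH mappings are maintained by NLM, not by a hand-written dict.
MESH_SYNONYMS = {
    "yeast": ["yeasts", "saccharomyces cerevisiae"],
}


def expand_term(word):
    """Expand a free-text word into its MeSH headings OR the plain text word."""
    headings = MESH_SYNONYMS.get(word.lower(), [])
    text_clause = f"{word}[Text Word]"
    if not headings:
        return text_clause
    mesh_clause = " OR ".join(f'"{h}"[MeSH Terms]' for h in headings)
    if len(headings) > 1:
        mesh_clause = f"({mesh_clause})"
    return f"({mesh_clause} OR {text_clause})"


def translate(author, word):
    """Combine an author filter with the expanded search word."""
    return f'("{author}"[au] AND {expand_term(word)})'


print(translate("ouellette bf", "yeast"))
```

With the single mapping above, the printed string matches the translated query on the slide.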
Integrated Text/Sequence Searching with Entrez
PubCrawler
http://www.pubcrawler.ie/
PubCrawler
• Free "alerting" service that scans daily updates to the NCBI Medline (PubMed) and GenBank databases
• Lists new database entries that match search parameters (keywords, author names, etc.) specified by the user
• Results are presented as an HTML Web page (Entrez-like format)
• Can be downloaded or run as a service
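The alerting idea can be sketched in a few lines: keep a saved profile of search terms, compare it against each day's new entries, and report only entries not seen before. This is an assumption-laden illustration, not PubCrawler's implementation; the record layout, function names, and sample records are invented.

```python
def matches(profile, record):
    """True if every keyword in the profile occurs in the title or author list."""
    text = (record["title"] + " " + " ".join(record["authors"])).lower()
    return all(keyword.lower() in text for keyword in profile)


def daily_alert(profile, new_records, seen_ids):
    """Return unseen records that match the profile and mark them as seen."""
    hits = []
    for rec in new_records:
        if rec["id"] not in seen_ids and matches(profile, rec):
            hits.append(rec)
            seen_ids.add(rec["id"])
    return hits


# Invented records standing in for a daily database update.
todays_update = [
    {"id": "r1", "title": "Yeast two-hybrid screen", "authors": ["Doe J"]},
    {"id": "r2", "title": "Mouse genome notes", "authors": ["Roe A"]},
]
seen = set()
for hit in daily_alert(["yeast"], todays_update, seen):
    print(hit["id"], hit["title"])
```

Running the same update a second time returns nothing, because the matching record ID is already in `seen`.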
MedMiner
http://discover.nci.nih.gov/textmining/filters.html
MedMiner
• A text miner that filters, extracts and organizes relevant sentences in the literature based on a gene, gene-gene or gene-drug query
• Combines GeneCards and PubMed searches with an integrated text filter
• L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein, (1999) BioTechniques 27:1210-1217.
MedGene
http://hipseq.med.harvard.edu/MEDGENE/login.jsp
MedGene
• A list of human genes associated with a particular human disease in ranking order
• A list of human genes associated with multiple human diseases in ranking order
• A list of human diseases associated with a particular human gene in ranking order
• A list of human genes associated with a particular human gene in ranking order
• The sorted gene list from other disease related high-throughput experiments, (i.e. micro-array
MedGene Performance
• Was able to identify >2400 genes associated with breast cancer in the literature
• Existing databases only list 260 genes (of which MedGene found 240)
• Could save hundreds of hours of literature searching & combing
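The figures above imply two simple ratios, worked out here. The slide gives ">2400", so the second ratio is a lower bound; the variable names are illustrative.

```python
known_genes = 260        # genes listed in existing databases
recovered = 240          # of those, how many MedGene also found
literature_hits = 2400   # slide says ">2400", so treat this as a lower bound

recall = recovered / known_genes            # share of known genes recovered
expansion = literature_hits / known_genes   # literature hits per known gene

print(f"Recall against existing databases: {recall:.1%}")
print(f"Literature-derived list is at least {expansion:.1f}x larger")
```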
PolySearch
PolySearch
• Searches over 14 million PubMed Records
• Searches against 1622 diseases (and synonyms)
• Searches using 9300 genes with 42,500 synonyms
• Assesses quality using SCI list of impact factors for 8600+ journals
PolySearch
• Supports PubMed text searching for gene & disease associations (user provides disease name)
• Automatically scores & identifies genes and searches for known SNPs or mutations against standard databases
• Grabs gene sequences and generates primers around SNPs
• Archives (MySQL database) or sends results as HTML page to user
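One plausible way to score gene and disease associations is to count abstracts that mention both, weighting each by the journal's impact factor. The slides do not describe PolySearch's actual scoring, so everything below is a hypothetical sketch: the function name, the weighting rule, the placeholder gene names, and the sample abstracts are all invented.

```python
from collections import defaultdict


def rank_genes(abstracts, disease_terms, gene_synonyms, impact_factors):
    """Rank genes by impact-weighted co-occurrence with a disease.

    abstracts: list of (journal, text) pairs
    gene_synonyms: {gene: [names the gene may appear under]}
    impact_factors: {journal: weight}; unknown journals get weight 1.0
    """
    disease_terms = [d.lower() for d in disease_terms]
    scores = defaultdict(float)
    for journal, text in abstracts:
        lowered = text.lower()
        if not any(d in lowered for d in disease_terms):
            continue  # abstract does not mention the disease
        weight = impact_factors.get(journal, 1.0)
        for gene, names in gene_synonyms.items():
            if any(n.lower() in lowered for n in names):
                scores[gene] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented inputs for illustration only.
abstracts = [
    ("Journal A", "GENE1 is mutated in disease X."),
    ("Journal B", "GENE2 and alpha-1 are altered in disease X."),
    ("Journal A", "GENE2 in an unrelated condition."),
]
ranking = rank_genes(
    abstracts,
    disease_terms=["disease x"],
    gene_synonyms={"GENE1": ["GENE1", "alpha-1"], "GENE2": ["GENE2"]},
    impact_factors={"Journal A": 5.0, "Journal B": 2.0},
)
print(ranking)
```

Plain substring matching is used to keep the sketch short; a real system would need tokenization to avoid matching gene names inside longer words.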
Other Examples of Text or Web Mining
http://textomy.iit.nrc.ca/
Pre-BIND
• Donaldson et al. BMC Bioinformatics 2003 4:11
• Used Support Vector Machine (SVM) to scan literature for protein interactions
• Precision, accuracy and recall of 92% for correctly classifying PI abstracts
• Estimated to capture 60% of all abstracted protein interactions for a given organism
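For readers less familiar with the three metrics quoted above, this sketch computes them from a confusion matrix. The counts are invented so that all three come out at 92%; they are not taken from the Pre-BIND paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # predicted positives that are right
    recall = tp / (tp + fn)                     # true positives that were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct calls
    return precision, recall, accuracy


# Hypothetical counts for 200 abstracts, 100 of which describe an interaction.
precision, recall, accuracy = classification_metrics(tp=92, fp=8, fn=8, tn=92)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```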
Proteome Analyst
• Uses Naïve Bayes methods in combination with sequence homology to identify “tokens” or nuggets of important information from text (titles, keywords, InterPro numbers and other data)
• Produces quantitative estimates (queryable reliability scores) of protein function, location, etc.
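A minimal Naïve Bayes classifier over tokens shows how such reliability scores can arise. This is a generic textbook sketch with add-one smoothing, not Proteome Analyst's code; the class name, tokens, and location labels are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over tokens, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, tokens, label):
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        """Return {label: posterior probability} for a bag of tokens."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # class prior
            for t in tokens:
                if t in self.vocab:      # ignore tokens never seen in training
                    score += math.log((self.token_counts[label][t] + 1) / denom)
            log_scores[label] = score
        # Convert log scores into normalised probabilities.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


# Invented training tokens; real input would come from database annotations.
nb = NaiveBayes()
nb.train(["dna-binding", "transcription", "zinc-finger"], "nucleus")
nb.train(["transmembrane", "receptor", "signal-peptide"], "membrane")
scores = nb.predict(["transcription", "zinc-finger"])
print(scores)
```

The returned probabilities play the role of a reliability score: they can be stored and filtered, for example keeping only predictions above a chosen threshold.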
GenePublisher
• Processes raw genechip data and produces a publishable report in 1-2 hours of processor time
• Mines existing databases to build up or extract relationships
• Learns from previous analyses and remembers previous associations
http://www.cbs.dtu.dk/services/GenePublisher/
GenePublisher Output
Continuing Problems in Text Mining Biomedical Literature are…
A Serious Naming Problem
• Sonic Hedgehog
• Draculin
• Profilactin
• Knobhead
• Lunatic Fringe
• Fidgetin
• Mortalin
• Antiquitin
• Accelerin
• Cockeye
• Clootie Dumpling
• SnaFu
• Gleeful
• Bang Senseless
• Bride of Sevenless
• Crack
• Christmas Factor
• Orphanin
And Exotic Terminology…
• J. Med. Genetics 10, 1962-6 (1973) “Mobius Syndrome with Poland’s Anomaly.”
• Heavy use of eponyms (Werner’s syndrome, Down’s syndrome, Angelman’s syndrome, Creutzfeldt-Jakob disease, etc.)
Some Challenges
• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?
• How to ascribe and name a function, process or location consistently?
• How to describe interactions, partners, reactions and complexes?
• How to classify genes & proteins (a universal taxonomy of sequences and structures)?
Some Solutions
• Develop controlled or restricted vocabularies (IUPAC-like naming conventions)
• Create thesauri, central repositories or synonym lists (MeSH terms in PubMed)
• Work towards synoptic reporting and structured abstracting
Synoptic or Structured Abstract
J Am Acad Dermatol. 2004 Mar;50(3):431-4.
Demand outstrips supply of US pediatric dermatologists: Results from a national survey.
Hester EJ, McNealy KM, Kelloff JN, Diaz PH, Weston WL, Morelli JG, Dellavalle RP.
BACKGROUND: The US pediatric dermatology workforce was last examined in 1986 when limited employment
opportunity was found. OBJECTIVE: We sought to re-examine pediatric dermatology workforce issues. METHODS:
US dermatology chairpersons and residency program directors were surveyed for: (1) agreement with pediatric
dermatology workforce statements; and (2) pediatric dermatology faculty and fellow numbers. RESULTS: Respondents
agreed that having a pediatric dermatologist or dermatologists on faculty is important, and that a shortage of pediatric
dermatologists exists, but did not agree that increasing pediatric dermatology training requirements will increase this
shortage. Almost half of the programs (45/94) employed a full-time pediatric dermatologist, and 24 programs had
currently been recruiting a pediatric dermatologist for more than 1 year. Only 6 pediatric dermatology fellows were
in training. CONCLUSION: Given that open pediatric dermatology faculty positions greatly exceed the number of
fellows in training and that formal training requirements will be increasing, the shortage of pediatric dermatologists
will likely continue.
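Because a structured abstract labels its sections, a program can split it without any language understanding. The sketch below assumes the five labels used in the abstract above and applies them to a shortened excerpt of it; the function name is illustrative.

```python
import re

# Section labels as they appear in the abstract above.
LABELS = ("BACKGROUND", "OBJECTIVE", "METHODS", "RESULTS", "CONCLUSION")


def split_structured_abstract(text):
    """Map each section label to its text, in order of appearance."""
    pattern = re.compile(r"\b(" + "|".join(LABELS) + r"):\s*")
    # Collapse line breaks first, then split on the labels.
    parts = pattern.split(" ".join(text.split()))
    # parts looks like [preamble, label1, body1, label2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


# Shortened excerpt of the abstract shown on the slide.
excerpt = """BACKGROUND: The US pediatric dermatology workforce was last
examined in 1986 when limited employment opportunity was found. OBJECTIVE:
We sought to re-examine pediatric dermatology workforce issues. RESULTS:
Only 6 pediatric dermatology fellows were in training."""

sections = split_structured_abstract(excerpt)
for label, body in sections.items():
    print(f"{label}: {body}")
```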
GO-Gene Ontology
• To produce a controlled vocabulary that changes as biological knowledge changes
• Categorizes according to 1) molecular function; 2) biological process; and 3) cellular component
• Represents contributions and consensus opinions from multiple experts in various fields
• Aim is to have every known protein and gene annotated consistently
http://www.geneontology.org/
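The three-way categorization can be mirrored in a small data structure that rejects anything outside the controlled vocabulary. This is an illustrative sketch rather than an official GO tool; the class and function names and the gene name are invented, while the two GO identifiers are real terms (nucleus and DNA binding).

```python
from dataclasses import dataclass

# The three GO categories listed above.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class Annotation:
    gene: str
    go_id: str
    term: str
    aspect: str

    def __post_init__(self):
        # Enforce the controlled vocabulary for the aspect field.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {self.aspect}")


def by_aspect(annotations):
    """Group annotation terms under each of the three GO aspects."""
    grouped = {aspect: [] for aspect in ASPECTS}
    for ann in annotations:
        grouped[ann.aspect].append(ann.term)
    return grouped


# "geneX" is a made-up gene used only for illustration.
annotations = [
    Annotation("geneX", "GO:0003677", "DNA binding", "molecular_function"),
    Annotation("geneX", "GO:0005634", "nucleus", "cellular_component"),
]
print(by_aspect(annotations))
```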
NIH’s Medical Ontology Research Program
http://lhncbc.nlm.nih.gov/lhc/servlet/Turbine/template/home%2CHome.vm
MeSH
OMIM
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
DrugBank
http://redpoll.pharmacy.ualberta.ca/drugbank/
Bioinformatics
Medinformatics
Conquering the Mountain