CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Wikipedia and Wiktionary as Resources for Chemical Text Mining. Roger Sayle and Daniel Lowe, NextMove Software, Cambridge, UK. 250th ACS National Meeting, Boston, MA. Sunday 16th August 2015

Upload: nextmove-software

Post on 11-Feb-2017


Page 1: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Wikipedia and Wiktionary as Resources for Chemical Text Mining

Roger Sayle and Daniel Lowe

NextMove Software, Cambridge, UK

250th ACS National Meeting, Boston, MA. Sunday 16th August 2015

Page 2: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Author’s preface

• This talk is a personal note of appreciation for the incredible work done by the volunteers of the Wikipedia community and the Wikimedia Foundation.


Page 3: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #1: drug name benchmark


Tradename  Count  Scientific name          Count  Ratio
Levitra      474  Vardenafil                2610   5.51
Celebrex    5952  Celecoxib                16990   2.85
Aricept     2466  Donepezil                 7164   2.91
Aldara       550  Imiquimod                 4889   8.89
Cozaar       882  Losartan                  9444  10.71
Benicar      176  Olmesartan medoxomil      2132  12.11
Detrol       579  Tolterodine               2812   4.86
Lescol      2261  Fluvastatin              14067   6.22
Casodex     2549  Bicalutamide             10979   4.31
Tarceva     3480  Erlotinib                 5441   1.56
Patanol      124  Olopatadine               1223   9.86
Cialis      3562  Tadalafil                 2105   0.59
Diovan       871  Valsartan                 7196   8.26
Avapro       662  Irbesartan                6633  10.02
Flovent      484  Fluticasone propionate   11468  23.69
Bextra      1591  Valdecoxib                7450   4.68
Aciphex      331  Rabeprazole               2522   7.62
Lamisil      463  Terbinafine               6031  13.03
Vigamox       70  Moxifloxacin              2508  35.83
Femara      2044  Letrozole                 8228   4.03
Zomig        221  Zolmitriptan              2840  12.85
Zofran      1021  Ondansetron               1270   1.24
Spiriva      404  Tiotropium                5018  12.42
Coreg        639  Carvedilol                6050   9.47
Atacand      614  Candesartan               6408  10.44
Paxil       2027  Paroxetine               10981   5.42
Nasonex      214  Mometasone furoate        8275  38.67
Sustiva     2912  Efavirenz                 6746   2.32
Arimidex    2376  Anastrozole               9973   4.20
Reyataz      558  Atazanavir                2084   3.73
Avg                                                9.28

In 2011, I published a paper in collaboration with AstraZeneca on drug patents. One analysis compared the scientific names and tradenames of drugs, based on Hattori et al. 2008; this list became a name-to-structure benchmark.
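
The Ratio column is not defined on the slide, but the figures are consistent with dividing the scientific-name count by the tradename count. A minimal sketch that recomputes three rows of the table under that assumption (the function name is made up for illustration):

```python
# (tradename, tradename count, scientific name, scientific-name count),
# copied from three rows of the table above.
ROWS = [
    ("Levitra", 474, "Vardenafil", 2610),
    ("Cialis", 3562, "Tadalafil", 2105),
    ("Nasonex", 214, "Mometasone furoate", 8275),
]


def name_ratio(trade_count, scientific_count):
    """Assumed definition: scientific-name count divided by tradename count."""
    return scientific_count / trade_count


# Round to two decimals, as in the table.
ratios = {trade: round(name_ratio(tc, sc), 2) for trade, tc, _, sc in ROWS}
print(ratios)
```

A ratio below 1 (Cialis) means the tradename is counted more often than the scientific name.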

Page 4: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #1: drug name benchmark

• Various commercial name-to-structure software and databases were evaluated with varying degrees of precision and recall on this dataset.

• Surprisingly, the best performer at the time was a short Python script from Pat Walters at Vertex that looked up the name on Wikipedia, tested for the presence of a Chembox containing a PubChem CID, and then retrieved the SMILES from the PubChem web server at the NCBI!
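
The original script is not reproduced in the slides. The following is only a rough sketch of the same three steps, written under several assumptions: that the page's raw wikitext is fetched with `action=raw`, that the infobox stores the identifier in a `PubChem` field, and that the SMILES is requested through PubChem's PUG REST `CanonicalSMILES` property. The sample wikitext is illustrative, and the network functions are defined but not called.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoints; neither URL comes from the original script.
WIKI_RAW = "https://en.wikipedia.org/w/index.php?action=raw&title={title}"
PUBCHEM_SMILES = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
                  "{cid}/property/CanonicalSMILES/TXT")

# Matches an infobox line such as "| PubChem = 2244" (field name assumed).
CID_FIELD = re.compile(r"^\s*\|\s*PubChem\s*=\s*(\d+)", re.MULTILINE)

# Illustrative wikitext, not copied from a live page.
SAMPLE_WIKITEXT = """{{Chembox
| Name = Aspirin
|Section1={{Chembox Identifiers
| CASNo = 50-78-2
| PubChem = 2244
}}
}}"""


def extract_pubchem_cid(wikitext):
    """Return the first PubChem CID found in the wikitext, or None."""
    match = CID_FIELD.search(wikitext)
    return int(match.group(1)) if match else None


def fetch_text(url):
    """Download a URL as text; returns None on an HTTP error such as 404."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "name2smiles-sketch/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError:
        return None


def name_to_smiles(name):
    """Wikipedia page -> PubChem CID in the infobox -> SMILES from PubChem."""
    wikitext = fetch_text(WIKI_RAW.format(title=urllib.parse.quote(name)))
    if wikitext is None:
        return None
    cid = extract_pubchem_cid(wikitext)
    if cid is None:
        return None
    smiles = fetch_text(PUBCHEM_SMILES.format(cid=cid))
    return smiles.strip() if smiles else None


print(extract_pubchem_cid(SAMPLE_WIKITEXT))
```

One limitation of this sketch: raw wikitext for a redirect page contains only the redirect directive, so tradenames that redirect to the scientific-name page would need the redirect to be followed first.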


Page 5: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining


Boxes and templates and categories, oh my!

Page 6: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #2: drug-repurposing

• One early application of NextMove’s LeadMine text mining software was an investigation into drug repurposing at the University of New Mexico.

• Thesis: one drug’s side effect is another drug’s therapeutic intervention.

• An initial proof-of-concept was “dry mouth” with applications to dental/oral surgery.

• Repurposing FDA-approved drugs that cause dry mouth would reduce the clinical trials and paperwork needed to reach market.


Page 7: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #2: drug-repurposing

• The problem is that neither the ICD-9/ICD-10 databases nor the Human Disease Ontology list “dry mouth” as an entry or an entry synonym.

• Wikipedia, on the other hand, “knows everything”, and the page on “dry mouth” contains a Diseasebox with cross-references to ICD.
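
As an illustration of how such cross-references might be harvested, here is a small sketch that pulls ICD-9 codes out of a disease infobox. The template name, the `ICD9` field and the `{{ICD9|…}}` wrapper are assumptions about the wikitext layout, and the sample text is illustrative rather than copied from the live page.

```python
import re

# Illustrative wikitext; template and field names are assumptions.
SAMPLE_DISEASEBOX = """{{Infobox disease
| Name  = Xerostomia
| ICD10 = {{ICD10|K|11|7|k|00}}
| ICD9  = {{ICD9|527.7}}
}}"""

# Captures the code inside each {{ICD9|...}} wrapper.
ICD9_CODE = re.compile(r"\{\{ICD9\|([^|}]+)")


def icd9_codes(wikitext):
    """Return every ICD-9 code referenced in the wikitext, in page order."""
    return [code.strip() for code in ICD9_CODE.findall(wikitext)]


print(icd9_codes(SAMPLE_DISEASEBOX))
```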


Page 8: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #3: multilingual support

• Daniel will be presenting the poster “Chemistry Enabling Chinese, Japanese and Korean Patents”, at SciMix tomorrow evening.

• The Wikipedia translations are a fantastic resource for non-English technical nomenclature.

– 甲烷 methane

– 乙烷 ethane

– 丙烷 propane

– 二环[2.2.2]辛-2-烯 bicyclo[2.2.2]oct-2-ene

– 1,4-二氢-4-吡啶亚基 1,4-dihydro-4-pyridylidene
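
One plausible way to collect such pairs is through the interlanguage links exposed by the MediaWiki API (`action=query&prop=langlinks`); the slides do not say how the pairs were actually gathered. The sketch below parses a response in the API's default JSON layout. The sample response is hand-written for illustration and the function name is hypothetical.

```python
import json

# Hand-written sample in the shape returned by
# https://zh.wikipedia.org/w/api.php?action=query&prop=langlinks&lllang=en&format=json&titles=...
SAMPLE_RESPONSE = json.loads("""
{"query": {"pages": {
  "1": {"title": "甲烷", "langlinks": [{"lang": "en", "*": "Methane"}]},
  "2": {"title": "乙烷", "langlinks": [{"lang": "en", "*": "Ethane"}]},
  "3": {"title": "丙烷"}
}}}
""")


def translations(response, lang="en"):
    """Map each page title to the title of its linked page in `lang`."""
    pairs = {}
    for page in response.get("query", {}).get("pages", {}).values():
        # Pages without an interlanguage link have no "langlinks" key.
        for link in page.get("langlinks", []):
            if link.get("lang") == lang:
                pairs[page["title"]] = link["*"]
    return pairs


print(translations(SAMPLE_RESPONSE))
```

Pages with no link in the target language are simply skipped, so the resulting dictionary is a partial one.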


Page 9: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #4: named reactions

• NextMove Software’s NameRxn tool is used to name and categorize chemical reactions, such as those recorded in a pharmaceutical company’s ELN.

• A fantastic resource in achieving this is the Royal Society of Chemistry’s RXNO ontology.

• Alas, although a number of Wikipedia pages on reactions contain RXNO identifiers, these boxes have been removed from several (potentially due to competing academic or financial interests).


Page 10: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Example #5: parts of speech

• Many term dictionaries and ontologies used in text mining contain only singular nouns.

• Wiktionary is a useful resource for parts of speech.

• Plurals

– octopuses

• Adjective forms

– oral, cervical, renal, hepatic

– hypertensive, demented

• Alas, the consistency of annotations such as “(anatomy)” is poor.


Page 11: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

BioCreative V competition

• This week is the BioCreative V community-wide assessment of chemical and biological text mining.

• There are tracks for annotating chemicals in patents and for drug-disease associations in PubMed.

• NextMove Software makes use of Wikipedia to perform synonymous term expansion of diseases in the MeSH ontology.


Page 12: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Resolved synonym expansion

• Methodology 1

– Determine the MeSH IDs corresponding to diseases (MeSH tree C) and mental disorders (MeSH tree F03).

– Find Wikipedia pages with disease/symptom boxes that contain one of these MeSH IDs.

– Associate the page title and all redirects with that MeSH ID

• Methodology 2

– Find pages whose name matches a MeSH synonym

– Associate all redirects to that page with the MeSH ID of the aforementioned synonym
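
The two methodologies above can be sketched on toy data as follows. This is not the LeadMine implementation: the data structures, the function names, the case-insensitive title matching and the example entries are all assumptions made for illustration.

```python
# Toy stand-ins for data that would come from MeSH and a Wikipedia dump.
DISEASE_MESH_IDS = {"D014987", "D009203"}               # IDs under trees C and F03
MESH_SYNONYMS = {"myocardial infarction": "D009203"}    # casefolded synonym -> ID
PAGE_MESH_ID = {"Xerostomia": "D014987"}                # page -> MeSH ID in its box
PAGE_TITLES = {"Xerostomia", "Myocardial infarction"}
REDIRECTS = {                                           # redirect title -> target page
    "Dry mouth": "Xerostomia",
    "Heart attack": "Myocardial infarction",
}


def redirects_to(page):
    """All redirect titles that point at the given page."""
    return [source for source, target in REDIRECTS.items() if target == page]


def methodology_1():
    """Pages whose box carries a disease MeSH ID: keep the title and its redirects."""
    found = {}
    for page, mesh_id in PAGE_MESH_ID.items():
        if mesh_id in DISEASE_MESH_IDS:
            for name in [page] + redirects_to(page):
                found[name] = mesh_id
    return found


def methodology_2():
    """Pages whose title matches a MeSH synonym: keep the redirects to that page."""
    found = {}
    for page in PAGE_TITLES:
        mesh_id = MESH_SYNONYMS.get(page.casefold())
        if mesh_id is not None:
            for name in redirects_to(page):
                found[name] = mesh_id
    return found


dictionary = {**methodology_1(), **methodology_2()}
print(dictionary)
```

In this toy run "Dry mouth" is picked up through the box on its target page, while "Heart attack" is picked up only because its target page title matches a MeSH synonym.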


Page 13: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Resultant dictionary

• 31,699 disease name/MeSH ID relationships extracted.

• 20,611 not present in our pre-existing MeSH/Human Disease Ontology derived dictionary.

• Allowed some HumanDO terms that lacked links to MeSH to be linked to MeSH IDs.


Page 14: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Results (mentions) (BioCreative V CDR training + development set)

Type                        Precision  Recall  F-measure
Wikipedia                   0.8301     0.6425  0.7244
MeSH + HumanDO              0.9038     0.6548  0.7594
MeSH + HumanDO + Wikipedia  0.8622     0.7256  0.788


Page 15: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Results (concepts) (BioCreative V CDR training + development set)

Type                        Precision  Recall  F-measure
Wikipedia                   0.7932     0.6128  0.6914
MeSH + HumanDO              0.9155     0.6705  0.7741
MeSH + HumanDO + Wikipedia  0.851      0.7308  0.7863
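
The slides do not state how the F-measure was computed, but the reported values in both results tables are consistent with the balanced F-measure, the harmonic mean of precision and recall: F = 2PR / (P + R). A short check, with the figures copied from the mentions and concepts tables:

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (table, dictionary, precision, recall, reported F-measure)
REPORTED = [
    ("mentions", "Wikipedia", 0.8301, 0.6425, 0.7244),
    ("mentions", "MeSH + HumanDO", 0.9038, 0.6548, 0.7594),
    ("mentions", "MeSH + HumanDO + Wikipedia", 0.8622, 0.7256, 0.788),
    ("concepts", "Wikipedia", 0.7932, 0.6128, 0.6914),
    ("concepts", "MeSH + HumanDO", 0.9155, 0.6705, 0.7741),
    ("concepts", "MeSH + HumanDO + Wikipedia", 0.851, 0.7308, 0.7863),
]


def max_discrepancy(rows=REPORTED):
    """Largest gap between a recomputed and a reported F-measure."""
    return max(abs(f_measure(p, r) - f) for _, _, p, r, f in rows)


print(f"largest |recomputed - reported| = {max_discrepancy():.5f}")
```

In both tables, adding Wikipedia to the MeSH + HumanDO dictionary trades some precision for a larger gain in recall, which is why the combined F-measure is the highest of the three.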


Page 16: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Semantic Issues

• Redirects are not semantic: rather than being a synonym, a redirect can come from a related concept, e.g.

– Treatment of the disease

– Detection of the disease

– Particular outbreak of the disease

• A redirect to a section of a page can indicate a related concept, but can also indicate a sub-type of the disease.

• Differences in classification granularity, e.g. “heart disease” redirects to the page on “Cardiovascular disease”.


Page 17: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Garbage in, Garbage out

• The original dictionary had “gambling”: MeSH has the same concept ID for pathological gambling (a mental disorder) and gambling (a specific instance of risk-taking behaviour).

• Hence Wikipedia allowed all terms related to gambling to be retrieved, e.g. gambler, gamble, gambling den…


Page 18: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Summary

• The resources provided by Wikipedia and Wiktionary can supplement more traditional lexicons and ontologies.

• Although mostly unstructured, the use of boxes, categories, templates and redirects adds significant value.

• Fingers crossed, the advantages provided by Wikipedia-augmented disease dictionaries are sufficient for LeadMine to do well at BioCreative.


Page 19: CINF 18: Wikipedia and Wiktionary as resources for chemical text mining

Acknowledgements

• The Wikimedia Foundation

• Pat Walters, Vertex Pharmaceuticals, Boston, MA.

• Jeremy Yang, UNM, Albuquerque, NM.

• The rest of the team at NextMove Software

– Noel O’Boyle

– John May

• Many thanks for your time.
