mining drug targets, structures and activity data

Mining Drug Targets, Structures and Activity Data Using Open Full-Text Patent Sources and Web Tools. Christopher Southan, ChrisDS Consulting, Göteborg, Sweden. Prepared for BioIT, Boston, April 2012, Track 11, Open Source Solutions, Wednesday, 13:45.

Upload: chris-southan

Post on 11-May-2015

DESCRIPTION

Presentation at BioIT 2012

TRANSCRIPT

Page 1: Mining Drug Targets, Structures and Activity Data

[1]

Mining Drug Targets, Structures and Activity Data Using Open Full-Text Patent Sources and Web Tools

Christopher Southan

ChrisDS Consulting, Göteborg, Sweden,

Prepared for BioIT, Boston, April 2012,

Track 11, Open Source Solutions, Wednesday, 13:45

Page 2: Mining Drug Targets, Structures and Activity Data

[2]

Introduction

Page 3: Mining Drug Targets, Structures and Activity Data

[3]

Key Relationships Extractable from Patents and Papers

Document > Assay > Result > Compound > Target

MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGAPLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGYYVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQRQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASVGGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQDLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKAASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLMGEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSSTGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRTAAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICALFMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK

2011 PMID 21569515

Important "bag of targets" exceptions (e.g. bacterial/parasite whole cells)

2010 doi:10.1007/978-3-642-15120-0_9

Page 4: Mining Drug Targets, Structures and Activity Data

[4]

The Good News: Patent Mining Utility

• Novel bioactive chemical structures related to drug discovery exceed those in journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
• ~ 70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~ 1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels.
• IP exploitable for Neglected Tropical Disease research becoming "open".

Page 5: Mining Drug Targets, Structures and Activity Data

[5]

The Bad News: Patent Mining Can be Tough

• High-specificity retrieval of relevant documents is difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
  – Qualitative or binned assay results
  – Structure-to-data links non-obvious, patchy or absent
  – Less than 50% of titles include target names
  – The "hiding the lead and core structures" game
  – Blunderbuss disease and use exemplifications
  – Tense ambiguity (i.e. "could be" vs. "was" done)
• Quality judgments difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise
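The kind-code part of the redundancy problem can be reduced with a little string handling. The following is a minimal sketch, assuming publication numbers follow a "two-letter country code, serial digits, optional kind code" layout; the regular expression, the helper names and the kind codes in the usage note are illustrative assumptions, not taken from any patent office specification. It does not resolve patent families or equivalents, which need family data from a patent portal.

```python
import re
from collections import defaultdict

# Assumed layout: country code, serial digits, optional kind code (letter + optional digit).
PUB_NUMBER = re.compile(r"^([A-Z]{2})\s*(\d+)\s*([A-Z]\d?)?$")


def split_publication(number):
    """Split a publication number such as 'EP2391601A1' into (country, serial, kind)."""
    cleaned = number.upper().replace("/", "").replace(",", "").replace("-", "").strip()
    match = PUB_NUMBER.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognised publication number: {number!r}")
    country, serial, kind = match.groups()
    return country, serial, kind or ""


def group_by_document(numbers):
    """Collapse publications that differ only in kind code onto one key."""
    grouped = defaultdict(set)
    for number in numbers:
        country, serial, kind = split_publication(number)
        grouped[country + serial].add(kind)
    return {doc: sorted(kinds) for doc, kinds in grouped.items()}
```

For example, `group_by_document(["EP2391601A1", "EP 2391601 B1", "ep2391601"])` collapses the three strings onto the single key `EP2391601` (the A1 and B1 kind codes here are hypothetical inputs used only to exercise the function).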

Page 6: Mining Drug Targets, Structures and Activity Data

[6]

Reasons for Rolling-your-own Patent Chemistry and Data Extraction

• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL, journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways complementary to commercial databases

Page 7: Mining Drug Targets, Structures and Activity Data

[7]

Open Sources and Tools Overview

• Searching metadata, abstracts and text
  – Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
  – Open full-text: FreePatentsOnline, Google patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up

(those below not included in this presentation but relevant)
• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al, SCRIPDB, Juristica group

(n.b. Google should give URLs for all these source and tool names)

Page 8: Mining Drug Targets, Structures and Activity Data

[8]

So What’s in PubChem ?

Page 9: Mining Drug Targets, Structures and Activity Data

[9]

PubChem Patent-derived Content ~6 million

• ~ 2.8 million Discovery Gate/Thomson Pharma intersect, mainly Derwent WPI pharmaceutical patents plus some journal extractions
• ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~ 3.5 million of these are Lipinski-ROF compliant
• ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~ 70% of these are Lipinski-ROF compliant
• ~ 90% of these have assay data
• ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
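For readers unfamiliar with the Lipinski rule-of-five (ROF) filter behind the compliance counts, here is a minimal sketch of the check on precomputed descriptors. The thresholds are the standard published ones (molecular weight ≤ 500, logP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10). How strictly PubChem applied the filter for these counts is not stated on the slide, so the `max_violations` parameter and the descriptor values in the usage note are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one structure (values supplied by the caller)."""
    mol_weight: float
    logp: float
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d):
    """Count how many of the four Lipinski thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_compliant(d, max_violations=0):
    """Strict compliance by default; pass max_violations=1 for the looser convention."""
    return rule_of_five_violations(d) <= max_violations
```

A hypothetical structure with `Descriptors(350.4, 3.2, 2, 5)` passes with no violations, whereas `Descriptors(650.0, 6.1, 2, 5)` exceeds both the weight and logP limits.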

Page 10: Mining Drug Targets, Structures and Activity Data

[10]

Chemistry > Patents in PubChem

Page 11: Mining Drug Targets, Structures and Activity Data

[11]

You found a CID, what are the Patent and Journal links?

PubChem > BioAssay/ChEMBL > CiteExplore/PubMed/ > PDB

Page 12: Mining Drug Targets, Structures and Activity Data

[12]

Patent Links from SLING and IBM

Page 13: Mining Drug Targets, Structures and Activity Data

[13]

PubChem > SureChem > Patent > Structure > Data > Target

Page 14: Mining Drug Targets, Structures and Activity Data

[14]

Target-Centric Patent Searching

Page 15: Mining Drug Targets, Structures and Activity Data

[15]

Synonym Recall

• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract "Beta secretase" = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin = 1383
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
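The synonym counts above can be reproduced as a small query-building exercise. This sketch assembles the OR-joined query from a synonym list and compares the sum of the single-synonym hit counts with the combined result. The `build_query` helper is hypothetical, and the parentheses it adds around the OR group are an assumption about operator precedence that should be checked against the search interface in use.

```python
def build_query(synonyms, required=None):
    """Join synonyms with OR, quoting multi-word terms; optionally AND a required term."""
    terms = [f'"{s}"' if " " in s else s for s in synonyms]
    query = " OR ".join(terms)
    if required:
        query = f"({query}) AND {required}"
    return query


# Title + abstract hit counts for each synonym, as quoted on the slide.
single_hits = {
    "BACE1": 97,
    "BACE2": 29,
    "BACE": 392,
    "Beta secretase": 1056,
    "memapsin": 87,
}
combined_hits = 1383  # OR-joined query, as quoted on the slide

# Documents retrieved by more than one synonym are counted repeatedly in the sum.
summed_hits = sum(single_hits.values())
overlap = summed_hits - combined_hits

print(build_query(["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"], "inhibitors"))
print(summed_hits, overlap)
```

The single-synonym counts sum to 1661 against 1383 for the combined query, so 278 hits are repeats across synonyms. No single synonym approaches the recall of the combined query.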

Page 16: Mining Drug Targets, Structures and Activity Data

[16]

Target Query > Patent Retrieval from Espacenet

Page 17: Mining Drug Targets, Structures and Activity Data

[17]

Linking Examples to Data in the Patent

Page 18: Mining Drug Targets, Structures and Activity Data

[18]

Extracting Chemical Structures

Page 19: Mining Drug Targets, Structures and Activity Data

[19]

IUPAC-to-structure: OPSIN

Result: the Example 31 structure is a 24 nM BACE1 inhibitor

Installable application

Also chemical dictionary conversions

Page 20: Mining Drug Targets, Structures and Activity Data

[20]

Image-to-structure: OSRA

• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs

Page 21: Mining Drug Targets, Structures and Activity Data

[21]

Follow-up Searching

Page 22: Mining Drug Targets, Structures and Activity Data

[22]

Structure Search in PubChem

SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)

Often see stereo differences to the Derwent entry in PubChem
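The same SMILES-based searches can also be scripted. The sketch below builds request URLs for PubChem's PUG REST service, which is not mentioned in the talk. The endpoint paths (`fastidentity`, `fastsimilarity_2d`) and the `identity_type` and `Threshold` options reflect my understanding of the PUG REST documentation and should be treated as assumptions to verify before use. Ignoring stereo with `same_connectivity` is one way to surface the stereo differences noted above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles, identity_type="same_connectivity"):
    """URL for CIDs sharing the query's connectivity (stereo and isotopes ignored)."""
    params = urlencode({"smiles": smiles, "identity_type": identity_type})
    return f"{PUG_REST}/compound/fastidentity/smiles/cids/TXT?{params}"


def similarity_search_url(smiles, threshold=95):
    """URL for CIDs at or above a 2D Tanimoto threshold (percent)."""
    params = urlencode({"smiles": smiles, "Threshold": threshold})
    return f"{PUG_REST}/compound/fastsimilarity_2d/smiles/cids/TXT?{params}"


def fetch_cids(url, timeout=30):
    """Fetch a TXT response and return the CIDs as integers (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return [int(line) for line in response.read().decode().split()]
```

Passing the SMILES as a URL-encoded query parameter avoids problems with characters such as "/" and "#" that cannot sit safely in a URL path.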

Page 23: Mining Drug Targets, Structures and Activity Data

[23]

PubChem Similarity "Walking"

• 2D and 3D different results
• Can do multiple steps
• Can "read" CID history
• Possible to "walk" between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
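A multi-step "walk" is in effect a breadth-first traversal over similarity neighbours. The sketch below shows the idea on a toy neighbour table. The compound labels are invented placeholders, and the `neighbours` callable stands in for whatever returns the similar compounds of a structure (for instance a PubChem similarity search), so nothing here reflects real PubChem data.

```python
from collections import deque


def similarity_walk(start, neighbours, max_steps=2):
    """Return {compound: steps_from_start} for everything reachable within max_steps."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_steps:
            continue  # do not expand beyond the step limit
        for nxt in neighbours(current):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    return seen


# Toy neighbour table with placeholder labels (hypothetical, not real CIDs).
TOY_NEIGHBOURS = {
    "query": ["a", "b"],
    "a": ["query", "c"],
    "b": [],
    "c": ["d"],
    "d": [],
}

reached = similarity_walk("query", lambda cid: TOY_NEIGHBOURS.get(cid, []), max_steps=2)
print(reached)
```

Recording the step count for each compound keeps track of how far an analogue sits from the starting structure, which helps when judging whether a hit in another patent is still relevant.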

Page 24: Mining Drug Targets, Structures and Activity Data

[24]

Direct Patent <> Chemistry

Page 25: Mining Drug Targets, Structures and Activity Data

[25]

SureChemOpen: Patent Retrieval

• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not bulk export)

Page 26: Mining Drug Targets, Structures and Activity Data

[26]

SureChemOpen, WIPO, OPSIN and PubChem

Result: 1 nM (?) BACE2 inhibitor with assay and synthesis details.

Page 27: Mining Drug Targets, Structures and Activity Data

[27]

SureChemOpen: Structure > Patent

Direct answers to: "which patents contain compounds similar to my query" and "show me all the compounds in these patents"

Page 28: Mining Drug Targets, Structures and Activity Data

[28]

Non-target Activity Data and Bulk Chemistry Extraction

Page 29: Mining Drug Targets, Structures and Activity Data

[29]

Malaria Query: CiteExplore > WIPO

Example 60, sub-200 nM potency, with solubility and clearance data
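Potency values pulled from patent text arrive in mixed units and spellings ("24 nM", "1nm", "0.2 uM"), so normalising them to a single unit makes examples comparable. The following is a minimal sketch, assuming a simple "number plus molar unit" pattern; the regular expression and the `parse_potency` helper are illustrative assumptions. Case-insensitive matching means a lowercase "nm" or "mm" could be a length rather than a concentration, so ambiguous hits still need checking by eye.

```python
import re

# Conversion factors to nanomolar, keyed by lower-cased unit.
UNIT_TO_NM = {"pm": 1e-3, "nm": 1.0, "um": 1e3, "µm": 1e3, "mm": 1e6}

# Optional < or > qualifier, a number, then a molar unit (case-insensitive).
POTENCY = re.compile(r"([<>]?)\s*(\d+(?:\.\d+)?)\s*([pnuµm]M)", re.IGNORECASE)


def parse_potency(text):
    """Return (qualifier, value_in_nM) for the first potency found, or None."""
    match = POTENCY.search(text)
    if match is None:
        return None
    qualifier, value, unit = match.groups()
    return qualifier, float(value) * UNIT_TO_NM[unit.lower()]
```

`parse_potency("24 nM BACE1 inhibitor")` yields an empty qualifier and 24.0, and a string such as "IC50 < 0.2 uM" is reported with the "<" qualifier and a value of about 200 nM.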

Page 30: Mining Drug Targets, Structures and Activity Data

[30]

Espacenet EP2391601 > ChemAxon Chemicalize.org

• Description URL from Espacenet pasted into Chemicalize.org

• Most of 74 examples converted

• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no exact match

• Claims section was Markush description so no relevant structures converted
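For reference, the Tanimoto coefficient behind the 95% figure is the number of features two structures share divided by the number of features present in either. The sketch below computes it on plain Python sets of feature identifiers. PubChem derives its own substructure fingerprints for this purpose, so the integer "features" here are toy stand-ins and the function is not a reimplementation of PubChem's search.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient: shared features over features present in either set."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Toy fingerprints: 19 shared features out of 20 in the union gives 0.95.
query = set(range(20))
analogue = set(range(19))
print(tanimoto(query, analogue))
```

At a 95% threshold an analogue must share nearly all fingerprint features with the query, which is why such hits are usually close members of the same series.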

Page 31: Mining Drug Targets, Structures and Activity Data

[31]

EP2391601 > Chemicalize > PubChem

• EP2391601 description text > Chemicalize SDF download > PubChem Structure Search upload = 311 structures
• Of these 206 have PubChem exact matches
• Of these 176 have Thomson Pharma matches
• The example cluster from the Thomson/Derwent extraction is ~ 15
• The example cluster from Chemicalize is ~ 90
• Ipso facto Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin

Chemicalize Similarity listing PubChem Tanimoto sub-cluster
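The counts on this slide can be checked with simple arithmetic. This sketch assumes the 176 Thomson Pharma matches are a subset of the 206 exact PubChem matches, which is how the "of these" wording reads but is not stated outright. The two cluster sizes are the approximate values quoted.

```python
# Counts quoted on the slide for EP2391601.
uploaded = 311             # structures uploaded to PubChem Structure Search
exact_matches = 206        # with an exact PubChem match
thomson_matches = 176      # assumed subset of exact_matches
cluster_chemicalize = 90   # approximate ("~ 90")
cluster_thomson = 15       # approximate ("~ 15")

# Extracted structures with no exact PubChem match.
no_exact_match = uploaded - exact_matches

# Exact matches deposited by sources other than Thomson Pharma.
matched_outside_thomson = exact_matches - thomson_matches

# Example-cluster members found by Chemicalize beyond the Thomson/Derwent set.
extra_in_cluster = cluster_chemicalize - cluster_thomson

print(no_exact_match, matched_outside_thomson, extra_in_cluster)
```

The difference between the two clusters comes to roughly 75, consistent with the slide's "at least 70 novel structures" given that both cluster sizes are approximate.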

Page 32: Mining Drug Targets, Structures and Activity Data

[32]

Tips and Tricks

Page 33: Mining Drug Targets, Structures and Activity Data

[33]

Tables and Recalcitrant IUPACs

PDF > Find tables > Snip image > Online OCR > WordPad > Chemicalize / OPSIN / OSRA

• iterative fixing of OCR errors (e.g. 1 vs l)

• cross-check Mw in the document
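Both tips can be partly mechanised. The sketch below generates spelling variants of an OCR'd name by swapping commonly confused characters, and computes a molecular weight from a simple molecular formula for comparison with the Mw printed in the document. The confusable-character table, the short atomic-mass table and both helper names are illustrative assumptions. The formula parser handles only flat formulae such as C9H8O4 (no brackets or charges), and the formula itself would have to come from a name-to-structure tool.

```python
import re
from itertools import product

# Characters that OCR commonly confuses; extend as needed.
CONFUSABLE = {"l": "l1", "1": "1l", "O": "O0", "0": "0O"}

# Standard atomic masses for a few common elements (rounded).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904}


def ocr_variants(text, limit=64):
    """Yield up to `limit` spellings with confusable characters swapped."""
    options = [CONFUSABLE.get(ch, ch) for ch in text]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


def molecular_weight(formula):
    """Molecular weight of a flat formula such as 'C9H8O4'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def matches_document_mw(formula, document_mw, tolerance=0.5):
    """True when the computed weight agrees with the value printed in the patent."""
    return abs(molecular_weight(formula) - document_mw) <= tolerance
```

Each variant from `ocr_variants` can be tried in a name-to-structure converter, and a candidate whose weight matches the document value is more likely to be the intended structure.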

Page 34: Mining Drug Targets, Structures and Activity Data

[34]

Utopia Mark-up of Patent Introduction

Bioentity mark-up (green) via EMBL Reflect with rich call-out options

Page 35: Mining Drug Targets, Structures and Activity Data

[35]

Tips for Joining Everything up

• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~ 0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things (e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki pages, blog posts and MeSH IUPACs).
• Check PubChem "same connectivity" for tautomer forms in different CIDs.
• Check PubChem "similar" compounds for analogues even if you cannot track back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF.
• You can "walk" between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.

Page 36: Mining Drug Targets, Structures and Activity Data

[36]

Conclusions

• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity is reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• Can cherry pick examples by potency or collate whole series.
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space and SAR data than papers.
• You can "walk" between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data from patents will expedite medicinal chemistry, tropical disease research and chemical biology.

Page 37: Mining Drug Targets, Structures and Activity Data

[37]

Questions Welcome

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIN: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan