mining drug targets, structures and activity data

[1]

Mining Drug Targets, Structures and Activity Data Using Open Full-Text Patent Sources and Web Tools

Christopher Southan

ChrisDS Consulting, Göteborg, Sweden

Prepared for BioIT, Boston, April 2012

Track 11, Open Source Solutions, Wednesday, 13:45

[2]

Introduction

[3]

Key Relationships Extractable from Patents and Papers

Document Assay Result Compound Target

MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGAPLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGYYVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQRQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASVGGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQDLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKAASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLMGEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSSTGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRTAAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICALFMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK

2011 PMID 21569515

Important “bag of targets” exceptions (e.g. bacterial/parasite whole cells)

2010 doi:10.1007/978-3-642-15120-0_9

[4]

The Good News: Patent Mining Utility

• Novel bioactive chemical structures related to drug discovery exceed those in journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
• ~70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels.
• IP exploitable for Neglected Tropical Disease research becoming “open”.

[5]

The Bad News: Patent Mining Can be Tough

• High-specificity retrieval of relevant documents is difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
– Qualitative or binned assay results
– Structure-to-data links non-obvious, patchy or absent
– Less than 50% of titles include target names
– The “hiding the lead and core structures” game
– Blunderbuss disease and use exemplifications
– Tense ambiguity (i.e. “could be” vs. “was” done)
• Quality judgments are difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise
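The Kind-code redundancy mentioned above can be partly collapsed before manual triage. The following is a minimal Python sketch, assuming publication numbers follow a simple "office code, digits, optional kind code" layout. The regular expression, the function names and the sample list are illustrative assumptions, not part of any patent office API, and the US number is made up. It does not resolve patent families, whose members carry different numbers.

```python
import re
from collections import defaultdict

# Assumed layout: 2-letter office code, digits, optional kind code (e.g. A1, B2).
PUBLICATION_RE = re.compile(r"^([A-Z]{2})\s*(\d+)\s*([A-Z]\d?)?$")


def split_publication(number):
    """Return (office, serial, kind), or None if the string does not fit the assumed layout."""
    match = PUBLICATION_RE.match(number.strip().upper())
    if match is None:
        return None
    office, serial, kind = match.groups()
    return office, serial, kind or ""


def group_by_document(numbers):
    """Group publication numbers that differ only by kind code or formatting."""
    groups = defaultdict(list)
    for number in numbers:
        parts = split_publication(number)
        # Unparsable strings are kept as their own group rather than dropped.
        key = parts[:2] if parts else (number, "")
        groups[key].append(number)
    return dict(groups)


# Hypothetical hit list; the kind codes and the US number are illustrative only.
hits = ["EP2391601A1", "EP2391601B1", "ep 2391601", "US1234567B2"]
grouped = group_by_document(hits)
for key, members in grouped.items():
    print(key, members)
```

Grouping on office code plus serial number keeps one entry per document for reading, while the member list still records which publication stages were retrieved.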

[6]

Reasons for Rolling-your-own Patent Chemistry and Data Extraction

• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL, journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways complementary to commercial databases

[7]

Open Sources and Tools Overview

• Searching metadata, abstracts and text
– Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
– Open full-text: FreePatentsOnline, Google Patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up

(those below not included in this presentation but relevant)

• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al, SCRIPDB, Juristica group

(n.b. a Google search should give URLs for all these source and tool names)

[8]

So What’s in PubChem ?

[9]

PubChem Patent-derived Content ~6 million

• ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI pharmaceutical patents plus some journal extractions

• ~5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~3.5 million of these are Lipinski-ROF compliant
• ~40% of journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~70% of these are Lipinski-ROF compliant
• ~90% of these have assay data
• ~60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~2.3 million SureChem pre-2007 structures also in there (but not selectable)

[10]

Chemistry > Patents in PubChem

[11]

You found a CID, what are the Patent and Journal links?

PubChem > BioAssay/ChEMBL > CiteExplore/PubMed/ > PDB

[12]

Patent Links from SLING and IBM

[13]

PubChem > SureChem > Patent > Structure > Data > Target

[14]

Target-Centric Patent Searching

[15]

Synonym Recall

• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract “Beta secretase” = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin = 1383
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
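The synonym expansion above can be assembled programmatically when several targets need the same treatment. This is a minimal Python sketch of a query-string builder; the function name is hypothetical, and the parentheses around the OR group are an assumption about how a given portal parses Boolean operators, so check the search help of the interface you use. The hit counts are the ones quoted on this slide.

```python
def build_query(synonyms, required=None):
    """Join synonyms with OR; optionally AND one required term onto the whole group."""
    # Multi-word synonyms are quoted so they are searched as phrases.
    quoted = [f'"{s}"' if " " in s else s for s in synonyms]
    group = " OR ".join(quoted)
    if required:
        # Explicit parentheses avoid AND binding only to the last synonym
        # (assumption: the search interface accepts parentheses).
        return f"({group}) AND {required}"
    return group


bace_synonyms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
query_all = build_query(bace_synonyms)
query_inhibitors = build_query(bace_synonyms, required="inhibitors")
print(query_all)
print(query_inhibitors)

# Title + abstract hit counts per synonym, as reported on the slide.
single_term_hits = {"BACE1": 97, "BACE2": 29, "BACE": 392,
                    "Beta secretase": 1056, "memapsin": 87}
summed = sum(single_term_hits.values())
combined = 1383  # hits for the OR query on the slide
print(summed, combined, summed - combined)
```

The sum of the single-term counts exceeds the combined OR count because some documents match more than one synonym; the difference counts those repeat matches.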

[16]

Target Query > Patent Retrieval from Espacenet

[17]

Linking Examples to Data in the Patent

[18]

Extracting Chemical Structures

[19]

IUPAC-to-structure: OPSIN

Result: Example 31 structure is a 24 nM BACE1 inhibitor

Installable application

Also chemical dictionary conversions
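For more than a handful of names, the OPSIN conversion can be scripted. The sketch below is a hedged illustration in Python: the base URL and the ".smi" extension are assumptions about the public OPSIN web service and should be checked against the OPSIN site before use, and both function names are hypothetical. Only the URL construction runs here; the fetch helper needs network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the public OPSIN web service (verify before relying on it).
OPSIN_BASE = "https://opsin.ch.cam.ac.uk/opsin/"


def opsin_url(iupac_name, fmt="smi"):
    """Build the request URL for one name; fmt is the assumed output-format extension."""
    # Percent-encode everything so spaces, brackets and commas survive in the URL path.
    return f"{OPSIN_BASE}{quote(iupac_name, safe='')}.{fmt}"


def name_to_smiles(iupac_name, timeout=30):
    """Fetch SMILES for one name (network call; HTTP errors propagate to the caller)."""
    with urlopen(opsin_url(iupac_name), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


print(opsin_url("2-aminoethanol"))
print(opsin_url("acetic acid"))
```

Names that fail to convert are worth keeping in a separate list, since they are often the ones carrying OCR damage.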

[20]

Image-to-structure: OSRA

• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs

[21]

Follow-up Searching

[22]

Structure Search in PubChem

SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)

Often see stereo differences to the Derwent entry in PubChem

[23]

PubChem Similarity “Walking”

• 2D and 3D give different results
• Can do multiple steps
• Can “read” CID history
• Possible to “walk” between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
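The identity and 2D similarity searches shown in the web interface can also be run through PubChem's PUG REST service, which is not covered in this talk. The Python sketch below only builds the request URLs and offers an optional fetch helper; the function names are hypothetical. SMILES containing "/" or "\" stereo bonds may need the POST or URL-argument route described in the PUG REST documentation, which this sketch does not handle.

```python
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_url(smiles):
    """URL returning CIDs that match the query SMILES."""
    return f"{PUG_REST}/compound/smiles/{quote(smiles, safe='')}/cids/TXT"


def similarity_url(smiles, threshold=95):
    """URL returning CIDs with 2D Tanimoto similarity >= threshold (percent)."""
    return (f"{PUG_REST}/compound/fastsimilarity_2d/smiles/"
            f"{quote(smiles, safe='')}/cids/TXT?Threshold={threshold}")


def fetch_cids(url, timeout=60):
    """Fetch a TXT response and parse one CID per line (network call)."""
    with urlopen(url, timeout=timeout) as response:
        return [int(token) for token in response.read().decode("utf-8").split()]


# Ethanol is used here only as a harmless placeholder structure.
print(identity_url("CCO"))
print(similarity_url("CCO", threshold=90))
```

One step of a similarity "walk" is then a call to fetch_cids(similarity_url(...)), followed by picking a returned CID, retrieving its SMILES and repeating.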

[24]

Direct Patent <> Chemistry

[25]

SureChemOpen: Patent Retrieval

• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not bulk export)

[26]

SureChemOpen, WIPO, OPSIN and PubChem

Result: 1 nM (?) BACE2 inhibitor with assay and synthesis details.

[27]

SureChemOpen: Structure > Patent

Direct answers to: “which patents contain compounds similar to my query” and “show me all the compounds in these patents”

[28]

Non-target Activity Data and Bulk Chemistry Extraction

[29]

Malaria Query: CiteExplore > WIPO

Example 60, sub-200 nM potency, with solubility and clearance data

[30]

Espacenet EP2391601 > ChemAxon Chemicalize.org

• Description URL from Espacenet pasted into Chemicalize.org

• Most of 74 examples converted

• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no exact match

• Claims section was Markush description so no relevant structures converted
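The "95% Tanimoto" cut-off used above is a ratio of shared to total fingerprint bits. As a minimal sketch of the arithmetic in Python: the bit sets below are toy data invented for illustration, whereas PubChem computes the score on its own substructure fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits in either set."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Toy fingerprints (hypothetical): 40 bits set in the query; the analogue
# shares 39 of them and has one bit of its own.
query_bits = set(range(40))
analogue_bits = set(range(1, 40)) | {99}

score = tanimoto(query_bits, analogue_bits)
print(f"{score:.3f}", score >= 0.95)
```

A near-identical analogue can therefore pass the 95% threshold while still differing from the query by a substituent, which is consistent with finding analogues but no exact match.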

[31]

EP2391601 > Chemicalize > PubChem

• EP2391601 description text > Chemicalize SDF download > PubChem Structure Search upload = 311 structures

• Of these 206 have PubChem exact matches
• Of these 176 have Thomson Pharma matches
• The example cluster from the Thomson/Derwent extraction is ~15
• The example cluster from Chemicalize is ~90
• Ipso facto Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin

Chemicalize Similarity listing PubChem Tanimoto sub-cluster

[32]

Tips and Tricks

[33]

Tables and Recalcitrant IUPACs

PDF

Find tables

Snip image

Online OCR

Word Pad

Chemicalize

OPSIN

OSRA

• iterative fixing of OCR errors (e.g. 1 vs l)

• cross-check Mw in the document
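Both tips above can be semi-automated. The Python sketch below generates candidate spellings of an OCR-damaged name by swapping commonly confused characters, and checks a molecular formula against the molecular weight printed in the document. The confusion table, the abbreviated atomic-mass table, the tolerance and all function names are illustrative assumptions. The formula itself would have to come from whichever converter accepted the name.

```python
import re
from itertools import product

# Assumed OCR confusion pairs; the original character is always tried first.
CONFUSABLE = {"1": "1l", "l": "l1", "0": "0O", "O": "O0"}

# Abbreviated table of standard atomic masses; extend as needed.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904}


def ocr_variants(text, limit=64):
    """Return up to `limit` spellings of text with confusable characters swapped."""
    choices = [CONFUSABLE.get(ch, ch) for ch in text]
    variants = []
    for combo in product(*choices):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


def molecular_weight(formula):
    """Average molecular weight for a simple formula such as C2H7NO (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def mw_matches(formula, reported_mw, tolerance=0.5):
    """True if the computed weight is within `tolerance` of the value in the document."""
    return abs(molecular_weight(formula) - reported_mw) <= tolerance


print(ocr_variants("2-methy1"))
print(round(molecular_weight("C2H7NO"), 3), mw_matches("C2H7NO", 61.1))
```

Each variant can be passed to a name-to-structure converter in turn, keeping the first one whose structure agrees with the reported Mw. Names with many "l" or "1" characters generate far more variants than the limit allows, so manual inspection is still needed for those.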

[34]

Utopia Mark-up of Patent Introduction

Bioentity mark-up (green) via EMBL Reflect with rich call-out options

[35]

Tips for Joining Everything up

• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things (e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki pages, blog posts and MeSH IUPACs).
• Check PubChem “same connectivity” for tautomer forms in different CIDs.
• Check PubChem “similar” compounds for analogues even if you cannot track back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF.
• You can “walk” between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.

[36]

Conclusions

• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity is reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• Can cherry-pick examples by potency or collate whole series.
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space and SAR data than papers.
• You can “walk” between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data from patents will expedite medicinal chemistry, tropical disease research and chemical biology.

[37]

Questions Welcome

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIn: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan

