Post on 25-Dec-2021

Extract Chemical Information from Patents

Using Chemicalize and D2S

(Document to Structure)

Wei Deng (David), Daniel Bonniot, Andras Stracz

International Conference and Exhibition on

Computer Aided Drug Design & QSAR

Oct 30th, 2012

Chicago, IL, USA

ChemAxon’s Naming Technology

● Structure to Name

● Name to Structure

● Document to Structure

● Document to Database

● Chemicalize

ChemAxon’s Naming Technology

• Structure to Name

– IUPAC Name, traditional names

– Reaching maturity

– Still upcoming: peptides, some fused systems

• Name to structure

– IUPAC, CAS and systematic names

– Dictionary of common names and drug names

– Supports CAS Registry Numbers (via web service)

– Homology groups: alkyl, aryl, …

– Future: Biological names, polymers, ...

• Accuracy and coverage constantly improving

• Also available from the command line

Name to Structure Internals

• Dictionary of common and drug names

– Uses nine different source dictionaries

– Harmonized using a weighted consensus method

– 150K names for 40K unique structures

– Custom dictionary and plug-in system

• Systematic names

– Proprietary algorithm

– “Understands” systematic names

– Example:

(2R)-2-methylsulfanyl-3-hydroxybutanedioate
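The weighted consensus idea used to harmonize the source dictionaries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the slides do not describe the actual method, so the source weights, the `weighted_consensus` helper, and the use of SMILES strings as structure identifiers are all hypothetical.

```python
from collections import defaultdict


def weighted_consensus(name, sources):
    """Return the structure with the highest total source weight for `name`.

    `sources` is a list of (weight, mapping) pairs; each mapping goes from a
    lower-case name to a structure identifier (here a SMILES string).
    Hypothetical helper, not part of any ChemAxon API.
    """
    votes = defaultdict(float)
    for weight, mapping in sources:
        structure = mapping.get(name.lower())
        if structure is not None:
            votes[structure] += weight  # each source votes with its weight
    if not votes:
        return None  # no source knows this name
    return max(votes, key=votes.get)


ASPIRIN = "CC(=O)Oc1ccccc1C(O)=O"
SALICYLIC_ACID = "OC(=O)c1ccccc1O"

# Three made-up source dictionaries with made-up weights.
sources = [
    (1.0, {"aspirin": ASPIRIN}),
    (0.8, {"aspirin": ASPIRIN}),
    (1.5, {"aspirin": SALICYLIC_ACID}),  # a heavier source with a wrong entry
]

print(weighted_consensus("Aspirin", sources))
```

Two agreeing sources (total weight 1.8) outvote the single heavier but conflicting one (1.5). A real implementation would also need to compare structures in a canonical form, which this sketch skips.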

Systematic Name Example

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Structure Generation

(2R)-2-methylsulfanyl-3-hydroxybutanedioate
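The parsing stage for this example name can be sketched as a small tokenizer. This is only an illustration of the general idea: the actual algorithm is proprietary, and the `ROLES` lexicon, the `tokenize` function, and the role labels below are hypothetical and cover just this one name.

```python
import re

# Hypothetical mini-lexicon: just enough morphemes for the example name.
ROLES = {
    "methyl": "substituent",
    "sulfanyl": "substituent",
    "hydroxy": "substituent",
    "but": "parent chain (4 carbons)",
    "ane": "saturated chain",
    "di": "multiplier (2)",
    "oate": "suffix (carboxylate)",
}

TOKEN_RE = re.compile(
    r"\((?P<stereo>\d+[RS])\)"          # stereo descriptor, e.g. (2R)
    r"|(?P<locant>\d+)"                 # locant, e.g. 2 or 3
    r"|(?P<morpheme>" + "|".join(ROLES) + r")"
    r"|(?P<sep>-)"                      # hyphen separator
)


def tokenize(name):
    """Split a systematic name into (text, role) tokens, left to right."""
    tokens, pos = [], 0
    while pos < len(name):
        m = TOKEN_RE.match(name, pos)
        if m is None:
            raise ValueError(f"cannot parse {name!r} at position {pos}")
        kind = m.lastgroup
        text = m.group(kind)
        if kind == "morpheme":
            tokens.append((text, ROLES[text]))
        elif kind != "sep":             # hyphens carry no meaning here
            tokens.append((text, kind))
        pos = m.end()
    return tokens


for token in tokenize("(2R)-2-methylsulfanyl-3-hydroxybutanedioate"):
    print(token)
```

A structure generator would then walk these tokens: build the four-carbon parent chain, attach the substituents at locants 2 and 3, apply the dioate suffix, and finally set the stereocenter at position 2.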

OCR Error Correction

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

→ (2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide

→ ?-benzyl-?-[3-(?H-tetrazol-5-yl)phenyl]propanamide

→ N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide
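One way to approach this kind of correction is to expand every visually confusable character into its alternatives and keep only the candidates that a name validator accepts. The sketch below applies that idea to the first garbled name above. It is not the method used in the product: the `CONFUSIONS` table, the toy `looks_valid` check, and the `correct_ocr` function are all assumptions made for illustration.

```python
import re
from itertools import product

# Hypothetical table of glyphs that OCR commonly confuses.
CONFUSIONS = {
    "rn": ("rn", "m"),
    "1": ("1", "l"),
    "0": ("0", "o"),
    "l": ("l", "1"),
}
SPLIT_RE = re.compile("(" + "|".join(map(re.escape, CONFUSIONS)) + ")")

# Toy validator: each hyphen-separated part must be a stereo descriptor,
# a locant, or a run of known morphemes.
STEREO_RE = re.compile(r"\(\d+[RS]\)")
LOCANT_RE = re.compile(r"\d+")
MORPHEMES_RE = re.compile(r"(?:methyl|sulfanyl|hydroxy|butane|di|oate)+")


def looks_valid(name):
    return all(
        STEREO_RE.fullmatch(part)
        or LOCANT_RE.fullmatch(part)
        or MORPHEMES_RE.fullmatch(part)
        for part in name.split("-")
    )


def correct_ocr(text):
    """Return every confusion-expanded variant of `text` that validates."""
    pieces = SPLIT_RE.split(text)  # confusable glyphs become their own pieces
    options = [CONFUSIONS.get(piece, (piece,)) for piece in pieces]
    candidates = ("".join(choice) for choice in product(*options))
    return [c for c in candidates if looks_valid(c)]


print(correct_ocr("(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate"))
```

The number of candidates grows exponentially with the number of confusable glyphs, so a practical system would prune the search, for example by validating fragment by fragment rather than whole names.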

ChemAxon’s “Document to Structure”

• Extract chemical information from documents

– Names: powered by the Naming Technology

– Also imports SMILES, InChI, CAS numbers, …

– Images: OSRA …

– Works with scanned, non-searchable PDFs

– Returns structures and their location in the document

• Supported formats:

– MS Office documents: doc, docx, ppt, pptx, xls, xlsx, odt …

– Embedded structure objects (ChemDraw, Symyx, Marvin, …)

– PDF, text, XML, HTML
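The "structures and their location" output described above can be pictured with a minimal sketch that scans plain text for dictionary names and records character offsets. It does not use the Document to Structure API: the `DICTIONARY`, the `Hit` record, and `extract_structures` are hypothetical, and a real extractor would also handle systematic names, images, and page coordinates.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-dictionary mapping common names to SMILES.
DICTIONARY = {
    "aspirin": "CC(=O)Oc1ccccc1C(O)=O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

NAME_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, DICTIONARY)) + r")\b", re.IGNORECASE
)


@dataclass
class Hit:
    text: str        # the name as it appears in the document
    structure: str   # the structure it resolves to (SMILES)
    start: int       # character offset where the name starts
    end: int         # character offset just past the name


def extract_structures(document):
    """Find dictionary names in `document` and report where they occur."""
    return [
        Hit(m.group(0), DICTIONARY[m.group(0).lower()], m.start(), m.end())
        for m in NAME_RE.finditer(document)
    ]


doc = "Tablets containing Aspirin and caffeine were prepared."
for hit in extract_structures(doc):
    print(hit)
```

Keeping the offsets alongside each structure is what later makes it possible to highlight hits in the source document.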

From Document to Structures

Non-searchable patent (50 pages) → Structure (text + image) + location

Instant JChem Example

• In Instant JChem, search for

Search by Structure or Text

Non-searchable PDF is now Searchable

“Document to Database”

ChemAxon’s “Document to Database”

• Data in DB:

– Structures

– Source (name, SMILES, embedded, …) and location

– Documents, Authors, Metadata...

• Questions:

– What structures appear in a specific document?

– What documents contain a structure/substructure/...?

– What documents written since 2010 in location X contain substructure S?

...
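The questions above map naturally onto a small relational schema. The following sketch uses SQLite with hypothetical table and column names; it is not the Document to Database schema. Exact SMILES matching stands in for substructure search, which would need a chemistry-aware database layer, and "location" is assumed here to be a document metadata field.

```python
import sqlite3

ASPIRIN = "CC(=O)Oc1ccccc1C(O)=O"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents  (id INTEGER PRIMARY KEY, title TEXT,
                         year INTEGER, location TEXT);
CREATE TABLE structures (id INTEGER PRIMARY KEY, smiles TEXT UNIQUE);
CREATE TABLE occurrences (
    document_id  INTEGER REFERENCES documents(id),
    structure_id INTEGER REFERENCES structures(id),
    source       TEXT,      -- name, smiles, embedded, ...
    page         INTEGER    -- where in the document it was found
);
""")
# Made-up sample data.
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)",
                 [(1, "Patent A", 2009, "US"), (2, "Patent B", 2011, "US")])
conn.executemany("INSERT INTO structures VALUES (?, ?)",
                 [(1, ASPIRIN), (2, "CCO")])
conn.executemany("INSERT INTO occurrences VALUES (?, ?, ?, ?)",
                 [(1, 1, "name", 3), (2, 1, "smiles", 7), (2, 2, "name", 12)])


def structures_in(document_id):
    """What structures appear in a specific document?"""
    rows = conn.execute(
        "SELECT DISTINCT s.smiles FROM structures s "
        "JOIN occurrences o ON o.structure_id = s.id "
        "WHERE o.document_id = ? ORDER BY s.smiles", (document_id,))
    return [row[0] for row in rows]


def documents_containing(smiles, since=None, location=None):
    """What documents contain a structure, optionally filtered by metadata?"""
    query = ("SELECT DISTINCT d.title FROM documents d "
             "JOIN occurrences o ON o.document_id = d.id "
             "JOIN structures s ON s.id = o.structure_id "
             "WHERE s.smiles = ?")
    params = [smiles]
    if since is not None:
        query += " AND d.year >= ?"
        params.append(since)
    if location is not None:
        query += " AND d.location = ?"
        params.append(location)
    query += " ORDER BY d.title"
    return [row[0] for row in conn.execute(query, params)]


print(structures_in(2))
print(documents_containing(ASPIRIN, since=2010, location="US"))
```

Storing occurrences separately from structures keeps each unique structure once while still recording every place it was found.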

ChemAxon’s “Document to Database”

• Customizable:

– Metadata extracted from documents

– Interface (IJC forms, webapp)

• Demo:

– One month of US patents

– 85K unique structures from systematic names

– 1M occurrences

Summary

● Extensive, improving naming technologies (n2s, s2n)

● Increasing support for document mining (d2s, d2db, SharePoint)

● Putting it all together → going large scale: extract and use valuable chemical information

● Still component-based and responsive to our users' requests

Free Online Service Chemicalize.org

• Extract

• Interactively display

• Calculate

• Search

Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–615

Webpage - Chemicalized

• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit

Data Page: Extensive Predicted Properties

Webpage - Chemicalized

PDF File - Chemicalized

Aspirin: query highlighted in results

Searching Chemicalize.org – Structure Search

• Aspirin; web page hits - “show” related structures

• Autosuggest while typing

Searching Chemicalize.org – Keyword Search

Availability and Customization

• Source code available

• Only minor changes to the example code are required for customization, such as

– Import extracted structures to other databases

– Post-process filtering according to properties

– Batch process of multiple documents
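The batch-processing and property-filtering customizations listed above could look roughly like the sketch below. It does not use the shipped example code: `process_batch`, `toy_extract`, `heavy_enough`, and the `PROPERTIES` table are hypothetical stand-ins for a real extractor and a real property calculator.

```python
import re

# Hypothetical property table; a real pipeline would compute these values.
PROPERTIES = {
    "aspirin": {"mol_weight": 180.16},
    "ethanol": {"mol_weight": 46.07},
}


def toy_extract(text):
    """Stand-in extractor: return known compound names found in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word in PROPERTIES]


def heavy_enough(name, min_weight=100.0):
    """Post-process filter: keep compounds at or above a molecular weight."""
    return PROPERTIES[name]["mol_weight"] >= min_weight


def process_batch(documents, extract, keep):
    """Run `extract` over every document and keep hits that pass `keep`."""
    return {
        doc_id: [hit for hit in extract(text) if keep(hit)]
        for doc_id, text in documents.items()
    }


documents = {
    "patent_1.txt": "Aspirin dissolved in ethanol.",
    "patent_2.txt": "Ethanol was removed under vacuum.",
}
results = process_batch(documents, toy_extract, heavy_enough)
print(results)
```

Passing the extractor and the filter in as functions keeps the batch loop unchanged when either one is swapped for a real implementation or for a database import step.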
