
Post on 25-Dec-2021


Page 1: New Software Developments on Chemical Information Extraction

New Software Developments on

Chemical Information Extraction

Wei Deng (David)


ChemAxon’s Naming Technology

• Name to structure

– IUPAC, traditional and common names

– A common name library of existing drugs

– Support CAS Registry number

– Homology group: alkyl, aryl …

– Future: Biological names (PDB code, EC # …)

• Structure to Name

– IUPAC Name, traditional names, common names

– Support other structure features

• Isotopes, pseudo-asymmetric stereocenters …

• Accuracy and coverage constantly improving

• Also available from command-line
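As a small illustration of the identifier support listed above, CAS Registry Numbers carry a check digit, so candidate strings found in text can be validated before any lookup. The sketch below is a stand-alone Python example, not ChemAxon's implementation; the function names `is_valid_cas` and `find_cas_numbers` are hypothetical and chosen only for this illustration.

```python
import re
from typing import List, Tuple

# CAS Registry Numbers have the form NNNNNNN-NN-C: 2 to 7 digits,
# 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def find_cas_numbers(text: str) -> List[Tuple[str, int]]:
    """Return (CAS number, character offset) for every valid CAS number in the text."""
    return [
        (m.group(0), m.start())
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]
```

For example, `find_cas_numbers("Aspirin (50-78-2), not 50-78-3.")` keeps only the first candidate, because the second fails the check-digit test.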


ChemAxon’s “Document to Structure”

• Extract chemical information from documents

– Names: powered by the Naming Technology

– Also import SMILES, InChI, CAS number …

– Images: OSRA

– Returns structures and their locations in the document

• Works with scanned PDF since 5.8 (Feb 2012)

– Great for patent mining

• OCR and syntax correction under continuous development

– OCR input: 3-rnethyl-l-me- thoxynaphthalene

– Corrected name: 3-methyl-1-methoxynaphthalene
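The OCR example on this slide (3-rnethyl-l-me- thoxynaphthalene read back as 3-methyl-1-methoxynaphthalene) combines three typical scan errors: "rn" misread for "m", a lowercase "l" misread for the locant "1", and a word split by a line-break hyphen. The Python sketch below mimics those three repairs with simple regular expressions. It is a toy illustration under those assumptions, not the correction algorithm used by Document to Structure, and the function name `correct_ocr_name` is hypothetical.

```python
import re


def correct_ocr_name(raw: str) -> str:
    """Apply three heuristic OCR repairs to a chemical name (toy rules only)."""
    # 1. Rejoin a word that was split by a line-break hyphen ("me- thoxy").
    name = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", raw)
    # 2. "rn" is a common misreading of "m"; only repair it before "eth"
    #    so that names legitimately containing "rn" are left alone.
    name = re.sub(r"rn(?=eth)", "m", name)
    # 3. A lone "l" or "I" between hyphens (or at the start) is almost
    #    certainly the locant "1".
    name = re.sub(r"(?:(?<=-)|^)[lI](?=-)", "1", name)
    return name


print(correct_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
# prints: 3-methyl-1-methoxynaphthalene
```

A production system would need far broader rules and would typically confirm each repair by checking that the corrected name converts to a valid structure.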


ChemAxon’s “Document to Structure”

• New Features in 5.9 (Mar 2012)

– MS Office documents: doc, docx, ppt, pptx, xls, xlsx, odt …

– Embedded structure objects (ChemDraw, Symyx, Marvin …)

– Progressive display of results

– Speed improvements

– Instant JChem integration; simplified API

• Currently in development for 5.10 (May 2012)

– OSRA “Confidence”

– Fragment groups integration with Markush generation

– Collaboration with Linguamatics

– IJC (OSRA, Location)


From Document to Structures

Input: non-searchable patent (50 pages). Output: structures (from text and images) with their locations.


Search by Structure or Text


Non-searchable PDF is now Searchable


Free Online Service Chemicalize.org

• Extract chemical information from web pages and PDF documents

• Interactively display all structures and their predicted properties

• Search all structures extracted

• Gather links of interest to chemists for post-processing (search, analysis, reporting, fun…)

• Recently reviewed in the Journal of Chemical Information and Modeling


Webpage - chemicalized

• All chemical names are highlighted with a dotted underline

• Hovering over a name pops up the structure image

• Clicking on the image leads to the data page

• Links are “respected”


Data Page: Extensive Predicted Properties

• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit


Webpage - chemicalized

• All structures are summarized above the chemicalized page

• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence

• All structures can be downloaded as MRV or SDF


PDF File - chemicalized


Searching Chemicalize.org – Structure Search

Aspirin: query highlighted in results


Searching Chemicalize.org – Keyword Search

• Aspirin; web page hits - “show” related structures

• Autosuggest while typing


Everything is Published

• Recently viewed

– Webpages

– Structures

– Documents

– Searched queries (structure and keyword)


Availability and Customization

• Source code available

• Only minor changes to the example code are required for customization, such as

– Importing extracted structures into other databases

– Post-process filtering according to properties

– Batch processing of multiple documents
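The customizations listed above (batch processing, property-based filtering, handing results to another database) can be sketched as a small pipeline. This Python outline is an assumption-laden illustration only: the `Hit` record, `process_batch`, and `fake_extract` are hypothetical names, and `fake_extract` returns canned data in place of a real extraction call, which would come from the example code the slide refers to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    """One extracted structure (hypothetical record layout)."""
    document: str
    name: str
    smiles: str
    mol_weight: float


def process_batch(
    documents: Iterable[str],
    extract: Callable[[str], List[Hit]],
    keep: Callable[[Hit], bool],
) -> List[Hit]:
    """Extract hits from every document, filter them, and drop repeated structures."""
    seen = set()
    selected = []
    for doc in documents:
        for hit in extract(doc):
            # Naive de-duplication on the SMILES string; a real pipeline
            # would compare canonical structures instead.
            if keep(hit) and hit.smiles not in seen:
                seen.add(hit.smiles)
                selected.append(hit)
    return selected


# Canned stand-in for a real extractor, used only to make the sketch runnable.
_CANNED: Dict[str, List[Hit]] = {
    "a.pdf": [
        Hit("a.pdf", "aspirin", "CC(=O)Oc1ccccc1C(=O)O", 180.16),
        Hit("a.pdf", "caffeine", "Cn1cnc2c1c(=O)n(C)c(=O)n2C", 194.19),
    ],
    "b.pdf": [Hit("b.pdf", "aspirin", "CC(=O)Oc1ccccc1C(=O)O", 180.16)],
}


def fake_extract(document: str) -> List[Hit]:
    return _CANNED.get(document, [])


if __name__ == "__main__":
    light = process_batch(["a.pdf", "b.pdf"], fake_extract, lambda h: h.mol_weight <= 190)
    for hit in light:
        print(hit.document, hit.name, hit.smiles)
```

The returned list is the natural hand-off point for a database import step; keeping extraction, filtering, and storage as separate callables means each can be swapped without touching the others.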


Hunting for Hidden Treasures

• A CINF Symposium regarding “chemical information in patents and other documents”

• ACS meeting in Philadelphia, August 19-23, 2012.

• Current speakers from

– Content providers

– Software providers

– Pharmaceutical researchers
