New Software Developments on
Chemical Information Extraction
Wei Deng (David)
ChemAxon’s Naming Technology
• Name to structure
– IUPAC, traditional and common names
– A common name library of existing drugs
– Support CAS Registry number
– Homology group: alkyl, aryl …
– Future: Biological names (PDB code, EC # …)
• Structure to Name
– IUPAC Name, traditional names, common names
– Support other structure features: isotopes, pseudo-asymmetric stereocenters …
• Accuracy and coverage constantly improving
• Also available from the command line
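CAS Registry Numbers carry a check digit, so a candidate can be validated before any lookup. The sketch below shows that public check-digit rule and a simple scan that also reports character offsets. It is not ChemAxon's implementation; the names `is_valid_cas` and `find_cas_numbers` are made up for illustration.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left and sum them;
    # the sum modulo 10 must equal the check digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def find_cas_numbers(text: str) -> list[tuple[str, int]]:
    """Return (CAS number, character offset) pairs for valid matches in text."""
    return [
        (m.group(0), m.start())
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]
```

For example, `is_valid_cas("50-78-2")` accepts the CAS number of aspirin, while a string with the same layout but a wrong last digit is rejected.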
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
– Names: powered by the Naming Technology
– Also import SMILES, InChI, CAS number …
– Images: OSRA
– Returns structures and their locations in the document
• Works with scanned PDF since 5.8 (Feb 2012)
– Great for patent mining
• OCR and syntax correction under continuous development
– OCR output: 3-rnethyl-l-me- thoxynaphthalene
– Corrected: 3-methyl-1-methoxynaphthalene
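The kind of repair shown in the example above can be sketched with a few string rules. This is only an illustration of the idea, not the correction logic used in Document to Structure; the function name `correct_ocr_name` and the `FRAGMENT_FIXES` table are hypothetical and cover just this one example.

```python
import re

# Hypothetical table of OCR confusions inside known name fragments
# ("rn" misread for "m"). A real system would need a far larger list.
FRAGMENT_FIXES = {
    "rnethyl": "methyl",
    "rnethoxy": "methoxy",
}


def correct_ocr_name(raw: str) -> str:
    """Apply three simple repairs to an OCR-damaged chemical name."""
    # 1. Rejoin a word split by a hyphen plus whitespace at a line break.
    name = re.sub(r"-\s+", "", raw)
    # 2. Repair known fragment confusions.
    for wrong, right in FRAGMENT_FIXES.items():
        name = name.replace(wrong, right)
    # 3. A lone lowercase "l" directly before a hyphen, with no letter or
    #    digit in front of it, is treated as the locant "1".
    name = re.sub(r"(?<![A-Za-z0-9])l(?=-)", "1", name)
    return name


print(correct_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
# prints 3-methyl-1-methoxynaphthalene
```

These rules are deliberately naive: rule 1 would also remove a genuine hyphen that happens to fall at a line break, and rule 3 would corrupt a lowercase "l-" stereo descriptor, so a practical corrector has to check each candidate against a name parser.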
ChemAxon’s “Document to Structure”
• New Features in 5.9 (Mar 2012)
– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin …)
– Progressive display of results
– Speed improvement
– Instant JChem integration; simplified API
• Currently in development for 5.10 (May 2012)
– OSRA “Confidence”
– Fragment groups integration with Markush generation
– Collaboration with Linguamatics
– IJC (OSRA, Location)
From Document to Structures
Non-searchable patent (50 pages) → Structures (text + image) + locations
Search by Structure or Text
Non-searchable PDF is now Searchable
Free Online Service Chemicalize.org
• Extract chemical information from web pages and PDF documents
• Interactively display all structures and their predicted properties
• Search all structures extracted
• Gather links of interest to chemists for post-processing (search, analysis, reporting, fun…)
• Recently reviewed in the Journal of Chemical Information and Modeling
Webpage - chemicalized
• All chemical names are highlighted with a dotted line
• Mousing over a name pops up the structure image
• Clicking on the image leads to the data page
• Links are “respected”
Data Page: Extensive Predicted Properties
• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit
Webpage - chemicalized
• All structures are summarized above the chemicalized page
• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence
• All structures can be downloaded as MRV or SDF
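A downloaded SDF file can be read without any chemistry toolkit when only the record titles and data fields are needed. The sketch below relies on two features of the SDF format: each record ends with a `$$$$` line, and data items start with a `> <FIELD>` header followed by value lines and a blank line. Which data fields a Chemicalize.org download actually contains is not stated here, so the parser makes no assumption about field names; `parse_sdf` is an illustrative name.

```python
def _parse_record(lines: list[str]) -> dict[str, str]:
    """Collect the title line and all data fields of one SDF record."""
    record = {"title": lines[0].strip()}
    field_name = None
    for line in lines[1:]:
        if line.startswith(">"):
            # Data header: the field name sits between angle brackets.
            start, end = line.find("<"), line.rfind(">")
            field_name = line[start + 1:end] if 0 < start < end else None
            if field_name is not None:
                record[field_name] = ""
        elif field_name is not None:
            if not line.strip():
                field_name = None  # a blank line closes the data item
            elif record[field_name]:
                record[field_name] += "\n" + line
            else:
                record[field_name] = line
    return record


def parse_sdf(text: str) -> list[dict[str, str]]:
    """Split SDF text into records; each record must end with a $$$$ line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    return records
```

The atom and bond blocks are skipped on purpose; anything that needs the actual structure should go through a cheminformatics library instead.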
PDF File - chemicalized
Aspirin: query highlighted in results
Searching Chemicalize.org – Structure Search
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
Searching Chemicalize.org – Keyword Search
Everything is Published
• Recently viewed
– Webpages
– Structures
– Documents
– Searched queries (structure and keyword)
Availability and Customization
• Source code available
• Only minor changes to the example code are required for customization, such as
– Import extracted structures to other databases
– Post-process filtering according to properties
– Batch processing of multiple documents
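As an illustration of the post-process filtering mentioned above, the sketch below keeps only those extracted structures whose properties fall inside given ranges. It does not use the Chemicalize.org source code or any ChemAxon API; the `ExtractedStructure` record, its fields, and `filter_by_properties` are hypothetical stand-ins for whatever the extraction step returns.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedStructure:
    """Hypothetical record for one structure found in a document."""
    smiles: str
    source_document: str
    properties: dict[str, float] = field(default_factory=dict)


def filter_by_properties(
    structures: list[ExtractedStructure],
    limits: dict[str, tuple[float, float]],
) -> list[ExtractedStructure]:
    """Keep structures whose properties lie inside every (low, high) range.

    A structure that lacks one of the listed properties is dropped.
    """
    kept = []
    for structure in structures:
        inside = all(
            name in structure.properties
            and low <= structure.properties[name] <= high
            for name, (low, high) in limits.items()
        )
        if inside:
            kept.append(structure)
    return kept
```

Batch processing then amounts to running the extraction over a list of documents, collecting the records into one list, and applying the same filter once.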
Hunting for Hidden Treasures
• A CINF Symposium regarding “chemical information in patents and other documents”
• ACS meeting in Philadelphia, August 19-23, 2012.
• Current speakers from
– Content providers
– Software providers
– Pharmaceutical researchers