Implementing ISO 11238 Standard Compliance with ChemAxon Tools
Roger Sayle
NextMove Software, Cambridge, UK
ChemAxon User Group Meeting 2014, Budapest, Hungary, Wednesday 21st May 2014
What is ISO 11238?
• ISO standard 11238 is entitled “Health Informatics – Identification of medicinal products – Data elements and structures for the unique identification and exchange of regulated information on substances”.
• Defines a framework for uniquely identifying and exchanging compounds of pharmaceutical interest.
• The framework serves a similar role to CAS registry numbers, PubChem CIDs or InChIKeys, assigning unique identifiers to substances.
Meet the (IDMP) family
• 11238 is one of a suite of 5 related standards, all for “unique identification and exchange of …”
– 11238 “… regulated information on substances”.
– 11239 “… dose forms, units, administration, etc.”.
– 11240 “… units of measurement”.
– 11615 “… regulated medicinal product information”.
– 11616 “… regulated pharmaceutical product information”.
Why is 11238 important?
• EU regulation 520/2012 on “pharmacovigilance” requires countries, regulatory authorities and pharma to adopt the 5 IDMP standards (articles 25 and 26) by 1st July 2016 (article 40).
• Executive summary: It’s the law!
How it works
Code Assignment (Authority): Name/Identifier, Connection Table, Properties (Significant Text) → Unique Code
Code Look-up (Authority): Unique Code → Name/Identifier, Connection Table, Properties (Significant Text)
Likely implementation
Code Assignment (Authority): Name/Identifier, Connection Table, Properties (Significant Text) → Unique Code
Code Look-up (Authority): Unique Code → Name/Identifier, Connection Table, Properties (Significant Text)
Diagram labels: FDA UNII; FDA SRS Search FDA UNII; XML; INN/USAN/CID; FDA/NCATS GInAS; MOL2000/SMILES/InChI Protein/NA Sequence; ISO11238 Groups 1-4
Current status
• The standard has been ratified and its use has been written into EU law (EU Reg. 520/2012).
• The framework requires the use of non-semantic, random, fixed-length unique identifiers that include an internal integrity check.
• The standard also details constraints on uniqueness.
• Exact implementation details yet to be determined (to appear in a future “Implementation Guide”).
What will the future look like?
• ISO11238 compliant identifiers will be very similar to the FDA’s UNII (UNique Ingredient Identifier).
• The fixed-width, non-semantic identifier requirement rules out the use of plain SMILES, InChI, V2000 mol files and similar encodings.
• The random requirement rules out plain CAS registry numbers, PubChem CIDs and ChEMBL IDs (which use sequential or monotonic number assignment).
• Alternatively, InChI keys or similar hashes (with [CRC] checks) of connection tables+text may be possible.
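As a rough illustration of the last bullet, the sketch below hashes a connection table plus its significant text into a fixed-length code with a trailing check character. Everything here is an assumption made for illustration: the class name HashedIdentifier, the 9+1 character layout, the base-32 alphabet and the choice of SHA-256 and CRC32 are not taken from ISO 11238 or from the FDA's UNII scheme, and a real coding authority would also have to resolve hash collisions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical sketch: fixed-length, non-sequential code with a check character.
final class HashedIdentifier {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"; // 32 symbols
    private static final int BODY_LENGTH = 9; // assumed length, not from the standard

    // Hash the connection table and significant text into a 10-character code.
    static String idFor(String connectionTable, String significantText) {
        String record = connectionTable + "\n" + significantText;
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256")
                    .digest(record.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is a required JDK algorithm
        }
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < BODY_LENGTH; i++) {
            body.append(ALPHABET.charAt(digest[i] & 31)); // low 5 bits of each byte
        }
        return body.toString() + checkChar(body.toString());
    }

    // Check character derived from a CRC32 of the body.
    static char checkChar(String body) {
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.US_ASCII));
        return ALPHABET.charAt((int) (crc.getValue() & 31));
    }

    // True when the length is right and the check character matches the body.
    static boolean isWellFormed(String id) {
        return id.length() == BODY_LENGTH + 1
                && checkChar(id.substring(0, BODY_LENGTH)) == id.charAt(BODY_LENGTH);
    }
}
```

The same inputs always give the same code, so look-up and assignment agree, but the code reveals nothing about the structure. A single check character catches many, not all, transcription errors.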
What’s available now
• ISO charges for access to official standards documents (which is why five IDMP standards are more profitable than one): about 158 CHF ($177 USD) from ISO for 11238 [between $120 and $340 online].
• However, as with many ISO standards, late drafts of ISO 11238 are freely available on the internet.
• Caution: Many of the technical examples (all XML) were removed from the final standard and are due to appear in the upcoming “Implementation Guide(s)”.
Example requirement
• §3.4 “Naming of substances” states “at least one substance name or company code shall be associated with each substance”.
• For the envisioned workflows this typically assumes that an INN or USAN name has already been assigned.
• One way to guarantee the existence of a suitable substance name for investigational compounds is to use IUPAC naming software (such as ChemAxon’s) during submission to the unique coding authority.
• Plug: ChemAxon’s structure-to-name (s2n) coverage is state-of-the-art.
The devil is in the details
• One of the interesting cheminformatics challenges in working with the published ISO standard, and with the examples from the draft annex, is the typography.
• The document has been typeset by editors with expertise outside the field of cheminformatics who have inadvertently changed whitespace without appreciating the impact this has on chemistry tools.
Final ISO11238 standard Annex A
• §A.2.3 SMILES uses the example “C1 = CC = CC = C1” where the spurious spaces create problems for SMILES readers.
• §A.2.4 InChI both strips the “InChI=” prefix and again suffers from spaces “1/C6H6 /c1-2-4-6-5-3-1/h1-6H”.
– Interestingly this is an old InChI not a standard InChI.
• §A.2.2 Molfile fails to mention that V2000 mol files use fixed-width columns and blank lines; as a result, the example given in the text (next slide) can’t easily be read.
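For the two line notations, the damage described above is simple enough to undo mechanically. A minimal sketch, assuming the string holds only the SMILES or InChI (no trailing title or other fields); the class and method names are hypothetical:

```java
// Hypothetical helper for the Annex A SMILES and InChI examples.
final class AnnexAStrings {
    // Remove every whitespace character from the string.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // A SMILES string itself contains no whitespace, so stripping is safe
    // as long as no title follows the SMILES.
    static String repairSmiles(String s) {
        return stripWhitespace(s);
    }

    // Strip whitespace and restore the "InChI=" prefix if it was dropped.
    static String repairInchi(String s) {
        String t = stripWhitespace(s);
        return t.startsWith("InChI=") ? t : "InChI=" + t;
    }
}
```

Applied to the §A.2.3 example this gives “C1=CC=CC=C1”, and the §A.2.4 example becomes “InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H”. This does not turn the old-style InChI into a standard one.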
Annex A: example.mol
ACD/Labs0812062058
6 6 0 0 0 0 0 0 0 0 1 V2000
1.9050 −0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9050 −2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 −0.1282 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 −2.7882 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
−0.3987 −0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
−0.3987 −2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
3 1 2 0 0 0 0
4 2 2 0 0 0 0
5 3 1 0 0 0 0
6 4 1 0 0 0 0
6 5 2 0 0 0 0
M END
$$$$
Missing Blank Lines
Incorrectly aligned columns
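For comparison, this is roughly what the fixed-width layout looks like when written correctly. The sketch follows the usual V2000 column widths (three-character count fields, 10.4 coordinates, a three-character atom symbol, three-character bond fields) and fills every optional field with zero; the class and method names are hypothetical and the "999" property count is just the conventional default.

```java
import java.util.Locale;

// Hypothetical formatter for the three fixed-width line types of a V2000 mol file.
final class V2000Lines {
    // Counts line: atom count, bond count, eight zero fields, 999, version tag.
    static String countsLine(int atoms, int bonds) {
        return String.format(Locale.ROOT,
                "%3d%3d  0  0  0  0  0  0  0  0999 V2000", atoms, bonds);
    }

    // Atom line: x, y, z in 10.4 columns, a space, symbol in 3 columns,
    // a 2-column mass difference, then eleven 3-column integer fields.
    static String atomLine(double x, double y, double z, String symbol) {
        StringBuilder sb = new StringBuilder(String.format(Locale.ROOT,
                "%10.4f%10.4f%10.4f %-3s%2d", x, y, z, symbol, 0));
        for (int i = 0; i < 11; i++) {
            sb.append("  0");
        }
        return sb.toString();
    }

    // Bond line: first atom, second atom, bond order, then four zero fields.
    static String bondLine(int from, int to, int order) {
        return String.format(Locale.ROOT, "%3d%3d%3d  0  0  0  0", from, to, order);
    }
}
```

A complete file also needs the three header lines before the counts line (they may be blank but must be present), and the minus signs must be plain ASCII hyphens.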
Benefit of the doubt?
• These unintentional typographical errors in the normative text may perhaps be the result of poor fonts, with the exception of “InChI=”.
• Alas, the content of the original Annex B from the draft indicates that these issues were more widespread and may arise from ignorance of cheminformatics file formats.
§B.2.2 InChI in XML Example
<STRUCTURAL_REPRESENTATION_TYPE>INCHI</STRUCTURAL_REPRESENTATION_TYPE>
<STRUCTURAL_REPRESENTATION>1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1-
2(4)5;;;;;/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P-
2</STRUCTURAL_REPRESENTATION>
Missing InChI=
Standard and Non-Standard InChI?
Converted to upper case
Indentation
Spurious Spaces
Line Breaks
§B.2.4 V2000 Mol File in XML Example
<STRUCTURAL_REPRESENTATION_TYPE>MOL</STRUCTURAL_REPRESENTATION_TYPE>
<STRUCTURAL_REPRESENTATION>30 29 0 0 0 0 0 0 0 0999 V2000 9.9563 -7.3055 0.0000 Y
1 1 0 0 0 0 0 0 0 0 0 0 15.0355 -4.8847 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0 13.3609 -
8.0134 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8867 -9.9869 0.0000 O 0 5 0 0 0 0 0 0 0 0 0
0 6.4178 -6.8678 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.8872 -4.8955 0.0000 O 0 5 0 0 0 0
0 0 0 0 0 0 6.7218 -5.7285 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.0541 -9.1519 0.0000 C
0 0 0 0 0 0 0 0 0 0 0 0 13.3408 -6.8634 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8599 -
4.8881 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 13.0301 -5.7260 0.0000 C 0 0 0 0 0 0 0 0 0 0 0
0 5.9099 -9.9441 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0 6.4492 -7.9743 0.0000 O 0 0 0 0 0 0
0 0 0 0 0 0 6.7482 -9.1149 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8605 -5.4221 0.0000 C 0
0 0 0 0 0 0 0 0 0 0 0 11.8897 -5.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.9147 -9.4555
0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8855 -9.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.6897 -8.0305 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.6897 -6.8513 0.0000 C 0 0 0 0 0 0 0
0 0 0 0 0 8.7018 -6.2618 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 9.2908 -5.2506 0.0000 C 0 0
0 0 0 0 0 0 0 0 0 0 10.4700 -5.2524 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.0577 -6.2664
0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 12.0761 -6.8427 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
12.0891 -8.0218 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 8.7257 -8.5952 0.0000 N 0 0 0 0 0 0
0 0 0 0 0 0 11.0839 -8.6223 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 10.4848 -9.6275 0.0000
C 0 0 0 0 0 0 0 0 0 0 0 0 9.3057 -9.6139 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 10 2 1 0 0 0 0
8 3 2 0 0 0 0 25 24 1 0 0 0 0 8 4 1 0 0 0 0 27 18 1 0 0 0 0 7 5 2 0 0 0 0 26 28 1 0 0 0 0
7 6 1 0 0 0 0 19 27 1 0 0 0 0 15 7 1 0 0 0 0 20 21 1 0 0 0 0 17 8 1 0 0 0 0 30 27 1 0 0 0
0 11 9 2 0 0 0 0 30 29 1 0 0 0 0 11 10 1 0 0 0 0 20 19 1 0 0 0 0 16 11 1 0 0 0 0 22 21 1
0 0 0 0 14 12 1 0 0 0 0 23 24 1 0 0 0 0 14 13 2 0 0 0 0 18 14 1 0 0 0 0 26 25 1 0 0 0 0
21 15 1 0 0 0 0 29 28 1 0 0 0 0 24 16 1 0 0 0 0 23 22 1 0 0 0 0 28 17 1 0 0 0 0 M CHG 4
1 3 4 -1 6 -1 12 -1 M ISO 1 1 90 M END </STRUCTURAL_REPRESENTATION>
Where to begin?
All is not lost!
• Back at the 2011 ChemAxon UGM here in Budapest, Sorel Muresan from AstraZeneca Sweden gave a presentation on how spelling correction improves the recall of ChemAxon’s name-to-structure tools.
• The exact same CaffeineFix technology can be applied to perform aggressive “spelling correction” on SMILES strings, InChI and V2000 mol files.
• As with IUPAC-like systematic names, these can each be specified by a formal grammar.
How the algorithm works
• The regular expression describing V2000 mol files is compiled into a “finite state machine” with 1333 states.
• The only allowed “corrections” are the deletion of new lines and the insertion of spaces or new lines, but only where permitted in the grammar/FSM.
• Depth-first recursion is used to identify a minimal set of edits to correct the input.
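The idea in these three bullets can be shown on a much smaller grammar. The sketch below is not the 1333-state V2000 machine: it assumes a toy format in which every line holds exactly three integers separated by spaces, and all names are hypothetical. It allows only the edits listed above (delete a newline, insert a space, insert a newline), runs a depth-first search over (input position, state) pairs, and raises the edit budget one step at a time so that the first repair found uses a minimal number of edits.

```java
// Toy whitespace repair: depth-first search over a small hand-written state machine.
final class WhitespaceRepair {
    // States 0, 2, 4 expect the 1st, 2nd, 3rd integer (padding spaces allowed);
    // states 1, 3, 5 are inside that integer. Returns -1 if c is not allowed.
    static int step(int state, char c) {
        boolean digit = c >= '0' && c <= '9';
        if (state % 2 == 0) {
            if (c == ' ') return state;        // padding before an integer
            return digit ? state + 1 : -1;
        }
        if (digit) return state;               // still inside the integer
        if (c == ' ' && state < 5) return state + 1;
        if (c == '\n' && state == 5) return 0; // third integer ends the line
        return -1;
    }

    // Try to consume the rest of the input with at most 'budget' edits.
    static String search(String in, int pos, int state, int budget, String out) {
        if (pos == in.length() && state == 0) return out; // all lines complete
        if (pos < in.length()) {
            char c = in.charAt(pos);
            int next = step(state, c);
            if (next >= 0) {                   // accept the character unchanged
                String r = search(in, pos + 1, next, budget, out + c);
                if (r != null) return r;
            }
            if (budget > 0 && c == '\n') {     // edit: delete a newline
                String r = search(in, pos + 1, state, budget - 1, out);
                if (r != null) return r;
            }
        }
        if (budget > 0) {                      // edit: insert a space or a newline
            for (char ins : new char[] {' ', '\n'}) {
                int next = step(state, ins);
                if (next >= 0) {
                    String r = search(in, pos, next, budget - 1, out + ins);
                    if (r != null) return r;
                }
            }
        }
        return null;
    }

    // Iterative deepening: the first success uses the fewest edits.
    static String repair(String in, int maxEdits) {
        for (int budget = 0; budget <= maxEdits; budget++) {
            String r = search(in, 0, 0, budget, "");
            if (r != null) return r;
        }
        return null; // not repairable within maxEdits
    }
}
```

For example, repair("1 2\n3\n", 3) deletes the stray newline and inserts a space to give "1 2 3\n", and repair("1 2 3 4 5 6", 3) inserts two newlines. Insertions are only attempted where the state machine permits them, which is what keeps the search small.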
§B.2.4 example after correction
30 29 0 0 0 0 0 0 0 0999 V2000
9.9563 -7.3055 0.0000 Y 1 1 0 0 0 0 0 0 0 0 0 0
15.0355 -4.8847 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
13.3609 -8.0134 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
13.8867 -9.9869 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
6.4178 -6.8678 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8872 -4.8955 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
6.7218 -5.7285 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.0541 -9.1519 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.3408 -6.8634 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
13.8599 -4.8881 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
...
21 15 1 0 0 0 0
29 28 1 0 0 0 0
24 16 1 0 0 0 0
23 22 1 0 0 0 0
28 17 1 0 0 0 0
M CHG 4 1 3 4 -1 6 -1 12 -1
M ISO 1 1 90
M END
3 line Header Block before Count Line
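ChemAxon toolkit implementation
import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.struc.Molecule;

// Parse a mol file; if the first attempt fails, repair the whitespace and retry once.
public static Molecule molFileToChemaxonMol(String molFileStr) throws MolFormatException {
    try {
        return MolImporter.importMol(molFileStr);
    } catch (MolFormatException e) {
        // FixMolFile is the repair class from the forum posting referenced below.
        molFileStr = FixMolFile.fixMolFile(molFileStr);
        if (molFileStr == null) {
            throw e;  // not repairable: report the original parse error
        }
        return MolImporter.importMol(molFileStr);
    }
}
// Java source code available at http://www.chemaxon.com/forum/ftopic1265.html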
Geek of the week
• A particularly tricky corner case concerns Accelrys’ Pipeline Pilot-style V2000 mol files, which abbreviate the columns in the atom block (to save space).
• In these files there’s potential ambiguity where the first bond line is mistaken as a continuation of the last (abbreviated) atom line.
• Our solution relies on the atom stereo care field being zero in non-query mol files vs. the non-zero values that appear in the first three fields of bonds.
Lest we forget
• A similar “spelling correction” variant that allows uppercase characters to be mapped to lowercase, and the prefix “InChI=” to magically appear at the start of a string can also be used to fix ISO InChIs.
• Alas uppercasing an InChI (or any molecular formula) is potentially lossy, e.g. “CsN” vs. “CSn”.
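To see why the case information cannot always be recovered, the sketch below lists every way an uppercased formula can be read back as element symbols. The element set is a small illustrative subset rather than the full periodic table, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Enumerate the possible mixed-case readings of an uppercased molecular formula.
final class FormulaCase {
    // Illustrative subset of element symbols only.
    static final Set<String> ELEMENTS = new HashSet<>(Arrays.asList(
            "H", "C", "N", "O", "S", "Cl", "Cs", "Sn", "Co", "Si"));

    static List<String> readings(String upper) {
        List<String> out = new ArrayList<>();
        expand(upper, 0, "", out);
        return out;
    }

    private static void expand(String s, int pos, String prefix, List<String> out) {
        if (pos == s.length()) {
            out.add(prefix);
            return;
        }
        char c = s.charAt(pos);
        if (Character.isDigit(c)) {            // counts are copied through unchanged
            expand(s, pos + 1, prefix + c, out);
            return;
        }
        String one = String.valueOf(c);
        if (ELEMENTS.contains(one)) {          // read as a one-letter symbol
            expand(s, pos + 1, prefix + one, out);
        }
        if (pos + 1 < s.length() && Character.isLetter(s.charAt(pos + 1))) {
            String two = one + Character.toLowerCase(s.charAt(pos + 1));
            if (ELEMENTS.contains(two)) {      // read as a two-letter symbol
                expand(s, pos + 2, prefix + two, out);
            }
        }
    }
}
```

With this subset, readings("CSN") yields three candidates (CSN, CSn and CsN), whereas readings("C17H21CLN4O") yields only C17H21ClN4O because “L” is not an element symbol. A fixer can restore the case safely only in the second kind of situation.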
Before and after InChI example
1S/C17H21CLN4O/C1-22-12-3-2-4-13(22)8-11(7-12)21-17(23)14-5-10(18)6-15-16(14)20-9-19-15/H5-6,9,11-13H,2-4,7-8H2,1H3,(H,19,20)(H,21,23)
InChI=1S/C17H21ClN4O/c1-22-12-3-2-4-13(22)8-11(7-12)21-17(23)14-5-10(18)6-15-16(14)20-9-19-15/h5-6,9,11-13H,2-4,7-8H2,1H3,(H,19,20)(H,21,23)
How common are the ambiguities?
• 1.35 million standard InChIs from ChEMBL
• Uppercase the InChIs, fix them and check whether the original InChI can be regenerated
• 99.5% roundtrip (6596 discrepancies)
InChI case-insensitive ambiguities
Conclusions
• The Java source code for recovering V2000 mol files and InChIs from the types of corruption seen in the ISO 11238 draft has now been contributed to the ChemAxon forum, allowing Marvin and JChem to read the examples given in that document.
• Whether this functionality will be required to fully support the final (pending) “Implementation Guide” requirements remains to be seen (and voted upon).
• Attention to detail is important in standards writing.
Final words
• ISO 11238 IDs may become as popular as Chemical Abstracts’ registry numbers.
Acknowledgements
• Daniel Lowe, NextMove Software, Cambridge, UK.
• Richard Bolton, GSK, Stevenage, UK.
• Evan Bolton, NCBI PubChem, Bethesda, MD, USA.
• Dac-Trung Nguyen, NIH NCATS, Rockville, MD, USA.
• Tyler Peryea, NIH NCATS, Rockville, MD, USA.
• Noel Southall, NIH NCATS, Rockville, MD, USA.
• Yulia Borodina, FDA, Silver Spring, MD, USA.
• Lawrence Callahan, FDA, Silver Spring, MD, USA.
• Andrew Marr, Marr Consultancy, Knebworth, UK.