development of chemical markup language (cml) as a system …€¦ · variety of purposes (e.g....

48
Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content Michael Wright1 - Supervisor: Dr. H. S. Rzepa1 - project in collaboration with S. Zara1 and Peter Murray-Rust2 (1) Department of Chemistry, Imperial College of Science, Technology and Medicine, UK - (2) School of Pharmaceutical Sciences, University of Nottingham, UK - May 22, 2000 We report the conclusion of an 8 month project culminating in the first fully operational system for managing complex chemical content entirely in interoperating XML-based markup languages.1 This involved the extension of published Chemical Markup Language (CML 1.0)2 and the development of mechanisms allowing the display of CML marked up molecules, spectra and reactions within a standard web browser (Internet Explorer 5.x).3 Integrating these techniques with existing XML compliant languages (e.g. XHTML4 and SVG56 ) results in electronic documents with the significant advantages of data retrieval and flexibility over existing HTML/plugin solutions. These documents can be optimised for a variety of purposes (e.g. screen display or printing7 ) by single stylesheet transformations. An XML schema89 has also been developed for CML 1.0 to allow document validation and the use of data links. The ChiMeraL website,1 containing a range of online demonstrations, examples and CML resources has been built as an integral part of this project. 1 - Introduction The World-Wide-Web was originally conceived as a collaborative tool for scientists, allowing the rapid distribution and publication of results and greatly improved online communication. It has become increasingly common for academic papers to be posted online as HTML (hypertext markup language) pages, this being an extremely efficient way of making them available to everyone.3 Databases and knowledge archives are increasingly becoming 'web enabled' allowing faster, easier and more widely available access. This reduces unnecessary or parallel research and allows significantly more efficient literature searches. However, awkward handling of non-textual information (e.g. molecular structures) and difficulties in automatically recognising and extracting data from HTML pages limit their potential.10 11 12 More sophisticated mechanisms are required to solve these problems, and these are presently being developed. This project involved the development and extension of chemical markup language (CML)2 and techniques allowing the display of molecules, spectra and reactions within a web browser. These chemical objects can be seamlessly integrated into existing textual markup to create complete electronic documents (an example being this report). 1.1 - Existing Solutions - HTML and plugins The powerful text formatting language HTML and its interface, the web browser are familiar tools to many scientists. While able to handle bitmapped images (normally gif, jpg and png), HTML was originally designed to display plain text on a computer screen. Its markup 'tags' are primarily concerned with formatting textual objects (paragraphs, tables, headings etc.) and supporting inter-document linking via the hypertext mechanism. ChiMeraL - xmldoc_writeup (May 22, 2000) - 1

Upload: others

Post on 20-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Development of Chemical Markup Language(CML) as a System for Handling ComplexChemical ContentMichael Wright1 - Supervisor: Dr. H. S. Rzepa1 - project in collaboration with S.Zara1 and Peter Murray-Rust2

(1) Department of Chemistry, Imperial College of Science, Technology and Medicine, UK - (2) School of PharmaceuticalSciences, University of Nottingham, UK - May 22, 2000

We report the conclusion of an 8 month project culminating in the first fully operational system formanaging complex chemical content entirely in interoperating XML-based markup languages.1 Thisinvolved the extension of published Chemical Markup Language (CML 1.0)2 and the development ofmechanisms allowing the display of CML marked up molecules, spectra and reactions within a standardweb browser (Internet Explorer 5.x).3 Integrating these techniques with existing XML compliant languages(e.g. XHTML4 and SVG5 6 ) results in electronic documents with the significant advantages of dataretrieval and flexibility over existing HTML/plugin solutions. These documents can be optimised for avariety of purposes (e.g. screen display or printing7 ) by single stylesheet transformations. An XMLschema8 9 has also been developed for CML 1.0 to allow document validation and the use of data links.The ChiMeraL website,1 containing a range of online demonstrations, examples and CML resources hasbeen built as an integral part of this project.

1 - Introduction

The World-Wide-Web was originally conceived as a collaborative tool for scientists, allowing therapid distribution and publication of results and greatly improved online communication. It hasbecome increasingly common for academic papers to be posted online as HTML (hypertextmarkup language) pages, this being an extremely efficient way of making them available toeveryone.3 Databases and knowledge archives are increasingly becoming 'web enabled' allowingfaster, easier and more widely available access. This reduces unnecessary or parallel researchand allows significantly more efficient literature searches. However, awkward handling ofnon-textual information (e.g. molecular structures) and difficulties in automatically recognising andextracting data from HTML pages limit their potential.10 11 12

More sophisticated mechanisms are required to solve these problems, and these are presentlybeing developed. This project involved the development and extension of chemical markuplanguage (CML)2 and techniques allowing the display of molecules, spectra and reactions within aweb browser. These chemical objects can be seamlessly integrated into existing textual markup tocreate complete electronic documents (an example being this report).

1.1 - Existing Solutions - HTML and plugins

The powerful text formatting language HTML and its interface, the web browser are familiar toolsto many scientists. While able to handle bitmapped images (normally gif, jpg and png), HTML wasoriginally designed to display plain text on a computer screen. Its markup 'tags' are primarilyconcerned with formatting textual objects (paragraphs, tables, headings etc.) and supportinginter-document linking via the hypertext mechanism.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 1

Page 2: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

HTML documents rely on human processing for content interpretation and error detection sincethe language's syntax is too flexible to be reliably interpreted by machines. It cannot be validatedwith respect to the document's integrity and there is very little markup of context (to indicate themeaning and relevance of each information component). Non textual data normally requires theuse of browser 'plugins' (small, platform specific helper programs) or Java 'applets' (nonplatform specific programs, that run on a 'virtual machine' on top of the web browser) - see Figure1.1.1. As an example, a present day online chemical paper might consist of HTML text, severalstatic images and a molecular structure intended to be visualised using a plugin. A common pluginis MDL's Chime13 (based on the open source Rasmol viewer14 ) which like most of these programs,requires an external data file in one of a variety of 'legacy' formats e.g. MDL .mol or Brookhaven.pdb.15 These files are widely available online and use ASCII but their syntax is heavily implied andrequires specialist knowledge to interpret. This severely limits the re-usability of the rapidlygrowing amount of high-quality chemistry available on Web pages and a significant amount oftime and information is wasted converting between these files.

Figure 1.1.1: Comparing different ways of displaying chemical structures

Taxol intermediate as a .gif image (bitmap) -a 'dead' format, human comprehensible butchemical information can't be automaticallyextracted

Riboflavin using a plugin - Uses the EMBEDtag and an external data file

Cochineal using Chemaxion's Marvin applet -the structure is being stored as CML withinthis document, hence no data file is needed

Since HTML4 wasn't originally designed to support non textual data, there is no built in mechanismallowing molecular structures to be shared between a plugin and the parent document. As aresult, the external data files become isolated from the text and from each other. This reducesinter-operability and makes it significantly harder to search or data-mine for information withinthese files, while at the same time retaining their context. An extensive literature search throughHTML pages using a conventional search engine requires significant human interaction in order toidentity useful 'hits'. If the information was labelled by context, far more could be handledautomatically, e.g. filtering and ignoring identical data from different sources. The main problemswith existing HTML/plugin solutions include:

1 - The intimate binding of content and formatting in the HTML document. This restricts thedocument to a single style (fixed colours, object layout, plugin choice etc.) and limits the amountof additional information that can be supplied without overwhelming the reader. More importantly italso means text marked up using HTML is effectively lost as far as machine readability isconcerned.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 2

Page 3: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

2 - The use of legacy formats with implied syntax and the resulting multiplicity of standards. Asingle, human readable format is required, combining both textual and non textual informationwithin a single document.

Additional considerations include scalability, validation, platform independence (particularlyrelevant online), portability and flexibility.16 Considering the present cost of computer hardwareand improvements in communication bandwidths, file size is significantly less of a concern. Manylegacy formats are designed to be extremely brief (described as terse) and hence have a largeamount of implied syntax.

1.2 - Extensible Markup Language (XML)

In February 1998, the World Wide Web Consortium (W3C) published their recommendations forextensible markup language (XML)17 18 as a successor to and superset of, HTML. XML isn't amarkup language in its own right, instead it is a series of rules and conventions defining how tobuild discipline-specific markup languages (a so called 'meta-markup language'). XML is designedto allow the description of any structured data by the use of an user defined set of tags. Thesetags (called elements in XML nomenclature) support precise declarations of a document'scontent and context using explicit syntax. A defined set of elements is called an XML languageand these languages are far more flexible than HTML, since they can be tuned to a particular typeof information. Once the language and the means to display it have been developed, non-textualmark-up becomes trivial.

Figure 1.2.1: Data flow in an XML/XSL transform (SVG Diagram - an XML document can be transformed by one of many stylesheetsdepending on the required output.

All XML compliant languages use the same syntax (looking superficially similar to HTML) and canbe manipulated using any XML tools or applications. Information marked up using differentlanguages can be easily combined into a single document and manipulated at the same time.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 3

Page 4: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

These different languages; e.g MathML,19 CML,20 SVG (Scalable Vector Graphics)5 are identifiedwithin the document by the use of discrete namespaces21 (described later) and are often gluedtogether using XHTML (HTML following XML rules). An online XML paper might consist of XHTMLtext combined with SVG diagrams and CML structures or spectra.

Unlike HTML, XML elements do not describe a document's formatting and the browser is nolonger able to comprehend and display the source directly (as it can with HTML). Instead theparser uses an external stylesheet, which contains formatting instructions for each element in theXML source. This stylesheet is supplied by either the author or the reader and effectively mapsthe contents of each element to an output. This would normally be a HTML page or some othertext file but is not required to be (they can for example, be mapped to a Acrobat file).22

The older cascading stylesheets (CSS) are commonly used with HTML.23 They provide amechanism for centralising formatting and layout instructions from many pages to a singlestylesheet. CSS works by allowing the author to redefine the meaning of existing HTML tags, e.g.<H1> can be defined as meaning 'size 11 and red'. This is useful when the style of an entire website needs to be changed, since it only requires rewriting the stylesheet and not every HTMLpage. Much more sophisticated stylesheets can be written using a powerful formatting, searchingand scripting language XSL (extensible stylesheet language).24 A significant proportion of thisproject involved the construction of a range of chemically useful stylesheets.

Since any number of stylesheets can be used with a single XML document, it can be displayed ina large variety of ways (Figure 1.2.1). The searching and scripting capabilities of XSL mean thatelement(s) can be picked out and displayed and/or manipulated according to their contents. Otherinformation can be ignored if required, making data searches from large XML documents trivial. Apaper containing a mixture of text and chemical information can be scanned for spectra by thestylesheet and these displayed using a suitable applet. Alternatively, a different stylesheet mightbe used to identify molecular structures and potentially calculate their properties.

Figure 1.2.2: Dynamic selection of stylesheets - this is a much simpler version of the ChiMeraL demonstration

Stylesheet transformations are carried out using software called an XML/XSL parser - this may bea dedicated program or a component built into an existing web browser (as with Internet Explorer5.x).25 Stylesheet selections can be defined by a URL (uniform resource locator - also called a webaddress) from within the document or selected using the parser. This allows a reader to use theirown stylesheets. For example, they might dislike or be unable to use the default molecular viewerand would prefer one of their own. This flexibility makes stylesheets very powerful and is

ChiMeraL - xmldoc_writeup (May 22, 2000) - 4

Page 5: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

exemplified by the ChiMeraL demonstration1 where sample CML data files containing moleculeand spectra can be transformed with a range of XSL stylesheets.

The use of a unified XML format reduces the time wasted in processing legacy files and avoidsthe risk of serious information loss due to poor semantics. This "plug-compatible" XML approachguarantees a document to be searchable, sortable, mergeable, and printable with minimal extracost. Ideally all future document and publishing systems will be transferred to using XMLlanguages.

1.3 - Chemical Markup Language (CML 1.0)

The definition of CML version 1.0 was published last year by Peter Murray-Rust and Henry S.Rzepa.2 It was developed to carry molecules, cystallographic data and reactions using an XMLlanguage and for the first time offers a universal, platform and application independent format forstoring and exchanging chemical information. As the first generation of CML, it outlines a varietyof general purpose 'data-holder' elements and a smaller number of more specifically chemicalelements (e.g. <molecule> , <reaction> , <crystal> ) used to indicate chemical 'objects'. Forexample, a <molecule> will contain a <list> of <atom> s, which in turn have three <float> sgiving each atoms Cartesian coordinates.

This framework is flexible but leaves many areas open for later evolution. In particular, CMLprovides no default conventions for labelling data elements and puts few restrictions on elementordering. This is intentional as CML is designed to be generic and contains minimalpreconceptions as to the type of chemical information that will be stored using it. Standardisationand conventions are the concern of the community (CML's users) and will be co-ordinated bypublications and projects such as this one.

We have developed rigorous markup procedures for small and medium molecule structures andspectra using an additional element <spectrum> which is not part of CML 1.0. Exploratory workhas been carried out into <reaction> markup and data linking within a CML/XHTML document.Simply defining such procedures is of limited use and a variety of XSL template 'fragments' havebeen developed. These allow molecules, spectra and reactions to be displayed using a variety ofexisting Java applets and can be used for the construction of stylesheets able to format mixedXML documents containing chemical information. These stylesheet fragments and examples areoffered as resources to the CML community. The culmination of this work is the development ofthe first fully operational demonstration of CML parsing within a web browser

The aim of this project is to raise the profile of CML and promote its acceptance and use in thechemical community. All work is offered as open source and is intended to help stimulate thedevelopment of new ideas, tools and resources in this area.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 5

Page 6: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

2 - XML, Namespaces and Schema

The markup syntax used for XML languages is superficially similar to that used for HTML (thoughsignificantly stricter).18 Both types of markup are written using ASCII and consist of data unitscontained within nested tag pairs (identifiable as an open <tag> and a close </tag> ). In HTML,these tags are often used to describe formatting properties, e.g. <FONT SIZE="+1"

COLOR="#FF0000">here is some <B>bold</B> text</FONT> . Since XML languages strictlyseparates out formatting from semantic markup, they describes the meaning and context of theirdata as opposed to its formatting e,g. <float title="melting point"

/float> ). In XML nomenclature, tag pairs are called 'elements' (not to be confused with chemicalelements!) and an element's content consists of any attributes of that element, their values, andany text or sub-elements nested within the tag pair. Text or sub elements within a parent elementare described as children of that element and siblings of each other.

2.1 - XML Rules

The most important rules of XML syntax are as follows. A document that following these isdescribed as 'well formed'.

1 All tags must be closed to form an element e.g. <formula>C8 H10 N4 O2</formula> and must becorrectly nested (in contrast with HTML, were incorrect nesting is normally ignored).

2 'Empty' elements with no children may be written using the shorthand <link/> = <link></link>3 All attribute values must be within quotes (") e.g. <string title="CAS"> (this report will use a

convention; @title = an attribute named 'title')4 Element and attribute names are case sensitive and are normally written using lower case to

distinguish them from HTML.5 Code comments are of the form <!-- comment --> and are ignored by the parser.

Figure 2.1.1: Simple XML document - note the 'tree' structure

<?xml version="1.0"?><?xml-stylesheet type="text/xsl"

href="http://www.ch.ic.ac.uk/chimeral/publications/document.xsl" ?>

<document title="Simple example of an XML document" id="xmldoc_simpleEx"xmlns:xhtml="http://www.w3.org/1999/xhtml"xmlns:cml="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml">

<xhtml:P>This is a block of text formatted using XHTML (HTML followingXML rules). Normal <xhtml:I>italic</xhtml:I> and <xhtml:B>bold</xhtml:B>elements can be used as can more complex <xhtml:FONT FACE="Helvetica"COLOR="#000080">fonts and colours.</xhtml:FONT> Following this is a blockof molecular information marked up using CML. Additional blocks usingother XML languages could be added at will.<xhtml:/P>

<cml:cml title="Properties of Caffeine" id="cml_caffeine"><cml:molecule title="caffeine" id="mol_caffeine">

<cml:formula>C8 H10 N4 O2</cml:formula><cml:string title="CAS">58-08-2</cml:string><cml:float title="molecule weight">194.19</cml:float><cml:float title="melting point" units="degC">238</cml:float><cml:string title="comments">White powder or white glistening needles

usually melted together. LIGHT SENSITIVE</cml:string><cml:list title="alternate names">

<cml:string title="name">1,3,7-Trimethylxanthine</cml:string><cml:string title="name">1,3,7-Trimethyl-2,6-dioxopurine</cml:string><cml:string title="name">7-Methyltheophylline</cml:string>

</cml:list></cml:molecule>

</cml:cml></document>

An example of a short XML document is given in Figure 2.1.1 which contains both XHTML andCML marked up information. Those familiar with HTML will recognise the syntax but will find theelement names unfamiliar. Notice how information can be stored both as an element child andalso as the value of an attribute.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 6

Page 7: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

The structure of this document is best visualised as a branched 'tree' (similar to the directorystructure on a hard-drive). It has a single 'root' (in this case <document> ) with a number ofsub-elements (<P> and <cml> , ignoring the prefix) and sub-sub-elements etc. defining the logicalstructure of the document. Each branch in the tree may contain information either as a text childor as attribute values. Each piece of information is therefore uniquely described by a list of itsancestors and this provides a convenient method for stylesheets to navigate the XML tree (calledpattern matching).

Looking back at Figure 2.1.1 we can see that this document contains a chemical formula, CASnumber, molecular weight, melting point and three alternate names for caffeine. In addition, thereare two processing instruction tags (line 1 and 2) that use the syntax <?tag_name_here?> . Thefirst, <?xml version="1.0"?> indicates that the document has been written to follow XML rules.The second, <?xml-stylesheet .. ?> gives the URL of the default stylesheet. Unless givenover-riding instructions (e.g. by a reader who wishes to use their own stylesheet) the parser willautomatically obtain this stylesheet and use it to transform (format) the XML document. Thistransformation can be carried out locally, on a web server or using a combination of the two.

When writing an XML document, it is usual to collect data of similar types and 'pigeon hole' themtogether in a suitably logical structure. For example; the formula for caffeine is stored in branch\document\cml\molecule\formula. Figure 2.1.2 shows a simplified CML document tree.

Figure 2.1.2: Illustrating the CML element 'tree' - (SVG Diagram) note how a molecule contains three lists, one of alternate names, one ofatoms with their coordinates and one of bonds and their atom references (the atoms the bond is between)

ChiMeraL - xmldoc_writeup (May 22, 2000) - 7

Page 8: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

2.2 - Namespacing

The attribute:

xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml"

on <document> in Figure 2.1.1 is called an XML namespace declaration.21 XML rules allow forthe mixing of XML languages within the same document and there is no mechanism to preventpotentially conflicting element names being used in two or more different languages. Indeed sucha mechanism would greatly restrict development of XML. Instead, the author assigns eachlanguage a different namespace prefix, using the syntax: <namespace:element> and@namespace:attribute . The choice of prefix for any particular language is left to the author andthese need only be unique over the document in question. A namespace declaration (normallyfound on the root element) is then used to index each prefix to a globally unique URL (often thelanguage development site). Alternatively, it can point to a file containing a description of thelanguage, as it does for the CML namespace.

There are two types of namespace declaration. The simplest is of the form@xmlns:mynamespace="http://www.uniqueURL.com" and declares any elements with amynamespace: prefix as belonging to a language defined at 'http://www.uniqueURL.com'. It isassumed that attributes share the namespace of their element unless declared otherwise. Adefault namespace declaration; @xmlns="http://www.anotherURL.com" , results in the elementcontaining the declaration and all ancestors of that element, belonging to"http://www.anotherURL.com" unless otherwise stated. This avoids having to use a prefix in frontof a large number of similar elements.

Figure 2.2.1: Namespaces used in a XHTML/CML document

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="document.xsl" ?><karne:document

xmlns="http://www.w3.org/1999/xhtml"xmlns:karne="x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml"xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml">

<karne:abstract>We report the conclusion an 8 month project, culminating in the ..</karne:abstract>

<karne:chapter id="chap_introduction" title="Introduction"><karne:index>1</karne:index>

<P>The <B>World-Wide-Web</B> was originally developed as a collaborative toolfor scientists, allowing for rapid distribution and publication of results andgreatly improved communication by email. It has become increasingly common foracademic papers and results to be posted online, this being an extremely ..</P>

</karne:chapter>

<cml title="cml data block" id="cml_moldata"xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml">

<molecule title="caffeine" id="mol_caffeine">..</molecule><chimeral:spectrum title="Furan, tetrahydro-" id="spect_furantetrah_ir_1">..</chimeral:spectrum>

</cml>

</karne:document>

Figure 2.2.1 shows a document mixing XHTML, published CML 1.0,2 ChiMeraL1 CML (which alsoincludes <spectrum> ) and a simple document structure language (developed for this report). Thefour namespace declarations are as follows:

ChiMeraL - xmldoc_writeup (May 22, 2000) - 8

Page 9: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

• xmlns="http://www.w3.org/1999/xhtml"All elements default to the XHTML namespace (defined by the W3C) unless otherwise stated. There isno need to write stylesheet entries to transform this language, instead it can be designed to pass theseunrecognised elements directly to the browser (which already knows how to format XHTML). This'bypass' avoids unnecessary stylesheet templates and makes stylesheet development significantlyquicker.

• xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml"One element in <cml> is not in the default cml: namespace, this is <spectrum> and requires anamespace of its own - chimeral: . As with the two previous declarations, the URL points to a XMLschema for this language.

• xmlns:karne="x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml"This language describes the logical structure of the document e.g. <abstract> , <chapter> ,appendix>

• xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml"This declaration is found on <cml> and changes the default namespace for that element and itschildren, anything within <cml> is CML unless declared otherwise

2.3 - Schema and DTD

The URL in a namespace declaration may point to a file describing the XML language. This canbe a simple HTML page but a more powerful method is to point to a file defining the languageusing a machine readable syntax. This allows the parser to automatically validate (check) thedocument against the language(s) it uses, before transforming it with a stylesheet. The parser canalso be informed if elements belong to a particular data-type (string, integer, etc.) or of 'special'attributes exist (e.g. unique @id or @href links) and validate these also. If a document is correctXML (well formed) but does not comply to the definitions of the languages it uses, it is describedas non-valid and can not be parsed. For example, <float builtin="CAS">58-08-2</float> iswell formed XML but is not valid CML since <float> may only contain a single floating pointnumber.

There are two alternative standards for writing machine readable descriptions. In both cases, thefile lists all valid elements and attributes for the language and can declare restrictions on elementcontent and child ordering. The older standard is called DTD (document type definition) and usesa specialised machine readable (but not particularly human readable) syntax. DTDs have widesupport amongst XML tools and parsers but are being superseded by XML schemas.8 These arewritten using an XML language defined by the W3C and can therefore be parsed with the sametools as XML documents and XSL stylesheets. Schemas allow more sophisticated descriptions ofvalid element trees and they are significantly more human readable. Parsers can recognise aschema referenced in a namespace declaration and automatically validate the document againstit.9 If a document consists of a mixture of XML languages, the use of a schema for each allowsmore complex manipulations (for example the linking of data between languages). Since themodular structure of XML allows (and indeed expects) the addition and extraction of information'objects' using many languages, this flexibility is very important.

The CML schema (version 0.2) was developed from the published DTD as part of this project.Two versions have been made available, the first is a generic XML schema, the second isoptimised for Internet Explorer 5 and includes additional data-type and linking declarations. A fulldescription of this schema can be found in Appendix C. Since <spectrum> is not included in theCML 1.0 DTD, an additional namespace and schema have been constructed for it. It is expectedthat additional elements will be required as further areas of chemical markup are developed and it

ChiMeraL - xmldoc_writeup (May 22, 2000) - 9

Page 10: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

is suggested that this approach be maintained. Once a particular addition has been reviewed andaccepted by the community, it can then be incorporated into a future version of published CMLand included in the schema. The CML schema is presently at http://www.ch.ic.ac.uk/chimeral, thisis likely to change to http://www.xml-cml.org20 as a permanent CML namespace

2.4 - Platform Choice

There is a wide variety of XML parsers presently under development,26 many written using theprogramming language Java.27 Java programs are run within a software 'virtual machine' and notdirectly on a computers operating system. Since this software has been written for most operatingsystems (Windows, Mac OS, Linux, Unix, etc.), Java programs can be ported between theseplatforms with minimal effort. This platform-independence means Java is popular and versatile butprograms written using it can be non trivial to install and require the virtual machine to run. It wasthe aim of this project to build a fully working online demonstration of real time CML parsing andtransforming using a ubiquitous and familiar interface - namely the web browser. This restrictsfunctionality (Java parsers are significantly more advanced) but allows CML examples to be runusing software already installed in a large number of computers and the requires minimalknowledge of XML from the user.

The choice of browser is somewhat restricted. Later versions of Windows Internet Explorer 4.xhave some limited XML support and this is much improved in version 5 onwards. It now includesfull DOM (Document Object Model),28 XSL and XQL (extensible query language) support.Netscape 6 (based on the open source Mozilla project)51 supports XML but is only able to displayit using CSS stylesheets. It is hoped that XSL will be included in the next release. No othercommon browsers support XML to any reasonable degree. This being the case, Internet Explorer5.x is presently the only suitable client for online XML applications. Its powerful scripting support,allowing Java Script to control the loading and parsing of XML documents merely adds weight toits choice.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 10

Page 11: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

3 - Chemical Markup Language

In keeping with the XML philosophy, CML is designed to be extensible and will be published as aseries of drafts. The first draft (CML 1.0)2 contains the core components of this new language anddefines two main types of markup: data holders and chemical components.

3.1 - Data Holders

These are generic elements designed to contain the standard data types required by chemistry.These include strings, integers, floating point numbers and a series of array and matrix elements.These do not usually contain sub-elements, only text, and their context is described by the valuesof their attribute (particularly @title ); e.g. <float title="melting point"

238.2</float> contains the melting point data: 238.2 degrees Centigrade.

The values of these attributes are not defined in the DTD or schema so any data can be containedsimply by choosing them appropriately. This avoids the need to define large numbers of specificelements within CML. To include a molecule's melting point, you could define an additionalelement <meltPoint> but this would not be recognised and formatted by a generic CMLstylesheet. It is possible to build new stylesheets including the new element, but this reduces theircompatibility. The better approach is to use existing <float> and add @title="melting point"

to give it a label. A stylesheet can then display this as 'melting point: ...' without having torecognise the attribute value directly (Figure 3.1.1).

Figure 3.1.1: Simple stylesheet transform (SVG Diagram)

3.2 - Chemical Components

Chemical components represent objects of chemical interest: molecules, atoms, bonds etc. andthese elements normally contain a number of data holders. CML 1.0 defined the basiccomponents for molecular structure and crystallographic markup but the extension of CML toother areas of chemistry requires the addition of additional elements. The number of these shouldbe minimised as much as possible and new elements must be described as an extension to CMLwith their own DTD or schema. In most cases each chemical component should also be suppliedwith a unique identity @id . A molecule containing chemical property information is shown inFigure 3.2.1 (in order to simplify the code, processor instructions, <document> tags andnamespace declarations have been neglected in the following examples).

ChiMeraL - xmldoc_writeup (May 22, 2000) - 11

Page 12: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 3.2.1: A simple CML block containing property information

<cml title="tetrahydrofuran" id="cml_THF"><molecule title="Furan, tetrahydro-" id="mol_thf_1">

<formula>C4 H8 O</formula><string title="CAS">109-99-9</string><float title="molecular weight">72.1066</float>

<string title="ACX">I1001473</string><string title="DOT">UN 2056</string><string title="RTECS">LU5950000</string>

<float title="melting point" units="degC">-108.3</float><float title="boiling point" units="degC">65</float>

<float title="specific gravity">0.886</float><float title="water solubility" convention="g/100 mL at 23 degC">30</float><string title="comments">

Colorless liquid with an ether-like odor detectable at 2 to 50 ppm.HYGROSCOPIC</string><list title="alternate names">

<string title="name">THF</string><string title="name">1,4-Epoxybutane</string><string title="name">Butylene oxide</string><string title="name">Cyclotetramethylene</string><string title="name">tetramethylene oxide</string><!-- etc -->

</list></molecule>

</cml>

Two other important elements used in CML are <list> , which allows for collections of similardata holders and <link> which supports linking between data in the same, or differentdocuments. Common attributes are @title , @convention , @unit and @builtin . The latteruses a series of fixed values predefined in the DTD (e.g. xyzFract, elementType, atomId) butvalues for the rest depend on convention and published examples. Attempts should be made touse values compatible with those used by others in the field and it is expected that publishedexamples and online projects will help coordinate this. In a similar manner while the DTD definesthe element names that should be used when marking up a chemical object, the tree structure isleft to convention. It is likely that many of these conventions will be based on existing file formats(e.g. MDL's .mol), simply because this makes conversions to and from CML easier. At a laterstage we would expect CML to become a standard export format for chemical software packages.

3.3 - Displaying CML Using a Stylesheet

The CML block in Figure 3.2.1 contains only trivial CML property information. A more complexexample might contain molecular structures (2D and 3D), spectra (ir, uv/vis, nmr or ms) andpossibly a reaction scheme. Techniques for marking up and displaying these more complexcomponents are covered later, but to illustrate how a XSL stylesheet is written a simple examplewill be explained in full (Figure 3.3.1 transforming Figure 3.2.1).

XSL follows XML rules and therefore requires a namespace declaration in the same way as anyother XML document. Internet Explorer recognises http://www.w3.org/TR/WD-xsl 25 as a XSLnamespace declaration and the ability to understand this language is built into the parser. XSLuses a limited set of 'commands' (approximately 15) which can be recognised by their xsl: prefix.All text and all elements not part of the XSL namespace are passed directly to the output (in thiscase, the browser). Two exceptions are, comments (<!-- comment --> ) which the parserignores and the contents of the <xsl:eval> element which are used for scripting and calculationswithin the stylesheet.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 12

Page 13: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 3.3.1: Simple XSL stylesheet (HTML table)

<?xml version="1.0"?><xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl"><!-- root --><xsl:template match="/"><xsl:apply-templates select="*"/></xsl:template>

<!-- match <cml> and build HTML page --><xsl:template match="cml"><HTML>

<HEAD><TITLE><xsl:value-of select="@title"/> - <xsl:value-of select="@id"/></TITLE></HEAD><BODY><xsl:apply-templates select="molecule"/></BODY>

</HTML></xsl:template>

<!-- build html table --><xsl:template match="molecule">

<!-- Pull out @id="" etc --><TABLE><TR>

<TD>Molecule ID:</TD><TD>Formula:</TD><TD>CAS:</TD><!-- etc. --></TR><TR><TD><xsl:value-of select="@id"/></TD><TD><xsl:value-of select="formula"/></TD><TD><xsl:value-of select="*[@title = 'CAS']"/></TD><!-- etc. --></TD></TR><TR><TD>Alternate Names:</TD><TD COLSPAN="6"><xsl:for-each select="list[@title = 'alternate names']/string[@title='name']"><xsl:value-of select="text()"/>, </xsl:for-each></TD></TR><!-- etc -->

</TABLE></xsl:template></xsl:stylesheet>

The stylesheet contains a series of <xsl:template> s which in turn consist of a mixture of XSLcommands and template text/tags. At the simplest level, each template refers to the particularelement in the XML source that matches the template's @match value. Therefore, <xsl:templatematch="molecule"> marks the start of a template for the <molecule> element. The parsernavigates iteratively through the XML document tree, starting at the document root and selectingmatching templates at each branching. Navigation through the tree is controlled by thexsl:apply-templates select=" value and branches can be parsed or ignored as required.Matching and selection of elements is handled by a powerful pattern matching language whichrefers to the document tree using a Unix like syntax (a full description of pattern matching can befound on the W3C24 and MSDN25 web-sites).

The example shown here (Figure 3.3.1) contains three templates. The first template (@match="/"- the document root) starts the parsing process at the first level element in the document (<cml> inthis example). A similar template is required for all XSL stylesheets. The second template@match="cml" ) builds the start of a skeleton HTML page and put the values of @title and @id

within the TITLE tags. <xsl:apply-templates select="molecule"/> instructs the parser toselect child element(s) <molecule> and to find a template for it. Once that branch has beencompleted, the parser returns to the second template and completes the HTML page:

<HTML><HEAD><TITLE> Furan,tetrahydro- mol_thf_1 </TITLE></HEAD><BODY><!--results

of transforming <molecule> here --></BODY></HTML>

The third template recognises the chemical properties within the <molecule> and formats theirvalues to a simple HTML table. Note how the alternate names are handled; the contents ofxsl:for-each select="list[@title = 'alternate names']/string[@title='name']"> is

ChiMeraL - xmldoc_writeup (May 22, 2000) - 13

Page 14: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

repeated each time the pattern value of @select matches an element in the XML document. Theoutput from the complete stylesheet is shown in Figure 3.3.2

Figure 3.3.2: Displaying CML block as a simple HTML table

Molecule ID: Formula: CAS: ACX: DOT: RTECS: MW:

mol_thf_1 C4 H8 O 109-99-9 I1001473 UN 2056 LU5950000

MP:-108.3 degC BP: 65 degC Spec. Gravity: 0.886 Water Sol.: 30 g/100 mL at 23 degC

Alternate

Names:

THF, 1,4-Epoxybutane, Butylene oxide, Cyclotetramethylene, tetramethylene oxide,

Comments: Colorless liquid with an ether-like odor detectable at 2 to 50 ppm. HYGROSCOPIC

3.4 - Display of CML using Applets

More complex chemical data can not be easily displayed using text and a more sophisticatedsolution is required. A hypothetical CML-aware parser might have this functionality built in, beingable to recognise chemical components and display them appropriately. Using existing parsers,the problem is analogous to that encountered with normal HTML pages. Browser plugins andJava applets are used to provide additional display functionality and a wide range of of theseadd-on programs have been developed, some of them very advanced. An efficient solution is touse a combination of stylesheets and applets (adapted if necessary) to display the CMLcomponents contained within a mixed XML document. Using existing software is preferable asthis allows faster development of a working application and is in the spirit of XML's reusability.

Figure 3.4.1: Applet code produced by the stylesheet and its display

<APPLET NAME="jSpec" CODE="Visua.class"WIDTH="400" HEIGHT="250">

<PARAM NAME="SOURCE" VALUE="##TITLE=Furan, tetrahydro-##JCAMP-DX=4.24##DATA TYPE=MASS SPECTRUM##CAS REGISTRY NO=109-99-9##EPA MASS SPEC NO=61352##MOLFORM=C4H8Oetc.">

</APPLET>

Both plugins and Java applets are primarily intended to read external data files, normally in alegacy format (.mol, .pdb, .dx etc.) and to embed a graphical display of this data within an HTMLpage. In both cases, special tags are required:-

<EMBED NAME="Plugin" SRC="datafile.mol" WIDTH="300" HEIGHT="300">

<APPLET NAME="Applet" CODE="applet.class" WIDTH="300" HEIGHT="300">

CML is normally embedded into a mixed XML document, and therefore it is not easily accessibleby a plugin. Applets are more flexible and can receive data in a wider variety of ways, this makesthem significantly more useful for XML display (until a plugin is written to read CML natively, ashas been done for SVG). Stylesheets can be written to recognise CML components and buildappropriate <APPLET> tags to display them. Since all applets are designed to read legacy formats,the data within the components must also be transformed (called a CML to legacy conversion).The transformed data string can then be passed to the applet in two different ways (depending onwhich the applet supports). The easiest technique is to place the string within the applet's> tag - see .

ChiMeraL - xmldoc_writeup (May 22, 2000) - 14

Page 15: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

An alternative technique is to place the transformed string within a hidden <INPUT> . These arenormally used for HTML forms and can not be seen by the reader. Here it becomes a convenientholder for the transformed string and Java Script is then used to call a public function on theapplet and pass it the contents of the INPUT. The Java Script can be triggered by an HTML buttonor when the page has completely loaded into the browser. This is less direct than using <PARAM> sbut allows more powerful interactions between the page and the applet since other functions canalso be made accessible to Java Script. This approach is used for the JME applet described in thenext chapter (Figure 4.3.1).29

Figure 3.4.3: Data flow in an XML/XSL system (SVG diagram)

A selection of applets have been chosen depending on their suitability for XML integration. It isdesirable that the applets be small (faster downloads), flexible, easy to use and open source(since this allows changes to be made to the applet code if required). XSL 'template fragments'have been built for each of these applets and for several common CML to legacy conversions(available from http://www.ch.ic.ac.uk/chimeral). Existing stylesheets can be upgraded tocomprehend CML by the simple addition of these fragments. By combining these stylesheets, withapplets and the schema complete CML applications can be built (Figure 3.4.3).

ChiMeraL - xmldoc_writeup (May 22, 2000) - 15

Page 16: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

4 - Molecular Structures

The most commonly used file formats for molecular structures are MDL's .mol13 and theBrookhaven protein database .pdb. As its name suggests, the Brookhaven format is normallyused for large molecules or proteins and can be extremely complicated (beyond the scope of thisproject). Small and medium sized molecules (< 500 atoms) often use the .mol format, an exampleof which is shown in Figure 4.1.1.

4.1 - MDL .mol

This uses a strict new-line delimited structure with a heavily implied syntax. As a result it is verysensitive to the amount of white (empty) space on each line. This can cause serious problemswhen converting to and from this format from CML. The format has the advantage of beingextremely terse particularly when compressed (relevant when it was developed but much less sonow). Other file formats tend to use rather similar syntaxes (e.g. .xyz lacks the additional atomcolumns and the bond data).

Figure 4.1.1: Ethanol using the .mol format

ethanolfrom CML

9 81.0303 0.8847 0.9763 C 0 0 0 0 0 0 0 0 0 0 0 01.8847 1.9889 1.5717 C 0 0 0 0 0 0 0 0 0 0 0 03.1883 1.4807 1.7425 O 0 0 0 0 0 0 0 0 0 0 0 00.0000 1.2330 0.8324 H 0 0 0 0 0 0 0 0 0 0 0 00.9949 0.0000 1.6255 H 0 0 0 0 0 0 0 0 0 0 0 01.4097 0.5558 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 01.9050 2.8742 0.9059 H 0 0 0 0 0 0 0 0 0 0 0 01.4753 2.3225 2.5456 H 0 0 0 0 0 0 0 0 0 0 0 03.7056 2.1820 2.1139 H 0 0 0 0 0 0 0 0 0 0 0 0

1 2 1 0 0 0 01 4 1 0 0 0 01 5 1 0 0 0 01 6 1 0 0 0 02 3 1 0 0 0 02 7 1 0 0 0 02 8 1 0 0 0 03 9 1 0 0 0 0

M END

• rows 1-3:Comment lines; sometimes these contain a title, a filename and the source of the file but there is nocommon convention. These lines are normally discarded on converting to CML (note line 3 is blank).

• row 4:The first digit is the total number of atoms in the molecule, the second is the total number of bonds(hydrogen atoms and bonds to them, are sometimes neglected).

• rows 5-13:Each row refers to an atom and gives (in order) Cartesian coordinates x, y and z (in Angstrom) and theelement type (as its periodic table letter). The remaining columns are used for electrons, charges etc.and are not yet included in CML, since mol files rarely use them.

• rows 14-21:Each row refers to a bond. The first two digits refer to the atoms the bond is between (hence 2 3means a bond between the second and third atoms in the list above) and the third digit refers to theempirical bond order (2 = double bond etc.) Again, the remaining columns are rarely used.

• row 22:M END indicates the end of the molecule. Files that contain multiple .mol molecules (e.g. reaction .rxnor .sd files) also include a delimiter between molecules ($MOL or $$$$).

ChiMeraL - xmldoc_writeup (May 22, 2000) - 16

Page 17: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

4.2 - CML: molecule

CML molecular structures are reminiscent of the .mol format but with the data being fully markedup and explicitly defined. Implied syntax is reduced to a minimum and conventions are declarednot assumed. The aim is to make each chemical component (coordinate, atom, bond etc.)completely separable from the rest of the document, allowing components to be easily added orremoved without destroying the document tree. Since an XML document might contain a verylarge number of molecules (examples containing over 600 molecules have been built), eachcomponent requires a unique @id. A large collection of mol files can be easily converted to CMLand then concatenated to a single XML document. A stylesheet can then pick out a singlemolecule and display its structure by searching against @id .

Figure 4.2.1: Example of small molecule markup - see Appendix B for alternative methods

..<molecule title="ethanol" id="mol_ethanol">

<formula>C2 H6 O</formula><string title="CAS">64-17-5</string><!-- etc --><list title="atoms">

<atom id="ethanol_a_1"><integer builtin="atomId">1</integer><float builtin="x3" units="A">1.0303</float><float builtin="y3" units="A">0.8847</float><float builtin="z3" units="A">0.9763</float><string builtin="elementType">C</string>

</atom><!-- eight further atoms -->

</list><list title="bonds">

<bond id="ethanol_b_1"><integer title="bondId">1</integer><integer builtin="atomRef">1</integer><integer builtin="atomRef">2</integer><integer builtin="order" convention="MDL">1</integer>

</bond><!-- seven further bonds -->

</list></molecule>..

As with .mol, lists of atoms and bonds are used. Atoms contain either 3 (x3, y3, z3) or 2 (x2, y2)Cartesian coordinates, an atom type and an 'atomID'. This is a non-unique label referenced by thebond's 'atomRef' values and not the same as the atom's unique @id (hence <integer

builtin="atomRef">1</integer> refers to an atom of <integer

/integer> not @id="1" ). The same applies to 'bondId'. While not optimal, this greatly eases theconversion of CML to and from legacy formats. Coordinate units are no longer implied and sincethis molecule was converted from a .mol file, the bond order convention is 'MDL'.

Additional information (formula, CAS number) is not extracted from the mol file (while suchinformation might be supplied using comments, it is not automatically identifiable) but from variousonline databases - in particular ChemFinder.30 The author may decide to include any informationthey wish in this way, simply by adding additional elements at appropriate places in the CML.

4.3 - Java Molecular Editor (JME)

The JME (Java Molecular Editor) applet has been developed as part of online chem-informaticssystem at Novartis Crop Protection AG in Basel.29 It is a simple and extremely small (31K) 2Deditor, incorporating a sophisticated SMILES calculator (SMILES is a standard for producing textdescriptions of a molecule's structure). Rather than include filters for legacy file formats andgreatly increasing the applet's size, JME reads a simple coordinate and bond string. The syntax

ChiMeraL - xmldoc_writeup (May 22, 2000) - 17

Page 18: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

used is unique to JME and since there is no way to dynamically calculate it from a standard datafile, is a problem for HTML solutions. Using CML, this problem is solved with a trivial stylesheettransformations. This makes the applet excellently suited for our purposes (Figure 4.3.1).

Figure 4.3.1: JME editor with Java Script integration - toolbars used for editing, left click/drag to translate, right click/drag to rotate

Figure 4.3.3: JME stylesheet fragment - reformatted for clarity

..<!-- match <cml> and build HTML page --><xsl:template match="cml"><HTML>

<HEAD><TITLE><xsl:value-of select="@title"/> - <xsl:value-of select="@id"/></TITLE></HEAD><BODY ONLOAD="document.JME.readMolecule(jmeoutput.value)"><xsl:apply-templates select="molecule"/></BODY>

</HTML></xsl:template>..<!-- match molecule and display using JME --><xsl:template match="molecule">

<APPLET CODE="JME.class" NAME="JME" ARCHIVE="JME.jar" WIDTH="400" HEIGHT="300">You have to enable Java and JavaScript on your machine !</APPLET><!-- hidden form element contains the JME data string --><xsl:element name="INPUT">

<xsl:attribute name="NAME">jmeoutput</xsl:attribute><xsl:attribute name="TYPE">hidden</xsl:attribute><xsl:attribute name="VALUE">

<!-- select last atom and calculate its number --><xsl:for-each select="list/atom[end()]" xml:space="preserve">

<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval></xsl:for-each><!-- select last bond and calculate its number --><xsl:for-each select="list/bond[end()]" xml:space="preserve">

<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval></xsl:for-each><!-- select each atom and extract coordinates and atom type --><xsl:for-each select="list/atom" xml:space="preserve">

<xsl:value-of select="*[@builtin = 'elementType']"/><xsl:value-of select="*[@builtin = 'x3']"/><xsl:value-of select="*[@builtin = 'y3']"/>

</xsl:for-each><!-- select each bond and extract atom refs and bond order --><xsl:for-each select="list/bond" xml:space="preserve">

<xsl:value-of select="*[@builtin = 'atomRef']"/><xsl:value-of select="*[@builtin = 'order']"/>

</xsl:for-each></xsl:attribute>

</xsl:element></xsl:template>..

The string read by JME uses the syntax;

n_atoms n_bonds {atomic_symbol x_coord y_coord}per atom {atom1 atom2

}per bond

This string is built by the stylesheet and held in hidden <INPUT> labelled with the name'jmeoutput'. A Java Script 'onLoad' function is also added to the HTML's <BODY> tag and this iscalled when the page has been completely loaded into the browser. The script takes the content

ChiMeraL - xmldoc_writeup (May 22, 2000) - 18

Page 19: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

of 'jmeoutput' and passes it to a public 'readMolecule' function on the applet which proceeds todisplay it. The 'Get SMILES' and 'Get JME Source' buttons work in a similar fashion.Stereoisomers and simple reactions can also be displayed using JME but haven't yet beenincorporated into CML.

Some additional XSL commands should be explained at this point; <xsl:element> andxsl:attribute> create new XHTML/XML tags in the output and are used as an alternative towriting the tags directly. For example, <xsl:element name="INPUT"><xsl:attribute

name="NAME">jmeoutput</xsl:attribute></xsl:element> is equivalent to <INPUT

NAME="jmeoutput"/> . They are used to build complex tags that might otherwise cause thestylesheet to become invalid XML.

The <xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval> command uses IE 5specific functions to calculate the position of a element with respect to its siblings and return thisas a formatted number. It is used in almost all the stylesheet fragments, normally to count the totalnumbers of atoms and bonds in the molecule. Non IE stylesheets would use different functionsdepending on the parser they are designed for.

4.4 - Jmol 3D Viewer

Jmol31 is a stand alone Java application (not an applet) being developed by the Open Scienceproject.32 It is designed to read and display molecules using a large variety of legacy formats. Itsfunctionality is similar to the Chime plugin but it is open source and platform independent. Anexperimental applet has been released and this has been adapted for use in this project. Astylesheet converts the CML <molecule> to a % delimited .xyz string, which is then stored andtransferred to the applet in a manner similar to JME. The '%' delimiters must be added becausethe converted string can not replicate the line breaks required in the .xyz format (similar to .mol).The applet has been slightly rewritten to recognise % as equivalent to a new line.

Figure 4.4.1: Jmol 3D viewer and .xyz format - click/drag to rotate, shift-c/d to zoom, ctrl-c/d to translate, TAB changes drawing style, Ladds atom labels

17histamine from CML

C -1.7560 0.0000 0.3080C -0.3600 0.0760 0.2880N 0.1400 -1.1520 0.5080C -0.9160 -1.9680 0.6520N -2.0640 -1.2800 0.5320C 0.4520 1.3600 0.0640C 1.9560 1.0720 0.1080N 2.7360 2.3040 -0.1040H -2.4520 0.8120 0.1640H -0.8520 -3.0320 0.8400H 0.1880 2.0880 0.8320H 0.1840 1.7800 -0.9040H 2.2240 0.3480 -0.6600H 2.2240 0.6560 1.0840H 2.4960 3.0320 0.6680H 2.4920 2.7240 -1.0840H -3.0120 -1.7840 0.6240

Problems also occur with white space, .xyz and .mol require strict space handling to ensure theirdata remains in straight columns. Both formats use right aligned text, in contrast to XML/HTMLstandard left alignment:

ChiMeraL - xmldoc_writeup (May 22, 2000) - 19

Page 20: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

right: C -1.7560 0.0000 0.3080% left: C -1.7560 0.0000 0.3080%C -10.3600 3.0760 5.2880% C -10.3600 3.0760 5.2880%H 8.1400 -11.1520 7.5080% H 8.1400 -11.1520 7.5080%

In both cases, the stylesheet must check each coordinate and add a varying amount white spacein front of it depending on its size and sign (+ or -) - unfortunately this complication is unavoidable.As with JME, the converted string is contained within a named <INPUT> and then passed byonLoad="document.JMolApplet.setModelToRenderFromXYZString(jmoloutput.value,'T')"

to the applet.

Figure 4.4.3: Jmol stylesheet fragment - reformatted for clarity

..<xsl:template match="molecule">

<!-- build applet --><APPLET CODE="org.openscience.miniJmol.JmolApplet.class"

NAME="JMolApplet" ARCHIVE="JmolApplet.jar"WIDTH="300" HEIGHT="300">

You have to enable Java and JavaScript on your machine !</APPLET><xsl:for-each select="id(@idref)"><!-- build INPUT containing the xyz source --><xsl:element name="INPUT">

<xsl:attribute name="NAME">jmoloutput</xsl:attribute><xsl:attribute name="TYPE">hidden</xsl:attribute><xsl:attribute name="VALUE"><!-- start convertion to .xyz --><!-- select last atom and add its number --><xsl:for-each select="list/atom[end()]"xml:space="preserve"><xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval></xsl:for-each>%<!-- add title --><xsl:value-of select="@title"/> from CML%<xsl:for-each select="list/atom" xml:space="preserve">

<xsl:for-each select="string[@builtin='elementType']"><xsl:value-of select="."/><!-- single letter elements with require an additional space --><xsl:if test=".[. = 'H']"> </xsl:if><xsl:if test=".[. = 'B']"> </xsl:if><xsl:if test=".[. = 'C']"> </xsl:if><xsl:if test=".[. = 'N']"> </xsl:if><xsl:if test=".[. = 'O']"> </xsl:if><!-- etc -->

</xsl:for-each><!-- for each atom, extract coord adding white space as required --><xsl:if test="float[@builtin='x3' and . $lt$ 10]"> </xsl:if><xsl:if test="float[@builtin='x3' and . >= 0]"> </xsl:if><xsl:value-of select="float[@builtin='x3']"/><!-- repeat for y3 and z3 -->%</xsl:for-each><!-- finish convertion to .xyz --></xsl:attribute>

</xsl:element></xsl:template>..

4.5 - Marvin and Structure Drawing Applet (SDA)

Marvin is an commercial applet produced by Chemaxon33 and is free to academic users. It is awire frame 3D viewer able to accept .mol data using external files, as a <PARAM> or via JavaScript.The applet is very configurable and includes a 2D editing function. SDA (Structure Drawing) is a freeware editor developed by ACD Labs and also accepts .mol data via <PARAM> .34

ChiMeraL - xmldoc_writeup (May 22, 2000) - 20

Page 21: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 4.5.1: Marvin (left) and SDA applets - Marvin: left click/drag to rotate, shift-lc/d to zoom, ctrl-lc/d to translate, right click for menu -SDA: tool bars contain editing tools, applet can be 'floated' to a separate window by clicking the top left button

Figure 4.5.3: Marvin stylesheet fragment - reformatted for clarity

..<xsl:template match="molecule">

<APPLET CODE="MView" ARCHIVE="marvin.jar" WIDTH="300" HEIGHT="300"><xsl:element name="PARAM">

<xsl:attribute name="NAME">mol</xsl:attribute><xsl:attribute name="VALUE">

<!-- start conversion to .mol --><xsl:value-of select="@title"/>\

from CML\\<!-- select last atom, calculate its number and required white space --><xsl:for-each select="list/atom[end()]/integer[@builtin='atomId']" xml:space="preserve">

<xsl:if test=".[text() $lt$ 100]"> </xsl:if><xsl:if test=".[text() $lt$ 10]"> </xsl:if><xsl:for-each select="../../atom[end()]">

<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval></xsl:for-each>

</xsl:for-each><!-- select last bond, calculate its number and required white space --><xsl:for-each select="list/bond[end()]/integer[@title='bondId']" xml:space="preserve">

<xsl:if test=".[text() $lt$ 100]"> </xsl:if><xsl:if test=".[text() $lt$ 10]"> </xsl:if><xsl:for-each select="../../bond[end()]">

<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval></xsl:for-each>

</xsl:for-each>\<!-- for each atoms, extract coord and type, with required white space --><xsl:for-each select="list/atom" xml:space="preserve">

<xsl:if test="float[@builtin='x3' and . $lt$ 10]"> </xsl:if><xsl:if test="float[@builtin='x3' and . >= 0]"> </xsl:if><xsl:value-of select="float[@builtin='x3']"/><!-- repeat for y3 and z3 --><xsl:for-each select="string[@builtin='elementType']">

<xsl:value-of select="."/><!-- elements with single letter names require an additional space --><xsl:if test=".[. = 'H']"> </xsl:if><xsl:if test=".[. = 'B']"> </xsl:if><xsl:if test=".[. = 'C']"> </xsl:if><xsl:if test=".[. = 'N']"> </xsl:if><xsl:if test=".[. = 'O']"> </xsl:if><!-- etc. --> 0 0 0 0 0 0 0 0 0 0 0 0\

</xsl:for-each></xsl:for-each><!-- for each bond, extract atomRefs and order, with required white space --><xsl:for-each select="list/bond" xml:space="preserve">

<xsl:for-each select="integer[@title='atomRef']"><xsl:if test=".[. $lt$ 100]"> </xsl:if><xsl:if test=".[. $lt$ 10]"> </xsl:if><xsl:value-of select="."/>

</xsl:for-each><xsl:value-of select="integer[@builtin='order']"/> 0 0 0 0\

</xsl:for-each>M END<!-- finished convertion to mol -->

</xsl:attribute></xsl:element>

You must have Java turned on!</xsl:element>

</xsl:template>..

The stylesheets for these applets are very similar and use the <PARAM> mechanism. As with Jmol,line delimitation is required for the mol file and is built into both applets (Marvin uses "\" and SDA"|"). Like .xyz, handling white space is complex particularly since Marvin in particular is very

ChiMeraL - xmldoc_writeup (May 22, 2000) - 21

Page 22: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

intolerant of errors. The following XSL fragment works for most small molecules but a more robustversion will need to be developed when more features are available in Internet Explorer's XSLengine (in particular the ability to compare of data from different elements).

Experiments have shown that the CML and Marvin applet scales unexpectidly well and candemonstrably display over 300 molecules on a modern PC.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 22

Page 23: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

5 - Spectra

Spectra are regularly stored and exchanged using an electronic format and modern spectrometerscan save their data directly to disk. Electronically stored spectra can be easily indexed, searchedagainst or otherwise manipulated and large databases of spectra files exist online (e.g. the NISTWebbook).35 As with molecular structures, a number of legacy formats are in use the mostcommon being JCAMP-DX50 and SPC.36 One of the aims of this project was to extend existingCML to incorporate spectra markup and to develop techniques for displaying it.

5.1 - JCAMP-DX/CS

The JCAMP-DX format defines a series of conventions for the storage and exchange of spectraldata between spectrometers, lab and personal computers. Files using this format are written usingASCII and contain a series of labelled data records (LDR). Each LDR is delimited by a new lineand a double hash:

##LDRname= value

While having a much simpler structure than XML, these LDRs can be considered as a form ofmarkup with the LDR name (data label) being equivalent to the element name. This isunsurprising since the two formats share a number of important design criteria, such asextensibility, human/machine readable, flexibility and platform independent. Approximately 25reserved data labels were originally defined for JCAMP-DX. These are intended to contain:

• Metadata - (e.g. ##ORIGIN, ##OWNER, ##DATE)• Header information - (e.g. ##TITLE, ##DATA TYPE, ##BLOCKS, ##END)• Required spectral parameters - (e.g. ##XUNITS, ##FIRSTX, ##MAXX)• Optional spectral parameters - (e.g. ##RESOLUTION, ##DELTAX)• Spectral data - (e.g. ##XYDATA, ##XYPOINTS, ##PEAK TABLE, ##PEAK ASSIGNMENTS)• Equipment - (e.g. ##INSTRUMENT PARAMETERS)• Sample information - (e.g. ##CAS NAME, ##MOLFORM, ##MP)

Figure 5.1.1: Tetrahydrofuran mass spectrum using the JCAMP-DX format

##TITLE= Furan, tetrahydro-##JCAMP-DX= 4.24##DATA TYPE= MASS SPECTRUM##XUNITS= M/Z##YUNITS= RELATIVE ABUNDANCE##FIRSTX= 24##LASTX= 73##FIRSTY= 4##MINY= 3##MAXY= 3##NPOINTS= 32##PEAK TABLE= (XY..XY)24, 4 25, 30 26, 171 27, 1545 29, 792 30, 21 31, 258 33, 5 34, 26 35, 10 37, 105 38, 165 39, 1182 40, 906 41, 4060 42,9999 43, 1586 44, 325 45, 100 46, 3 49, 7 50, 11 51, 19 53, 27 54, 10 55, 6 68, 7 69, 20 70, 107 71, 2557 72, 2868 73, 147##END=

An example of a simple JCAMP-DX file is given in Figure 5.1.1. Important LDRs are #DATA TYPEwhich describes the type of spectrum and ##PEAK TABLE that contains the data points. Avariable list label following ##PEAK TABLE describes how the data is tabulated. Data is groupedinto XY or XYZ/XYM coordinates and separated by commas within the group, while groups areseparated by semicolons or spaces. Any additional white space is ignored. Common variable listsare;

ChiMeraL - xmldoc_writeup (May 22, 2000) - 23

Page 24: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

• (XY..XY)Grouped XY pairs.23,102 19,89 15,87 ..

• (XYZ..XYZ) or (XYM)Grouped XYZ or XYM (can be used to give multiplicity).23,102,S 19,89,S 15,87,D ..

• (X++(Y..Y))More complex. Each line starts with an X value, then a series of Y value at equal X spacings, so -12 52 48 46 43 4222 43 38 36 35 31is equivalent to -12,52 14,48 16,46 18,43 20,42 22,43 24,38 26,36 28,35 30,31

JCAMP-DX also allows the use of user defined data labels. These are distinguished by a $ prefixand are specific to a particular instrument, location, user etc. This effectively means that JCAMPis an extensible language but does not include any procedure for systematically describing aLDR's meaning. This seriously limits the use of user defined data labels beyond the originalauthor. A number of JCAMP 'sub-languages' have been defined in an attempt to solve this.37 38

These contain lists of LDRs intended for particular areas of spectroscopy (e.g. UV/Vis, H1 NMR,MS). Particularly interesting is JCAMP-CS39 which allows the addition of limited structuralinformation. These JCAMP files are split into a number of separate blocks, each containing adifferent type of information (e.g. molecular structure, peak table and peak assignments). Thisinformation can be extracted and marked up as CML, and a JCAMP-CS to CML converter hasbeen written for this purpose.

5.2 - CML: spectrum

Since a well established (if non-XML) markup language already exists, many technical difficultieshave already been solved. Rather than develop new conventions for spectral markup, we havechosen to incorporate those used by JCAMP. It is expected that most users will wish to convertCML to and from JCAMP so this needs to be as easy as possible. Once converted, spectra canincorporated into XML documents and displayed using techniques similar to those used formolecules.

Comparing Figure 5.1.1 (JCAMP) with Figure 5.2.1 illustrates the approach used. A new element<spectrum> is defined as an extension to CML, and this is given a new namespace chimeral: .The contents of JCAMP's labelled data records are then mapped to a <string> or <float> withan appropriate @title value. Reserved LDRs (##TITLE, ##XUNIT etc.) are directly recognised andgiven lower case values to better comply with XML conventions. User-defined or unrecognisedLDRs can either be ignored or, more correctly, given @title="LDRname" . This avoids anyinformation loss on converting JCAMP to CML.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 24

Page 25: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 5.2.1: Example of spectra markup - see Appendix B for more details

..<chimeral:spectrum title="Furan, tetrahydro-" id="spect_furantetrah_ir_1"

convention="JCAMP-DX= 4.24"><string title="datatype">MASS SPECTRUM</string><string title="EPA">61352</string><string title="origin">D.HENNEBERG, MAX-PLANCK INSTITUTE, MULHEIM, WEST GERMANY</string><string title="owner">NIST Mass Spectrometry Data Center</string><float title="xunits">M/Z</float><float title="yunits">RELATIVE ABUNDANCE</float><float title="firstx" convention="M/Z">24</float><float title="lastx" convention="M/Z">73</float><float title="xfactor">1</float><float title="firsty" convention="RELATIVE ABUNDANCE">4</float><float title="miny" convention="RELATIVE ABUNDANCE">3</float><float title="maxy" convention="RELATIVE ABUNDANCE">9999</float><float title="yfactor">1</float><float title="npoints">32</float><list title="peak table" convention="(XY..XY)">

<coordinate2 id="furantetrah_ir_c_1">24, 4</coordinate2><coordinate2 id="furantetrah_ir_c_2">25, 30</coordinate2><coordinate2 id="furantetrah_ir_c_3">26, 171</coordinate2><coordinate2 id="furantetrah_ir_c_4">27, 1545</coordinate2><!-- 28 more coordinate2s -->

</list></chimeral:spectrum>..

The delimiters between data groups and between data within a group are only loosely defined inthe JCAMP specifications. As a result, the syntax used for ##XYDATA and ##PEAKTABLE isextremely variable. In particular, while the (X++(Y..Y)) convention is common, it is distinctlyconfusing for the non specialist and is extremely difficult to store using XML elements. It issuggested that CML spectra should only use the (XY..XY) or (XYM)/(XYZ) conventions and witheach pair or triplet be marked up using <coordinate2> and <coordinate3> respectively. Datawithin these elements is separated by the use of a comma and a white space (', '). More complexconventions are deconvoluted during the JCAMP to CML conversion process, e.g. (X++(Y..Y)) isexpanded to (XY..XY).

Since a huge number of LDRs have been defined, we have chosen to include only a smallnumber of generic ones in this project. If an author wishes to use additional LDRs, thse need to beincluded into the converter and stylesheets.

5.3 - JSpec - JCAMP and SPEC display applet

Jspec is an open source applet written by Guillaume Cottenceau.40 It is small (31K) and wasdesigned to display JCAMP and SPC spectra files. It also incorporates a number of usefulfeatures, including zooming, peak finding and integration.

Figure 5.3.1: JSpec spectra viewer showing caffeine mass and UV/Vis spectra - click drag to select an area then click buttons

The stylesheet is similar to that used for the Marvin applet. CML is converted to a (cleaned up)JCAMP text string and this is built into the applet's <PARAM> . Line delimitation is not required

ChiMeraL - xmldoc_writeup (May 22, 2000) - 25

Page 26: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

since JCAMP already has a ## separator, but the applet was slightly adapted to acceptcomma-separated ##PEAKTABLE and ##XYDATA groups. Infra red spectra can be reversed (Xaxis) but not inverted (Y axis) this functionality might be built into the applet at a later date.

Figure 5.3.3: Jspec stylesheet fragment - reformatted for clarity

..<!-- you must include the namespace in the template match --><xsl:template match="chimeral:spectrum"><!-- build applet -->

<APPLET CODE="Visua.class" WIDTH="400" HEIGHT="250"><xsl:element name="PARAM">

<xsl:attribute name="NAME">SOURCE</xsl:attribute><xsl:attribute name="VALUE" xml:space="preserve">

<!-- start conversion to JCAMP --><!-- * means 'any' within a pattern match -->##TITLE= <xsl:value-of select="@title"/>##<xsl:value-of select="@convention"/>##DATA TYPE= <xsl:value-of select="*[@title='datatype']"/>##XUNITS= <xsl:value-of select="*[@title='xunits']"/>##YUNITS= <xsl:value-of select="*[@title='yunits']"/>##FIRSTX= <xsl:value-of select="*[@title='firstx']"/>##LASTX= <xsl:value-of select="*[@title='lastx']"/>##FIRSTY= <xsl:value-of select="*[@title='firsty']"/>##MINY= <xsl:value-of select="*[@title='miny']"/>##MAXY= <xsl:value-of select="*[@title='miny']"/>##NPOINTS= <xsl:value-of select="*[@title='npoints']"/><xsl:for-each select="*[@title='peak table' and @convention='(XY..XY)']">##PEAK TABLE= (XY..XY)<xsl:for-each select="coordinate2"><xsl:value-of/>,</xsl:for-each></xsl:for-each><!-- repeat for 'xypairs (XY..XY)' and 'peak table (XYM)' -->##END=<!-- end conversion' -->

</xsl:attribute></xsl:element><!-- Reverses IR spectra --><xsl:if test="*[@title='datatype' and .='INFRARED SPECTRUM']">

<PARAM NAME="WAY" value="REVERSE"/></xsl:if>

</APPLET></xsl:template>..

Further development is required to include peak assignment tables into the CML <spectrum> .We would also like to develop mechanisms by which applets can communicate with each otherand the XML document (probably via Java Script). One can envisage clicking on a peak in aspectrum and having the parser look it up in a CML peak assignments table, it would thenhighlight the appropriate functional group on a molecular display.

5.4 - Scalable Vector Graphics (SVG)

SVG (scalable vector graphics) is an XML language used to describe vector images.5 Mostimages on the web use bitmapped formats (.gif, .jpg, .png), which use a grid of coloured pixels toform the picture. In contrast, vector images use Cartesian descriptions of drawing objects (lines,polygons, fills, text) and overlay them onto a blank 'canves'. Vector formats are most efficient forsimple line images and diagrams, where the files they produce will be significantly smaller thanthe equivalent bitmap. Since the drawing objects are described mathematically, vector images canalso be zoomed and resized without pixellating.

Vector formats are particularly suited to XML markup and SVG is a language under rapiddevelopment by the W3C. Recently added functionality includes shaded fills, colour transparencyand embedded true type fonts. Filters allowing files to be exported as SVG are available for CorelDraw41 and Adobe Illustrator (popular drawing packages), and Adobe has released an SVG6 for Internet Explorer and Netscape Navigator.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 26

Page 27: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 5.4.1: SVG molecule and infra red spectrum - right click for menu

The examples given in Figure 5.4.1 were produced from CML by Peter Murray Rust, using the XTparser. SVG promises to be an excellent alternative to applets for the display of static molecules,reactions and spectra. Unfortunately the IE 5 XSL engine does not yet include the functionsnecessary to carry out CML to SVG stylesheet transformations. It is hoped that later releases ofthe browser will rectify this. In addition, the Adobe plugin can only accept external SVG files and isunable to display SVG incorporated into an XML document. One solution would be to use an SVGapplet, but these are still being developed.42 43

Several diagrams and flow charts in this report use SVG; see Figure 1.2.1, Figure 2.1.1, Figure3.4.3 (right click on the diagrams and select 'zoom in').

ChiMeraL - xmldoc_writeup (May 22, 2000) - 27

Page 28: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

6 - Reactions

Element <reaction> was included in the CML DTD but not elaborated further. The techniquespresented here are experimental and further work is required to develop robust and flexiblereaction markup. It is hoped that these explorations might provide a starting place for futuredevelopment.

Reaction markup could potentially get extremely complicated, particularly if atom mapping orfunctional groups are included. For our initial experiments, we have concentrated on using CML todisplay simple reaction schemes within the IE browser. This can be considered as an extension of<molecule> markup and is approached in a similar way.

6.1 - Linking via @id

A reaction is considered as a series of molecular species; potentially including reactants,products, reagents, intermediates, transition states etc. These species are then grouped intoreaction steps with suitable information given at each step (e.g. reaction conditions, yield, reactionname). If the structure of each species is expressed as a CML <molecule> and supplied with aunique @id value, then a reaction step need only contain references to each species' @id (via@href="idname"> ) and need not contain the entire molecule. This keeps the CML succinct,enhancing human readability. It also enhances the reusability of chemical components; a library ofreactions using the same reactive species needs only contain one copy of its structure. A exampleof a simple CML reaction is given in Figure 6.1.1 with the molecular structures omitted. A fulldescription of reaction markup can be found in Appendix B.

The attribute names @id and @href are those chosen as linkers for CML. Other languages mightuse different attributes and the parser needs which these are. IE uses three data type declarationsin the schema for this purpose. An attribute declared as dt:type="id" contains an element'sunique identity, dt:type="idref" contains a reference and dt:type="idrefs" may contain oneor more references (space separated). The parser is required to ensure all identities containunique strings and that all references refer to an existing identity.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 28

Page 29: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 6.1.1: Simplified 'stepwise' reaction - Diels Alder Cycloaddition

..<cml title="Simple Reaction" id="cml_simple">

..<reaction title="Diels-Alder cycloaddition" id="simple_rxn_1" convention="stepwise">

<!-- overall information --><string title="description">Simple example of a A + B -> C reaction.</string><float title="yield" units="%">88</float><string title="notes">taken from Vollhardt and Schore</string><list title="reactionStep" id="simple_s_1">

<!-- reaction step information --><string title="description">cycloaddition</string><string title="notes">one step</string><!-- series of links to each reaction species --><link title="reactant" href="simple_mol_reactant1"/><link title="reactant" href="simple_mol_reactant2"/><!-- reagent could contain: a reagent name,

link to a structure or as here, a list ofconditions. multiple reagents can be includedif needed -->

<link title="reagent"><integer title="index">1</integer><string title="solvent">Acetonitrile</string><string title="temperature" convention="degC">100</string><string title="duration" convention="hours">3</string><string title="notes">reflux</string>

</link><link title="reagent">

<integer title="index">2</integer><string title="notes">workup</string>

</link><link title="product" href="simple_mol_product"/><!-- could also include catalyst, intermediate,

transition state etc. --></list>

</reaction>..<!-- the actual molecules can be anywhere, even in a different cml block --><molecule title="2,3-Dimethyl-1,3-butadiene" id="simple_mol_reactant1"><!-- etc. --></molecule>..<molecule title="Propenal" id="simple_mol_reactant2"><!-- etc. --></molecule>..<molecule title="Diels-Alder adduct" id="simple_mol_product"><!-- etc. --></molecule>..

</cml>..

Links of this sort are particularly useful for comparing or indexing two sets of data against eachother. Further chemical examples might be atom mapping (in a reaction) or peak assignment for aspectrum. Significantly more advanced linking techniques called XLink44 and XPointer45 are beingdeveloped by the W3C. These allow for one or two way linking between documents and complexone to many or many to one links. Using these techniques, a <reaction> in a XML documentcould reference data-sheets in a chemical archive. By clicking on a reaction species, the readerwould then have access to full chemical data for that molecule.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 29

Page 30: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

Figure 6.1.2: Stepwise epoxidation reaction (directly converted from a rxn file)

Two stylesheets have been developed to display CML reactions. Both use the Marvin applet (in2D mode) to display the reaction species and an HTML table to format the applets into a reactionschema (the arrows are normal gif images). Additional information (reaction conditions etc. ) isdisplayed above each reaction step or below the arrow, as appropriate. The first stylesheetFigure 6.1.3) is able to display multiple reaction steps of the form A + B + .. -> X + Y + ..

Figure 6.1.3: Stepwise reaction stylesheet fragment - reformatted for clarity

..<!-- use this template for a stepwise reaction --><xsl:template match="reaction[@convention='stepwise']">

<!-- build HTML table --><TABLE><TR>

<TD COLSPAN="2"><B>Stepwise Reaction:</B> <xsl:value-of select="@title"/></TD><TD><B>id:</B> <xsl:value-of select="@id"/></TD>

</TR><!-- etc. for 'description', 'yield' and 'notes' --><!-- build a reaction scheme for each step --><xsl:for-each select="list[@title='reactionStep']"><TR><TD COLSPAN="3">

<TABLE><TR><TD>

<!-- select each link to a reactant, then find and select the elementswith matching @id (this is what id(@href) does) --><xsl:for-each select="link[@title='reactant']">

<xsl:for-each select="id(@href)"><!-- call a template that builds Marvin applet --><xsl:apply-templates select="list[./atom/float/@builtin='x3']"/>

</xsl:for-each></xsl:for-each></TD><TD><!-- the arrow is a normal gifs --><IMG SRC="writeupdata/arrow.gif" /><BR/><!-- select each reagent in turn --><xsl:for-each select="link[@title='reagent']">

<xsl:if test="*[@title='index']"><!-- if an index number is given, include it --><xsl:value-of select="integer[@title='index']"/>)

</xsl:if><xsl:for-each select="id(@href)">

<!-- if a structure is given, include it --><xsl:apply-templates select="list[./atom/float/@builtin='x3']"/>,

</xsl:for-each><xsl:for-each select="string[@title='solvent']">

<!-- if a solvent is given, include it --><xsl:value-of select="text()"/>,

</xsl:for-each><!-- etc. for temperature, duration and notes -->

</xsl:for-each></TD><TD><!-- select each link to a product --><xsl:for-each select="link[@title='product']">

<xsl:for-each select="id(@href)"><!-- call a template that builds the Marvin applet --><xsl:apply-templates select="list[./atom/float/@builtin='x3']"/>

</xsl:for-each></xsl:for-each></TD></TR></TABLE>

</TD></TR></xsl:for-each></TABLE>

</xsl:template>..

The code to convert the molecular information to .mol and build the Marvin applet is almostidentical to that described in Chapter 4 and has been omitted. The stylesheet is significantly more

ChiMeraL - xmldoc_writeup (May 22, 2000) - 30

Page 31: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

structured than previous examples, and contains larger amounts of HTML. In particular therequirement for TABLE tags restricts the flexibility of the stylesheet. More complex reactionswould require dedicated stylesheets written specially for them. An example is the catalytic cycleshown in Figure 6.1.4 Alternative approaches would be to use the JME applet (which can handlelimited reactions), or SVG.

Figure 6.1.4: Cross Metathesis (example of a catalytic cycle)

ChiMeraL - xmldoc_writeup (May 22, 2000) - 31

Page 32: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

7 - Converting Legacy Formats to CML

Manually creating CML from scratch is unfeasibly laborious and so alternatives are needed. It ishoped that future chemistry software will include filters enabling them to save data directly toCML, but until this occurs the best solution is to convert existing legacy formats to CML. This hasthe advantage of 'normalising' the existing range of diverse chemical files to a single universalformat. The extensibility of XML (and hence CML) and the use of generic data holders means thatany ASCII based legacy file can be converted to CML without information loss.

CML has been designed to make these conversions as easy as possible. In most cases (e.g. .molor JCAMP) this involves identifying blocks of data and wrapping them in the appropriate CMLelements. This is equivalent to a complex text search and replace operation. Perl (a widely usedtext manipulation and report language) is an excellent tool for carrying out exactly this sort ofmanipulation.46 It is open source and whilst originally designed for Unix, versions are available forall major operating systems. Perl scripts are plain text files and are normally run from thecommand line. With slight alterations, they can also be be run as CGI (common gateway) programs on a web server. This allows text written into a online <FORM> to be sent to the server,converted and returned to the user as an HTML page. This is demonstrated in Figure 7.1.

Warning: you need to be online and the server-returned page will overwrite this one

Figure 7.1: Server side conversion of .mol, .rxn and .dx files to CML - http://www.ch.ic.ac.uk/chimeral/resources/file2cml.html

A range of perl converters have been written and are available for use from the ChiMeraL site.Perl can not easily access compressed data and since both .mol and JCAMP files sometimes use

ChiMeraL - xmldoc_writeup (May 22, 2000) - 32

Page 33: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

compressed sections, they must be expanded before conversion. Describing the workings of eachconverter is beyond this report and interested parties should read the script comments. Availableconverters include;

• MDL .mol to CML• MDL .mol to the JME input string• .sd (archive format) to CML• .xyz to CML• JCAMP .dx and .cs to CML (JCAMP-CS files produce both a <molecule> and a <spectrum> )• REACSS .rxn (reaction format based on .mol) to CML

Converters for Gaussian and Mopac source files (amongst others) are under development.47 Weare experimenting with combining these scripts with a web 'robot' written by G. Gkoutos (ImperialCollege, Chemistry Department). The robot is able to identify and collect a large range of chemicalfile formats. The intention is a construct a large archive of CML molecules and spectra to supportthis project.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 33

Page 34: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

8 - Writing Complete XML Documents

Pure CML documents are best suited to chemical data-sheets or archived data. A more flexibleapproach is to combine chemistry with text and diagrams to from a report or scientific paper.Namespacing allows different XML languages to be intimately mixed and an excellent textformatting language already exists in XHTML. A sophisticated XML document might thereforeconsist of XHTML formatted text, chemistry using CML and diagrams either as bitmaps (gif or jpg)or drawn in SVG. An XML document language is also needed to describe the document'sstructure and to wrap these various components. A simple 'docuML' language has beendeveloped to illustrate these concepts and used to write several documents, including this report.

A series of document structure elements are defined using an XML schema(http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml). These elements can be recognised bytheir namespace prefix 'karne:' and include: <document> , <metadata> , <abstract> , <chapter>, <subsection> and <bibliography> (Figure 8.1). Additional elements are used to label figuresand 'escape' blocks of sample code (this prevents them interfering with real XML).

Figure 8.1: Simplified XML document - outlines the document structure

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="document.xsl" ?><karne:document title="Skeleton" id="xmldoc_skele"

xmlns="http://www.w3.org/1999/xhtml"xmlns:karne="x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml"xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml">

<!-- *********************** metadata *********************** --><karne:metadata>

<karne:date builtin="date:creation">June 12, 2000</karne:date><karne:author idref="insti_ic" id="author_mw" email="[email protected]"href="http://www.ch.ic.ac.uk/chimeral">Michael Wright</karne:author><karne:institution id="insti_ic" href="http://www.ch.ic.ac.uk">Department of Chemistry, Imperial College of Science, Technology and Medicine, UK</karne:institution>

</karne:metadata>

<!-- *********************** abstract *********************** --><karne:abstract>Demonstrates the key elements of an XML document.</karne:abstract>

<!-- *********************** keywords *********************** --><karne:keywords>ChiMeraL, XML, CML, chemical markup language, chemistry</karne:keywords>

<!-- *********************** chapter *********************** --><karne:chapter id="chap_introduction" title="Introduction">

<karne:index/><P>A document is made up of nested chapters and subsection.</P>

<!-- *********************** subsection *********************** --><karne:subsection id="sub_xhtml" title="XHTML">

<karne:index/><P>Elements with no prefix are assumed to be XHTML by the stylesheet and

passed directly to the output. This allows the author to use familiar <I>'HTMLlike'</I> formatting without disrupting other XML blocks. Example code needs tobe escaped;<FONT CLASS="code"><karne:lt/>formula<karne:gt/>C4 H8 O<karne:lt/>/formula<karne:gt/></FONT></P></karne:subsection></karne:chapter>

<!-- *********************** bibliography *********************** --><karne:bibliography title="References">

<karne:ref id="ref_CML1.0">For a formal description of CML version 1.0, see<karne:index>1</karne:index>

<karne:author href="http://www.xml-cml.org/">P. Murray-Rust and H. S. Rzepa</karne:author>

<karne:publication>J. Chem. Inf. Comp. Sci.</karne:publication><karne:date>1999</karne:date><karne:volume>39</karne:volume><karne:pages>928</karne:pages>

</karne:ref></karne:bibliography>

</karne:document>

ChiMeraL - xmldoc_writeup (May 22, 2000) - 34

Page 35: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

An XSL stylesheet has been written combining the CML template fragments with a number oftemplates designed to format the various document components. The stylesheet is generic andable to display any document written using this docuML language. In order to allow the author touse familiar (X)HTML for text formatting, the stylesheet has been designed leave unrecognisedelements unchanged. These pass through to the output, are recognised by the browser as HTMLand formatted appropriately. By setting the document's default namespace to XHTML, no elementprefixes are required and the document can be written using standard HTML editing tools. A CSSstylesheet is also used, allowing changes to the appearance of the transformed document withouthaving to edit the significantly more complex XSL.

Both CML and DocuML components have @id attributes, allowing sophisticated linking behaviour.The stylesheet can display components in any order and is not restricted by the physical order ofthe code in the XML source. CML molecules, spectra and blocks of sample code can beextremely long and would normally interrupt the chapter/subsection structure of the document.This makes it much harder for a human to read or edit. The prefered approach is to separate outCML data and sample code and move them to the end of the document. Links (<karne:linkbuiltin="molecule" display="jmol" idref="mol_thf_1"/> ) can then be used within text orfigures, to reference these external blocks as they are required.

Figure 8.2: Part of an XML document showing linking

..<!-- *********************** subsection *********************** --><karne:subsection id="sub_idlinks" title="Linking"><karne:index/>

<P>Links between @id and @idref are used to automatically build referencesto the bibliography<karne:link builtin="ref" idref="ref_CML1.0"/> or<karne:link builtin="figure" idref="fig_Jmol"/>. The reference will include thecorrect index number, hyperlinks and mouse-over effects, even if the target ismoved or completely rewritten. This helps avoid the notorious HTML problem -"broken links". Links are also used to build the document's index.</P>

<!-- figure title --><karne:figure id="fig_Jmol" builtin="showLabel">Tetrahydrofuran using Jmol</karne:figure>

<!-- this tells the stylesheet to render the CML object with @id="mol_thf_1"using Jmol, changing @display changes the applet used --><karne:link builtin="molecule" display="jmol" idref="mol_thf_1"/></karne:subsection>..<!-- *********************** CML data block *********************** --><cml title="cml data block" id="cml_moldata" xmlns="x-schema:cml_schema_ie_02.xml">

<molecule title="Furan, tetrahydro-" id="mol_thf_1">..

</molecule><chimeral:spectrum title="Caffeine" id="spect_caffeine_ms_1" convention="JCAMP-DX= 4.24">

..</chimeral:spectrum>

</cml>..

Links can also be used to automatically build and maintain literature or figure references. Forexample, the stylesheet can search for the start of each chapter, sub-section and figure, then usethis information to automatically create an index of titles complete with hyperlinks. Documentcomponents can be reordered or even completely rewritten, and these links will be automaticallyupdated. Figure 8.2 shows references to a figure and the bibliography (lines 6 and 7)and a link toa CML molecule (line 17). Although the last instructs the stylesheet to use the Jmol applet torender the molecule, a different applet can be chosen simply by changing the value of @display .

8.1 - Printing XML - Formatting Object and FOP

Formatting Objects (FO) is a sub-language of XSL, intended to describe the layout of documentcomponents on a printed page.24 It comprises of formatting elements rather similar in concept to

ChiMeraL - xmldoc_writeup (May 22, 2000) - 35

Page 36: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

XHTML but describing exact positions and font sizes in contrast to XHTML's relative positioning.FOP is a Java application being developed by Apache.7 When it is combined with a JavaXML/XSL parser (e.g. Xalan) it acts as a formatting object parser and can convert an FO file to anAdobe Acrobat .pdf file. Acrobat is a very widely used for exchanging 'ready for print' electronicdocuments and being a read-only format has important legal implications.

Using multiple stylesheets, an XML document can be transformed for browser display orconverted to a FO file for printing. Since the stylesheets can optimise the document for eachpurpose, the results are much superior to printing directly from the browser. The paper version ofthis report was produced in this way. It is expected that further converters will become availableas this technology matures.

8.2 - In Conclusion

The chemical community is showing increased interest in CML. Further development of thelanguage, its extensions and applications of its use are required to maintain this interest. It ishoped that by demonstrating the first fully operational systems for managing CML and creatingcomplex CML/XHTML documents, we will have provided the basis for this work.

Applications for the CML/XHTML system described here might include: e-journals, safetydata-sheets (for example, as part of FDA drug ratification), patent processing (where microscopiccapture of information is essential, and failure can lead to the paten being challenged) ore-commerce. One can envisage future parsers able to comprehend both CML and XHTML withoutthe requirement for sophisticated stylesheets.49 By including CML support in all computationalchemistry and modelling software, a 'rolling stone' approach can be taken whereby each processthe molecule passes through (structure optimisation, property calculation etc.) involves fullinformation retention. This is in stark contrast to present solutions where information is lost atevery stage. Enhanced applet-CML communication and the possibility of producing CML from aeditor48 also have great potential. Ultimately, server-side processing of XML/XSL stylesheets willallow powerful stylesheet transformations to be accessible using any web browser.

All source code and examples written for this project are available from the ChiMeraL website(http://www.ch.ic.ac.uk/chimeral/).

Acknowledgements

Many thanks to; Henry Rzepa for direction and advice, Peter Murray-Rust (CML, SVG and forrecoding Jspec), Steve Zara (XSL stylesheets), Tom Grey (Jmol applet), Michelle Osmond (perlconverters) Georgios V. Gkoutos (legacy files) and Eric Schaeffer (FOP).

ChiMeraL - xmldoc_writeup (May 22, 2000) - 36

Page 37: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

References1 Demonstrations, examples and further information can be found in the ChiMeraL website,

(http://www.ch.ic.ac.uk/chimeral/)2 For a formal description of CML version 1.0, see P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci.,1999,

39, 928(http://www.xml-cml.org/)3 H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, Chem. Soc. Revs.,1997, 14 W3C - XHTML and HTML development, (http://www.w3.org/MarkUp/)5 W3C - Scalable Vector Graphics (SVG) overview, (http://www.w3.org/Graphics/SVG/Overview.htm8)6 Adobe SVG browser plugin, (http://www.adobe.com/svg/viewer/install/)7 Apache.com - FOP project (XML convertion to Adobe Acrobat), (http://xml.apache.org/fop/)8 W3C - XML Schema development, (http://www.w3.org/XML/Schema.html)9 Microsoft - XML Schema Developer's Guide,

(http://msdn.microsoft.com/library/default.asp?URL=/library/psdk/xmlsdk/xmlp7k6d.htm)10 H. S. Rzepa, B. J. Whitaker and M. J. Winter, J. Chem. Soc., Chem. Commun.,1994,

(http://www.ch.ic.ac.uk/rzepa/RSC/CC/4_02963A.html)11 O. Casher, G. Chandramohan, M. Hargreaves, C. Leach, P. Murray-Rust, R. Sayle, H. S. Rzepa and B. J.

Whitaker, J. Chem. Soc., Perkin Trans 2,1995, 7(http://www.ch.ic.ac.uk/rzepa/RSC/P2/4_05970K.html)12 P. Murray-Rust, C. Leach and H. S. Rzepa, Abs. Papers. Am . Chem. Soc.,1995, 210, 40-COMP13 MDL information systems, (http://www.mdli.com/cgi/dynamic/welcome.html)14 Rasmol and Chime - Molecular Visualization Freeware, (http://www.umass.edu/microbio/rasmol/)15 Brookhaven Protein Data Bank, (http://www.rcsb.org/pdb/)16 Universal approach to Web-based Chemistry using XML and CML by Peter Murray-Rust, Henry S. Rzepa, Michael

Wright and Stephen Zara (awaiting pulication), (http://www.ch.ic.ac.uk/chimeral/publications/Chemcom.html)17 W3C - XML recommendation, (http://www.w3.org/TR/REC-xml)18 W3C - Annotated XML specification, (http://www.xml.com/axml/axml.html)19 W3C - Mathematical Markup Language development, (http://www.w3.org/Math/)20 XML-CML homesite, (http://www.xml-cml.org/)21 W3C - XML Namespaces recommendation, (http://www.w3.org/TR/REC-xml-names/)22 Adobe Acrobat, (http://www.adobe.com/products/acrobat/main.html)23 W3C - Cascading Stylesheets, (http://www.w3.org/Style/CSS/)24 W3C - Extensible Stylesheet Language (XSL) development, (http://www.w3.org/Style/XSL/)25 Microsoft - XML/XSL development, (http://msdn.microsoft.com/xml/default.asp)26 IBM alphaworks - XML/Java development site, (http://www.alphaworks.ibm.com/)27 Sun - Java Technology Homepage, (http://java.sun.com/)28 W3C - Document Object Model (DOM) Level 1 Specification, (http://www.w3.org/TR/REC-DOM-Level-1/)29 WWW-based Chemical Information System, (http://www.elsevier.com/homepage/saa/eccc3/paper6/)30 ChemFinder.com, (http://chemfinder.camsoft.com/)31 Jmol - Open source Java 3D viewer, (http://www.openscience.org/jmol/)32 The Open Science Project, (http://www.openscience.org/links/Chemistry/)33 Marvin display and editing applet - Chemaxon, (http://www.chemaxon.com/marvin/)34 ACD Labs - Structure Drawing Applet, (http://www.acdlabs.com/products/java/sda/)35 NIST Webbook, (http://webbook.nist.gov/)36 SPC - Spectroscopy Software Solutions, (http://www.galactic.com/)37 Antony N. Davies and Peter Lampen, App. Spec.,1993, 47,

(http://www.isas-dortmund.de/projects/jcamp/protocol.html)38 Peter Lampen, Heinrich Hillig, Antony N. Davis and Michael Linscheid, App. Spec.,1994, 48,

(http://www.isas-dortmund.de/projects/jcamp/protocol.html)39 J. Gasteiger, B. M. P. Hendriks, P. Hoever, C. Jochum and H. Somberg, App. Spec.,1991, 45,

(http://www.isas-dortmund.de/projects/jcamp/protocol.html)40 jspec: A Java Spectral Visualiser for JCAMP-DX and SPC Format Files,

(http://www.ch.ic.ac.uk/java/applets/jspec/spectra/)41 Corel Draw 9 SVG plugin, (http://venus.corel.com/nasapps/DrawSVGDownload/index.html)42 Mayura Draw - editor able to export SVG, (http://www.mayura.com/)43 Csiro SVG Toolkit, (http://sis.cmis.csiro.au/svg/index.html)44 W3C - XLink specifications, (http://www.w3.org/TR/WD-xlink)45 W3C - XPointer specifications, (http://www.w3.org/TR/WD-xptr)46 Perl.com Homesite, (http://www.perl.com/pub)47 , (http://www.gaussian.com/product.htm)48 Open source Java 2D editor, (http://www.ice.mpg.de/~stein/projects/JChemPaint/)49 Laelia: a system for storage, retrieval and querying of XML and CML Documents, (http://www.serf.org/steve/poster/)50 Robert S. McDonald and Paul A. Wilks Jr., App. Spec.,1988, 42,

(http://www.isas-dortmund.de/projects/jcamp/protocol.html)51 Netscape XML, (http://developer.netscape.com/tech/xml/index.html)52 Adobe Illustrator, (http://www.adobe.com/svg/main.html)

ChiMeraL - xmldoc_writeup (May 22, 2000) - 37

Page 38: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

A - List of ChiMeraL Resources

Demonstrations

• Markup and display of molecules, spectra and reactions• Multiple stylesheet selection and the use of data islands• Testing XML/XSL functionality with various browsers• Scaling test - ECTOC database (9 Mb XML file)• Writing a complex XML document and display through IE and .pdf (via FOP)

Schema

• CML schema 0.2 - based on published DTD• CML schema 0.2 (IE version) - includes unique @id and data links• ChiMeraL schema 0.1 - extends CML to include <spectrum>

• Document schema 0.1 - developed to write this report

XSL Stylesheets

• generic.xsl - displays all CML• datasheet.xsl - property information as an HTML table• jme.xsl - Java Molecular Editor applet - editor• jme_format.xsl - JME data string• marvinview.xsl - Marvin applet, 3D viewer mode• marvinedit.xsl - Marvin applet, editor mode• sda.xsl - Structure Drawing Applet - editor• mol.xsl - MDL .mol file format• jmol.xsl - Open Source applet - 3D viewer• xyz.xsl - .xyz file format• jspec.xsl - Java Spectrum Viewer - viewer• jcamp.xsl - JCAMP .dx format• reaction.xsl - stepwise, linear and cycle.4 reactions using Marvin• document.xsl - displaying complex XML document in IE5• document.css - CSS stylesheet for above• doc2fo.xsl - convert complex XML document to formatting object (for use with FOP)

perl converters

• MDL .mol to CML• MDL .mol to JME format• .sd to CML (multiple mol archive)• .xyz to CML• JCAMP .dx and .cs to CML• REACSS .rxn to CML• .rd to CML• .mol, .rxn and .dx 2 CML are all available as an online cgi script

A large number of example CML files are also available, these contain properties, structures,spectra and reactions. Further archives of CML files will be made available as they are converted.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 38

Page 39: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

B - CML Syntax and Notes

The published CML 1.0 DTD2 declares valid elements and attributes but puts few restrictions onhow these elements and attributes are used. As a consequence of this, it is possible to markup aCML object (e.g. a molecule) using a variety of different syntaxes. Whilst this gives great flexibility,it also makes it significantly more difficult to build CML applications. In particularl some of thesesyntaxes can not be easily parsed using stylesheets. I have selected what I feel is the syntax bestsuited to XSL and small molecule markup. This, along with the syntax for chimeral:spectrum andreaction (still experimental) is as follows: (comments on syntax are in green)

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="document.xsl" ?>

<!-- Declares this document as XML and indicate the URL of its stylesheet -->

<document title="Lipids" id="cmldoc_karne_lipids"xmlns="http://www.w3.org/1999/xhtml"xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml">

<!-- <document> isn't part of CML but represents the top element of any XML compliantdocument, this might contain CML, XHTML, MaML etc. - note the XHTML and chimeral namespaces -->

<cml title="Cholesterol" id="cml_karne_cholesterol"xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml">

<!-- The CML namespace points to a copy of the schema, <cml> this may contain any number ofCML 'objects' e.g. <molecule>, <reaction> or <chimeral:spectrum> -->

<molecule title="cholesterol" id="mol_cholesterol"><formula>C27 H46 O</formula>

<!-- Information specific to this molecule is included here as <string>, <float> or <integer>- additional elements can be added as required but note the use of @title to label them.

Alternate names are marked up as a <list> of <string>s -->

<string title="CAS">57-88-5</string><string title="ACX">I1001660</string><string title="RTECS">FZ8400000</string><float title="molecule weight">386.6598</float><float title="melting point" units="degC">148.5</float><float title="boiling point" units="degC">360</float><float title="specific gravity">1.067</float><list title="alternate names">

<string title="name">Cholesterin</string><string title="name">(3beta)-Cholest-5-en-3-ol</string><string title="name">Cholest-5-en-3-ol (3beta)-</string><string title="name">cholest-5-ene-3beta-ol</string><string title="name">3beta-hydroxycholest-5-ene</string>

</list>

<!-- The following is used for small molecular structures- this format is much preferred but rather verbose. <integer builtin="atomId">would normally be that used in the MDL .mol format but in contrast, 'id' must be uniqueover (at least) this document. Additional strings are 'formalCharge' and 'hydrogenCount'.2D structures will use builtin="x2 | y2" but are otherwise the same -->

<list title="atoms"><!-- repeat --><atom id="cholesterol_a_1">

<integer builtin="atomId">1</integer><float builtin="x3" units="A">-1.9901</float><float builtin="y3" units="A">2.1889</float><float builtin="z3" units="A">-1.8776</float><string builtin="elementType">H</string>

</atom><!-- /repeat (74 atoms) -->

</list>

<!-- Large molecular structures - this format is terse but much harder to format/referto in XSL. I have chosen not to use it -->

<atomArray id="methanol"><stringArray title="label">a1 a2 a3 a4 a5 a6</stringArray><stringArray builtin="elementType">C O H H H H</stringArray><floatArray builtin="x3">-0.748 ..</floatArray><floatArray builtin="y3">-0.015 ..</floatArray><floatArray builtin="z3">0.024 ..</floatArray><integerArray builtin="formalCharge"></integerArray>

</atomArray>

<!-- A <list> of <bond>s is used for small molecules- large ones will probably ignore bonds and calculate then directly -->

<list title="bonds">

ChiMeraL - xmldoc_writeup (May 22, 2000) - 39

Page 40: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

<!-- repeat --><bond id="cholesterol_b_1">

<integer title="bondId">1</integer><integer builtin="atomRef">2</integer><integer builtin="atomRef">1</integer><integer builtin="order" convention="MDL">1</integer>

</bond><!-- /repeat (77 bonds) -->

</list></molecule>

<!-- Elements in spectra tend to match the LDHs in the JCAMP format. The namespace chimeral: isvery important as spectrum isn't found in CML 1.0 -->

<chimeral:spectrum title="Cholesterol" id="spect_cholesterol_ms_1"convention="JCAMP-DX=4.24">

<string title="datatype">MASS SPECTRUM</string><string title="EPA">67286</string><string title="origin">T.IIDA NIHON UNIVERSITY, KORIYAMA,FUKUSHIMA-KEN, JAPAN</string>

<string title="owner">NIST Mass Spectrometry Data Center</string><string title="spectrometer">LKB 9000</string>

<!-- The following information is required for the rendering of spectra in Jspec -->

<float title="xunits">M/Z</float><float title="yunits">RELATIVE ABUNDANCE</float><float title="firstx" convention="M/Z">18</float><float title="lastx" convention="M/Z">387</float><float title="deltax" convention="M/Z"></float><float title="xfactor">1</float><float title="firsty" convention="RELATIVE ABUNDANCE">700</float><float title="miny" convention="RELATIVE ABUNDANCE">100</float><float title="maxy" convention="RELATIVE ABUNDANCE">9999</float><float title="yfactor">1</float><float title="npoints">169</float>

<!-- One of the following syntaxes is then used, depending on the type of spectrum -->

<!-- A: simplest and prefered spectra format. Try and avoid (X++(Y..Y)) -->

<list title="xypairs" convention="(XY..XY)"><!-- repeat --><coordinate2 id="cholesterol_ms_c_1">18, 700</coordinate2><!-- /repeat (169 data pairs) -->

</list>

<!-- B: alternate format for peak tables rather then data -->

<list title="peak table" convention="(XY..XY)"><!-- repeat --><coordinate2 id="mol_s_1">X, Y</coordinate2><!-- /repeat -->

</list>

<!-- C: convention often used for NMR peak tables is (XYM) -->

<list title="peak table" convention="(XYM)"><!-- repeat --><coordinate3 id="mol_s_1">X, Y, M</coordinate3><!-- /repeat -->

</list></chimeral:spectrum>

<!-- Reactions are made up of a series of lists, each list containing a number of links @hrefto molecule @id -->

<reaction title="Reactions" id="simple_rxn_1"convention="stepwise | linear | cycle.4 | ..">

<!-- Linear markup is much preferred since it's the simplest, others provided for formattingpurposes. The following three elements can be used either at the reaction level or ateach reaction step -->

<string title="description">Diels-Alder cycloaddition</string><float title="yield" units="%">88</float><string title="notes">example</string>

<!-- Stepwise (x1 > y1, x2 > y2); use as many reactants, reagents and productsas needed for each step, reactions steps will be displayed separately. Additional links tocatalysts, intermediates, transition states etc. can be used as required -->

<!-- repeat --><list title="reactionStep" id="simple_s_1">

<link title="reactant" href="mol_x"/><link title= "reagent" href= "mol_r">

<integer title="index">1</integer><string title="solvent">Acetonitrile</string><string title="temperature" units="degC">100</string><string title="duration" units="hours">3</string><string title= "notes">reflux</string>

</link><link title="reagent">

<integer title="index">2</integer><string title="notes">workup</string>

</link>

ChiMeraL - xmldoc_writeup (May 22, 2000) - 40

Page 41: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

<link title="product" href="mol_y"/></list><!-- /repeat -->

<!-- Linear (x > y > z); reactant refers to the first reactants, productrefers to the final product; intermediates use linearReactant and linearProduct -->

<list title= "linearstep" id="step_1"><link title="reactant" href="mol_1"/><link title="reagent" href="mol_r1">

<!-- .. --></link><link title="linearProduct" href="mol_2"/>

</list><list title= "linearstep" id="step_2">

<link title="linearReactant" href="mol_2"/><link title="reagent" href="mol_r2">

<!-- .. --></link><link title="linearProduct" href="mol_3"/>

</list><list title= "linearstep" id="step_3">

<link title="linearReactant" href="mol_3"/><link title="reagent" href="mol_r3">

<!-- .. --></link><link title="Product" href="mol_4"/>

</list>

<!-- Catalytic cycle - much more complex (.. > x > y > z > ..) reactant and productrefer to substances 'in' and 'out' of the cycle in each step, use cycleReactant andcycleProduct for 'within' the cycle. Markup should be cyclic(final cycleProduct == first cycleReactant) -->

<list title="reactionStep" id="step_1"><link title="cycleReactant" href="mol_1" id="cycle_lk_1"/><link title="reactant" href="mol_3" id="cycle_lk_2"/>

<link title="cycleProduct" href="mol_7" id="cycle_lk_3"/></list><list title="reactionStep" id="step_2">

<link title="cycleReactant" href="mol_7" id="cycle_lk_4"/><link title="cycleProduct" href="mol_6" id="cycle_lk_5"/><link title="product" href="mol_8" id="cycle_lk_6"/>

</list><list title="reactionStep" id="step_3">

<link title="cycleReactant" href="mol_6" id="cycle_lk_7"/><link title="reactant" href="mol_5" id="cycle_lk_8"/><link title="cycleProduct" href="mol_4" id="cycle_lk_9"/>

</list><list title="reactionStep" id="step_4">

<link title="cycleReactant" href="mol_4" id="cycle_lk_10"/><link title="cycleProduct" href="mol_1" id="cycle_lk_11"/><link title="product" href="mol_2" id="cycle_lk_12"/>

</list>

<!-- list title="atomMap" could be placed within each step -->

</reaction></document>

ChiMeraL - xmldoc_writeup (May 22, 2000) - 41

Page 42: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

C - CML Schema (IE 0.2) Notes

This version of the schema is based directly on the CML 1.0 DTD but includes datatypedeclarations for use in IE 5.x. These datatypes allow the use of @id and @href for intra-documentlinking. A platform independent version (plus schemas for chimeral: and karne:) are available fromthe ChiMeraL site.1 (comments on syntax are in green)

<?xml version="1.0"?><Schema name="cml_dev_karne"

xmlns="urn:schemas-microsoft-com:xml-data"xmlns:dt="urn:schemas-microsoft-com:datatypes">

<description>CML development - Version 0.2 - 7/4/00

This document is the first draft of an XML Schema compatible with CMLV1.0 published in JCICS... In converting the schema we have suggestedsome closed and some open content models. As CML develops it seems likelythat there will be advantage in opening some of the models (e.g. angle),perhaps for annotations and ancillary information. Readers should notethat the XML Schema activity is still at draft stage and that this schemamay be revised in the future for compatibility.

This schema is intended for use with IE5 - a platform independent versionis available. XML documents can be validated against this schema by addingxmlns="x-schema:URL" within the cml element.

Please see cml_schema_ie_02.html for further comments and explanations.

Peter Murray Rust, Henry Rzepa, Michael Wright

Comments to Michael Wright - [email protected]</description>

<!-- ********** Attribute Types ********** --><!-- *** common *** -->

<!-- It is expected that these attributes will be found on almost all elements.@title is used for display and general labelling and @id indicates a documentunique identity string for that element. This identity can then be used toreference that element (and hence the object it represents) by the use of@href. Care must be taken that @id is unique. The use (for example) of<atom id="1"> will cause trouble in a document with morethan one molecule and should be avoided. Note that @title on data elements(string/integer/float) can be used to markup data for which there is noexplicit CML element (e.g. <string title="CAS">58-08-2</string>)and @convention should be declared if this is not obvious. @builtin impliesan element with a meaning predefined in the DTD -->

<AttributeType name="title" required="no"/><AttributeType name="id" required="no" dt:type="id"/><AttributeType name="convention" required="no"/><AttributeType name="builtin" required="no" dt:type="enumeration"

dt:values="x2 y2 xy2 x3 y3 z3 xyz3 xFract yFract zFract xyzFract elementTypeatomId isotope occupancy hydrogenCount atomParity residueType residueIdformalCharge atomRef atomRefs length order stereo acell bcell ccell alphabeta gamma z spacegroup" />

<!-- *** linkers *** -->

<!-- Designed to allow the linking of an element to another via @id. For example<link href="mol_543"> indicates a link to any element with @id="mol_543".@unitsRef and @dictRef are intended for future use -->

<AttributeType name="href" required="no" dt:type="idrefs"/><AttributeType name="dictRef" required="no" dt:type="idrefs"/><AttributeType name="unitsRef" required="no" dt:type="idrefs"/><AttributeType name="atomRef" required="no" dt:type="idref"/><AttributeType name="atomRefs" required="no" dt:type="idrefs"/>

<!-- *** quantifiers/constraints *** -->

<!-- Various constraints on the values of data elements, only 'units' is in common use -->

<AttributeType name="count" required="no"/><AttributeType name="size" required="no"/><AttributeType name="rows" required="no"/><AttributeType name="columns" required="no"/><AttributeType name="min" required="no"/><AttributeType name="max" required="no"/><AttributeType name="units" required="no"/>

<!-- ********** Element Types ************ --><!-- *** data *** -->

<!-- These elements are intended to contain only text string and attributes - they maynot contain other elements - and are hence 'closed' (no additional markup can beadded beyond the schema). If additional data holders are required, they should be of

ChiMeraL - xmldoc_writeup (May 22, 2000) - 42

Page 43: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

the form <string title="CAS">58-08-2</string>where the title value is used to identify the elements contents. For full list ofattributes, please see the schema -->

<ElementType name="string" content="textOnly" model="closed"><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="float" content="textOnly" model="closed"><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="min"/><attribute type="max"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="integer" content="textOnly" model="closed"><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="min"/><attribute type="max"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<!-- The arrays should be used only when large amounts of data needto be stored. In all cases explicit markup of each unit of data is preferred since thisis much easier to manipulate with a stylesheet. In some cases this may not be possible(e.g. large molecules with 200+ atoms) and arrays can be used. These would probably needto be parsed by dedicated chemical tools. floatMatrix is intended for computationalchemistry -->

<ElementType name="stringArray" content="textOnly" model="closed"><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="size"/><attribute type="min"/><attribute type="max"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="floatArray" content="textOnly" model="closed"><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="size"/><attribute type="min"/><attribute type="max"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="integerArray" content="textOnly" model="closed"><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="size"/><attribute type="min"/><attribute type="max"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="floatMatrix" content="textOnly" model="closed"><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="rows"/><attribute type="columns"/><attribute type="min"/><attribute type="max"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<!-- Coordinates consist of 2 or 3 comma separated numbers. They shouldbe used for logically linked number groups - e.g. a data point in a spectrum -->

<ElementType name="coordinate2" content="textOnly" model="closed"><!-- use for data pairs (e.g spectra xy) -->

ChiMeraL - xmldoc_writeup (May 22, 2000) - 43

Page 44: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

<attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="coordinate3" content="textOnly" model="closed"><!-- use for data triplets (e.g spectra x, y, m) --><attribute type="id"/><attribute type="builtin"/><attribute type="title"/><attribute type="convention"/><attribute type="units"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<!-- Angle and torsion haven't yet been developed, hence they have been left open -->

<ElementType name="angle" content="textOnly" model="open"><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="atomRefs"/><attribute type="min"/><attribute type="max"/><attribute type="units" default="deg"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="torsion" content="textOnly" model="open"><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="atomRefs"/><attribute type="min"/><attribute type="max"/><attribute type="units" default="deg"/><attribute type="unitsRef"/><attribute type="dictRef"/>

</ElementType>

<!-- *** mixed *** -->

<!-- Link is used as a 'holder' element - e.g. for 'href' or 'unitsRef'and indicates a logical link to another CML object -->

<ElementType name="link" content="mixed" model="open"><attribute type="id"/><attribute type="title"/><attribute type="href"/>

</ElementType>

<ElementType name="formula" content="mixed" model="open"><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="count"/><attribute type="dictRef"/>

</ElementType>

<!-- *** structural *** -->

<!-- These elements define the tree structure of the CML document and are not expected tocontain text strings - only other elements. Since they are 'open', additional sub elements- whether CML or not - can be added with correct namespacing. Attributes and elementsfor these elements have been declared in the schema but these are only suggestions -the DTD allows anything. Many, like electron and the crystalagraphic markup have yetto be developed -->

<ElementType name="atom" content="eltOnly" model="open" order="many"><attribute type="id"/><attribute type="title"/><attribute type="convention" default="mol"/><attribute type="count"/><attribute type="dictRef"/><element type="float" minOccurs="0" maxOccurs="*"/><element type="string" minOccurs="0" maxOccurs="*"/><element type="integer" minOccurs="0" maxOccurs="*"/><element type="link" minOccurs="0" maxOccurs="*"/>

</ElementType>

<ElementType name="bond" content="eltOnly" model="open" order="many"><attribute type="id"/><attribute type="convention" default="mol"/><attribute type="atomRef"/><attribute type="atomRefs"/><!-- avoid using these and use a subelement with builtin="atomRef" --><element type="integer" minOccurs="0" maxOccurs="*"/><element type="float" minOccurs="0" maxOccurs="*"/><element type="string" minOccurs="0" maxOccurs="*"/><element type="link" minOccurs="0" maxOccurs="*"/><element type="angle" minOccurs="0" maxOccurs="*"/><element type="torsion" minOccurs="0" maxOccurs="*"/>

ChiMeraL - xmldoc_writeup (May 22, 2000) - 44

Page 45: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

</ElementType>

<ElementType name="list" content="eltOnly" model="open" order="many"><attribute type="id"/><attribute type="title"/><!-- required: atoms/bonds/xypairs/peak table/atom map--><attribute type="convention"/><!-- should be included for spectra: (XY..XY), (XYM) --><element type="string" minOccurs="0" maxOccurs="*"/><element type="integer" minOccurs="0" maxOccurs="*"/><element type="float" minOccurs="0" maxOccurs="*"/><element type="floatArray" minOccurs="0" maxOccurs="*"/><element type="stringArray" minOccurs="0" maxOccurs="*"/><element type="integerArray" minOccurs="0" maxOccurs="*"/><element type="floatMatrix" minOccurs="0" maxOccurs="*"/><element type="angle" minOccurs="0" maxOccurs="*"/><element type="torsion" minOccurs="0" maxOccurs="*"/><element type="coordinate2" minOccurs="0" maxOccurs="*"/><element type="coordinate3" minOccurs="0" maxOccurs="*"/><element type="link" minOccurs="0" maxOccurs="*"/><element type="formula" minOccurs="0" maxOccurs="*"/><element type="atom" minOccurs="0" maxOccurs="*"/><element type="bond" minOccurs="0" maxOccurs="*"/>

</ElementType>

<ElementType name="atomArray" content="eltOnly" model="open" order="many"><!-- only use for large molecules --><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="dictRef"/><element type="stringArray" minOccurs="0" maxOccurs="*"/><element type="floatArray" minOccurs="0" maxOccurs="*"/><element type="integerArray" minOccurs="0" maxOccurs="*"/><element type="link" minOccurs="0" maxOccurs="*"/>

</ElementType>

<ElementType name="bondArray" content="eltOnly" model="open" order="many"><!-- only use for large molecules --><attribute type="id"/><attribute type="convention"/><attribute type="dictRef"/><element type="stringArray" minOccurs="0" maxOccurs="*"/><element type="floatArray" minOccurs="0" maxOccurs="*"/><element type="integerArray" minOccurs="0" maxOccurs="*"/><element type="link" minOccurs="0" maxOccurs="*"/>

</ElementType>

<ElementType name="electron" content="eltOnly" model="open"><attribute type="id"/><attribute type="count"/><attribute type="convention"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="crystal" content="eltOnly" model="open"><attribute type="id"/><attribute type="title" default="crystal"/><attribute type="convention"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="sequence" content="eltOnly" model="open"><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="feature" content="eltOnly" model="open"><attribute type="id"/><attribute type="title"/><attribute type="convention"/><attribute type="dictRef"/>

</ElementType>

<ElementType name="molecule" content="eltOnly" model="open" order="many"><attribute type="id"/><attribute type="title" default="molecule"/><attribute type="convention" default="mol"/><attribute type="dictRef"/><element type="formula" minOccurs="0" maxOccurs="*"/><element type="list" minOccurs="0" maxOccurs="*"/><element type="atomArray" minOccurs="0" maxOccurs="*"/><element type="bondArray" minOccurs="0" maxOccurs="*"/><element type="string" minOccurs="0" maxOccurs="*"/><element type="float" minOccurs="0" maxOccurs="*"/><element type="integer" minOccurs="0" maxOccurs="*"/><element type="link" minOccurs="0" maxOccurs="*"/>

</ElementType>

<ElementType name="reaction" content="eltOnly" model="open" order="many"><attribute type="id"/><attribute type="title" default="reaction"/><attribute type="convention"/><attribute type="dictRef"/><element type="string" minOccurs="0" maxOccurs="*"/>

ChiMeraL - xmldoc_writeup (May 22, 2000) - 45

Page 46: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

<element type="float" minOccurs="0" maxOccurs="*"/><element type="integer" minOccurs="0" maxOccurs="*"/><element type="list" minOccurs="0" maxOccurs="*"/>

</ElementType>

<!-- ********** root ********** -->

<!--This is the general 'holder' for all CML documents. Normally thiswould then be embedded within an XML <document> Namespaces/schemashould be declared within this element or the document root-->

<ElementType name="cml" content="mixed" model="open" order="many"><attribute type="id"/><attribute type="title" default="cml document"/><attribute type="convention"/><attribute type="dictRef"/><element type="molecule" minOccurs="0" maxOccurs="*"/><element type="crystal" minOccurs="0" maxOccurs="*"/><element type="reaction" minOccurs="0" maxOccurs="*"/>

</ElementType></Schema>

ChiMeraL - xmldoc_writeup (May 22, 2000) - 46

Page 47: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

D - Glossary• Applet (Java)

A small program designed to run embedded within a web page (either HTML or XML). The applet isdownloaded from the same site as the web page and does not require installation (unlike a plugin). Forsecurity reasons, applets has restricted access to the local hard drive.

• ASCIIRefers to the standard set of text characters used for all computers. An ASCII file can be opened in anormal text editor (in contrast to a binary file which can not). XML/HTML and most of the legacyformats use ASCII.

• AttributeUsed to give more specific information within an HTML/XML tag (hence <FONT SIZE="+1"> hasattribute 'SIZE'), @href is shorthand for 'attribute called href'

• CML (Chemical Markup Language)XML language optimised for marking up chemistry.

• CSS (Cascading stylesheet)Stylesheet primarily designed for HTML, allows tag properties to be changed (over-riding the parserdefaults).

• DTDOlder standard for describing XML languages, uses a unique syntax.

• DOM (Document Object Model)'Tree' model of a document, allows the parser scripting access to all 'objects' within the tree. Underdevelopment by the W3C

• ElementXML equivalent to HTML tags. Refers to the tag pair, any attributes and their contents

• HTML(hypertext markup language)Text formatting and display language - standard electronic format on the web

• JavaScriptA scripting language (which has little to do with Java, although they can communicate) used to addinteractivity to web pages. Javascript functions need to be triggered in some way (normally using aFORM> button).

• MarkupProcedure of labelling and organising information using <tag></tag> pairs and attributes. Basis formost formatting languages.

• NamespacesSystem for assigning elements to different XML languages. Very important if a document is to bevalidated.

• ParserProgram able to read XML, comprehend its structure and build a DOM and/or use a stylesheet todisplay it. This project uses Internet Explorer 5.x

• PluginSmall, platform dependent program that 'plugs into' a web browser and allows it to display additionalfile formats (e.g. .mol, .svg, .vrml). Needs to be installed by the user.

• PerlPowerful text search and replace language (amongst other things) and used to build all thelegacy2CML converters in this project. Also used for server-side cgi scripts.

• SchemaNewer standard for describing XML languages, uses XML rules and syntax.

• SVG (Scalable Vector Graphics)XML language optimised for describing vector ('line' as compared to bitmap/gif/jpg) images.

• TagBasic unit of markup, delimited by square brackets: <tag>

• Well formedA well formed document is one that complies to XML rules and syntax.

• ValidA valid document is well formed and conforms to a language description given in a DTD or schema.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 47

Page 48: Development of Chemical Markup Language (CML) as a System …€¦ · variety of purposes (e.g. screen display or printing7) by single stylesheet transformations. An XML schema8 9

• XHTML(extensible hypertext markup language)HTML tags corrected to follow XML rules

• XML(extensible markup language)A series of rules for writing other 'XML compliant' languages

• XSL(extensible stylesheet language)Stylesheet language, includes powerful scripting and querying features.

ChiMeraL - xmldoc_writeup (May 22, 2000) - 48