Chemical Vocabularies and Ontologies for Bioinformatics
Kirill Degtyarenko
European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambs CB10 1SD, United Kingdom. E-mail: kiri l [email protected]
Proceedings of the 2003 International Chemical Information Conference, Nîmes,
France, 19-22 October 2003.
The diversity of objects and concepts in biological chemistry can be reflected in the number
of ways used to describe an ‘elementary’ biochemical event such as an enzymatic reaction. The
terminology used in publications or biological databases is often a mixture of terms borrowed
from widely different or even contradictory classifications. The ever-growing knowledge
cannot be processed meaningfully (e.g. efficiently and correctly referenced in biological
databases) without organisation, from controlled vocabularies to dictionaries and thesauri to
taxonomies and formal ontologies. An ontology of a domain of knowledge is a controlled
vocabulary of terms with defined logical relationships to each other. The unique types of
relationships between terms have to be included in biochemical ontologies. The relevance of
chemical thesauri and ontologies to bioinformatics is illustrated by current resources and
projects at the European Bioinformatics Institute, such as IntEnz (Enzyme Nomenclature),
COMe (the bioinorganic motif database) and the IUPHAR Receptor Database.
Introduction
An ‘ontology’ is a formal definition of the concepts (such as entities and relationships) of a
given area of knowledge, described in a standardised form [ 1]. It can be organised as a
structured vocabulary in the form of a directed acyclic graph or a network in which each term
may be a ‘child’ of one or more ‘parents’ [ 2].
Naturally, sequences form the core data of biological sequence databases such as EMBL [ 3]
and Swiss-Prot [ 4], while 3-D coordinates form the core data of structural databases such as
PDB [ 5]. The other data typically found in a database entry, such as the name of the gene or
protein, organism, or literature references, are called annotation. The quality of annotation
varies from entry to entry and from database to database.
As a rule, the annotation is present as free text. Since proteins and nucleic acids are
biochemical entities, chemical and biochemical terminology makes up a large proportion of this
free text. I will try to present some challenges and achievements in the standardisation of
chemical language in macromolecular databases.
Vocabularies and Ontologies for Biological Databases
The free text annotation is easy to read but difficult to standardise. The bioinformatics
community is spending much effort towards making it ‘less free’ with the help of controlled
vocabularies. In contrast to sequence and structural databases, the core data of vocabularies
consist of terms. One of the first controlled vocabularies for biology was the NCBI
Taxonomy [ 6]. Every natural sequence deposited to biosequence databases is supposed to
originate from an organism with a known Linnaean name:
“The NCBI Taxonomy database only contains the names for the organisms whose
sequences have been made public by the collaborating sequence database EMBL,
DDBJ, and NCBI/GenBank or by one of the other public databases that are indexed in
Entrez (including the Swiss-Prot, PIR and PRF protein sequence databases and the
PDB structure database). Currently sequence data are available for only about
100,000 of the about 2–10 million species supposed to exist on earth” [ 6]
Taxonomy is a strict hierarchy of parent–child relationships known as IsA (‘is kind of’). For
example, abbreviated NCBI taxonomy of Homo sapiens can be represented as
root
 %Eukaryota
  %Metazoa
   %Chordata
    %Craniata
     %Vertebrata
      %Euteleostomi
       %Mammalia
        %Eutheria
         %Primates
          %Catarrhini
           %Hominidae
            %Homo
             %Homo sapiens
The Gene Ontology Consortium [ 2] develops controlled vocabularies which are in wide use
by the bioinformatics community. The Gene Ontology (GO) comprises three domains:
‘molecular function’, ‘biological process’ and ‘cellular component’. In every domain,
the GO terms are organised as a directed acyclic graph (DAG), which differs from a taxonomy
in that a child term can have many parent terms. GO uses two generic parent–child
relationships, IsA and IsPartOf. Throughout the text, I will use the symbols for these
relationships as specified in the GO File Format Guide [ 7]:
% = IsA
< = IsPartOf
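As an illustration of how these two symbols can be read mechanically, the following Python sketch converts an indented outline into (child, relationship, parent) triples. It assumes a simplified layout (one term per line, one space of indentation per level, the leading symbol giving the relationship to the nearest shallower line above) and is not a parser for the actual GO file format. The function name `parse_outline` and the sample outline are illustrative only.

```python
SYMBOLS = {"%": "IsA", "<": "IsPartOf"}

# Illustrative outline; one space of indentation per level (an assumption).
OUTLINE = "\n".join([
    "%mixture",
    " %homogeneous mixture",
    "  %solution",
    "   <solute",
    "   <solvent",
])


def parse_outline(text):
    """Return (child, relationship, parent) triples from an indented outline."""
    edges = []
    stack = []  # stack[d] holds the most recent term seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip(" "))
        body = line.strip()
        symbol, term = body[0], body[1:].strip()
        if symbol not in SYMBOLS:
            raise ValueError(f"unknown relationship symbol in line: {line!r}")
        if depth > len(stack):
            raise ValueError(f"indentation skips a level in line: {line!r}")
        del stack[depth:]  # drop terms that are not ancestors of this line
        if stack:
            edges.append((term, SYMBOLS[symbol], stack[-1]))
        stack.append(term)
    return edges


for child, relationship, parent in parse_outline(OUTLINE):
    print(f"{child} {relationship} {parent}")
```

A term that has several parents would simply appear in more than one place in such an outline, yielding one triple per parent.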
GO is a part of the wider initiative known as OBO (Open Biology Ontologies). A list of
freely available ontologies that are relevant to genomics and proteomics and are structured
similarly to GO can be found at the OBO website [ 8].
Vocabularies and Ontologies for Biochemical Compounds
What are the ‘biochemical compounds’?
Any chemical compound naturally occurring in living organisms can be called a
‘biochemical compound’. Biochemical compounds can be classified according to their
structure, physico-chemical properties or biological function. Most biologists conveniently
divide all biochemical compounds into ‘biopolymers’, which consist of macromolecules, and
the ‘other compounds’, which consist of ‘small’ molecules (see, for instance, BioCyc
Taxonomy of Compounds [ 9]). Alternatively, biochemical compounds can be defined as
consisting of “molecules not directly encoded by the genome (thus excluding nucleic acids,
proteins and peptides derived from proteins by cleavage), that are either the products of
nature or are synthetic products used (either purposively or accidentally) to intervene in the
processes of living organisms” [M. Ashburner]. This second definition reflects the ‘traditional’
bioinformatics view in the sense that information-rich macromolecules live in their databases
(EMBL, Swiss-Prot) relatively independently from all the other molecules, whether small or
large. As we will see, these two worlds significantly overlap.
Structure
The intuitive classification of all molecules relevant to biochemistry into macromolecules
and ‘small’ molecules is not as straightforward as it looks. For example, globular proteins and
tRNAs are very compact, discrete molecules. In eukaryotes, genomic DNA exists as a
supramolecular DNA–protein complex. A gene is a part of a large DNA molecule, while the
corresponding mRNA is a discrete molecule. Pyrroloquinoline quinone (coenzyme PQQ), a
typical ‘small’ molecule, is synthesised in vivo from a 24-amino-acid polypeptide precursor
[ 10].
Polypeptides and nucleic acids are often referred to as ‘biopolymers’. It has to be noted that at
least in two aspects they are fundamentally different from other biopolymers (such as
polysaccharides, isoprenoids, lignins) as well as other natural and synthetic polymers. First,
while most polymers consist of chains of variable length, it is perfectly possible to purify a
chemically homogeneous protein or nucleic acid. This feature makes it possible to store the
amino acid and nucleic sequences as ultimate identifiers in the databases. Second, while most
polymers consist of low complexity (low information content) macromolecules, natural
proteins and nucleic acids are, as a rule, high complexity (high information content)
macromolecules. (Genes often occupy only a small fraction of eukaryotic DNA while the rest
is low complexity non-coding regions.) This feature makes it necessary to store the amino
acid and nucleic sequences in the databases. In contrast, for most polymers it is sufficient to
store a constitutional unit [ 11].
Bioinformatics routinely deals with one-dimensional (1-D) objects like protein and nucleic
acid sequences, and three-dimensional (3-D) objects such as crystal and solution structures in
PDB. Note that here we talk about mathematical objects, not real proteins or nucleic acids
which exist in three dimensions. For example, almost everybody is ‘familiar’ with the
double-helix model of DNA. Nucleotides of one strand are supposed to form Watson-Crick pairs
with nucleotides of the complementary strand, so the sequence of only one strand is stored in
the databases. Since sequence is a very convenient way of representing a biomacromolecule,
it is easy to forget that other dimensionalities exist at all!
By definition, a macromolecule consists of many monomers. In 1-D representation, each
monomer is ‘collapsed’ to a symbol (dimensionality is close to 0). Dimensionality of cyclic
DNA, such as bacterial chromosomes and plasmids, genomes of mitochondria and
chloroplasts, is (negligibly) less than 1. In the biological databases, the corresponding
sequences are stored as 1-D with an arbitrarily set start nucleotide (often it is the DNA
replication origin site). In case of cyclic polypeptides such as cyclotides [ 12], the
dimensionality is less than 1 but the precursor polypeptides are linear, so the starting amino
acid residue is not arbitrarily chosen. Dimensionality of branched macromolecules such as
starch is more than 1. Similarly, dimensionality of a protein containing covalent links
between two polypeptides (each 1-D) also should be considered >1, while macromolecules
such as lignin form networks and require at least 2-D representation.
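The arbitrary start position of a stored cyclic sequence has a practical consequence: two different 1-D strings may describe the same cyclic molecule. A minimal Python sketch of how such strings could be compared is given below. It treats sequences as plain strings, ignores the complementary strand, and uses a simple quadratic-time canonical form; the function names are illustrative and do not correspond to any database tool.

```python
def same_circular_sequence(a, b):
    """True if linear strings a and b are rotations of one circular sequence."""
    return len(a) == len(b) and b in a + a


def canonical_rotation(seq):
    """Return the lexicographically smallest rotation as a start-independent form."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


# The same hypothetical cyclic sequence written from two different start points.
print(same_circular_sequence("GATTACA", "TACAGAT"))
print(canonical_rotation("GATTACA") == canonical_rotation("TACAGAT"))
```

For cyclic polypeptides with a linear precursor, no such normalisation is needed, because the precursor fixes the starting residue.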
Although ‘small molecules’ appear to be less complex entities than macromolecules, their
naming, citation and representation in databases is not a trivial task. Most genetically
encoded biomacromolecules are easily represented as 1-D strings, while the 2-D sketch
remains the most adequate portrait of a ‘small molecule’. Several algorithms of linear
notation have been developed, e.g. SMILES [ 13]. However, linear notation, like any other
structural core data, cannot really be used in speech (and should not be used in free text). The
good annotation practice for biological databases is to use either consistent and widely
recognised terminology or unique identifiers (to look up the molecule of interest from a
dedicated database). Ideally, scientists should use terminology that is both pronounceable
and meaningful. IUPAC systematic names are meaningful but often not pronounceable.
Therefore, biologists and chemists alike prefer to use common names. The use of common
names is not a problem as long as there is no confusion regarding the exact meaning of a
term. Thus, a viable solution for the bioinformatician is to use a definitive controlled
vocabulary of biochemical compounds, which contains both systematic and common names.
Interestingly, although Enzyme Nomenclature [ 14] contains terminology for enzymatic
reactions approved by the Nomenclature Committee of the International Union of Biochemistry
and Molecular Biology (NC-IUBMB), no definitive terminology for the very compounds
involved in these reactions was published by NC-IUBMB, with the Glossary of Chemical
Names being the only exception [ 15]. Whatever the reason, the nontrivial task of deriving these
terms from Enzyme Nomenclature was left to others!
There are several (bio)chemical compound databases in the public domain. COMPOUND is a
part of the LIGAND database [ 16]. COMPOUND includes all the compounds (i.e. substrates,
products, cofactors, inhibitors and activators) derived from Enzyme Nomenclature. Every
COMPOUND entry minimally has a unique identifier and a name, while many ‘small
molecule’ entries also have 2-D diagrams. The compounds are fairly heterogeneous:
COMPOUND
1. ‘Small molecule’
• individual compound, e.g. C00556 Benzyl alcohol
• a class of compounds, e.g. C00069 Alcohol
2. Macromolecule
• As a whole individual molecule, e.g. C02396 Cytochrome b-562
• As a class of compounds, e.g. C00420 Polysaccharide
• As a site, e.g.
C02959 Apurinic site in DNA
C04764 C-terminal glycine residue of the polypeptide ubiquitin
3. A class of ‘molecules’ classified by chemical function
• C00030 Reduced acceptor
• C11349 Amino group donor
Glossary of Chemical Names [ 15] originally appeared in Enzyme Nomenclature [ 14].
Additional Glossary entries have been added from subsequent Supplements. In contrast to
COMPOUND, Glossary represents a small subset of (less common) chemical terms.
Contents-wise, it is also very heterogeneous as examples show:
IUBMB Glossary of Chemical Names
1. Thesaurus (gives a broader term):
12-dehydrotetracycline = an antibiotic
2. Common–(semi)systematic (bilingual) dictionary:
cis-aconitate = (Z)-prop-1-ene-1,2,3-tricarboxylate
3. Common–formula (bilingual) dictionary:
0-D: superoxide = O2•-
1-D: amastatin = Leu[1ψ2,CHOHCONH]ValValAsp
2-D: quinine = an alkaloid (structure)
4. Common name dictionary (with definitions):
fusarinine C = a cyclic trihydroxamic acid formed by
esterification of 3 molecules of fusarinine
NIST Chemistry WebBook [ 17], developed at the National Institute of Standards and
Technology, contains data on small organic and some inorganic compounds. Apart from
name(s), formulae, CAS registry numbers, and structure, NIST Chemistry WebBook
contains additional data on physico-chemical properties of the species.
All the resources mentioned so far have no hierarchical structure in terms of searching for
classes of compounds, e.g. ‘alcohol’. In COMPOUND, the terms for compound classes are
available but there are no links to individual compounds from that class. However, for most
biochemical compounds, whether ‘small’ or macromolecules, the structural classification
can be based on existing chemical nomenclature systems, such as organic, inorganic and
macromolecular, or even created automatically on the basis of substructure search.
The promisingly named Chemical Ontology, developed by M. Ashburner and P. Jaiswal
[ 18], is an interesting prototype. Structurally it is similar to GO and organised as a DAG with
IsA as the only kind of relationship used. Data sources include the chemical names as currently
used in GO and the external sources such as BioCyc [ 9], COMPOUND, ENZYME [ 19] and
UM-BBD [ 20].
All terms for compounds are classified according to either chemical nature
(grouped_by_chemistry) or biological function (grouped_by_functions), or yet to be
classified (unclassifieds). A classified compound may belong to more than one structural
and more than one functional class, therefore the different classification approaches may be
reconciled. An alternative ontology for molecular matter was suggested by the author in an
e-mail exchange with Michael Ashburner:
%molecular matter %grouped_by_state_of_matter %plasma %gas %liquid %solid %heterogeneous mixture %grouped_by_composition %compound ; synonym:chemical substance <formula unit <molecular entity %atom <electron <nucleus <proton <neutron %element %atomic ion %atomic radical %molecule <group %molecular ion %molecular radical %crystal molecule %ionic crystal molecule %metallic crystal molecule %covalent molecule %discrete covalent molecule %coordination molecule %giant covalent molecule %ion %atomic ion %molecular ion %radical %atomic radical %molecular radical %mixture <compound %heterogeneous mixture %colloidal suspension %liquid aerosol %solid aerosol %foam %emulsion %sol %solid foam %gel %solid sol %homogeneous mixture %solution <solute <solvent %solid solution
However, the relationships between chemical entities go beyond IsA. Importantly, the
distinction has to be made between groups and molecules. For example, a constitutional unit
(i.e. group) IsPartOf a macromolecule but a monomer molecule is not, although it may be
viewed as a precursor of a macromolecule. In organic chemistry, it is usual to consider a
‘parent hydride’ of a specific compound for naming purposes [ 21]. This parent hydride,
being a specific compound itself, can be considered to be both a member and a parent of the
class of compounds.
Physico-chemical methods and properties
It is not just chemical compound terminology that lacks standardisation in biological
databases. For example, Swiss-Prot entries include literature citations dealing with
characterisation of proteins, where the RP field may include the methods used. However, the
terminology is not consistent, as the example for circular dichroism shows:
CD STUDIES
CIRCULAR DICHROISM
CIRCULAR DICHROISM ANALYSIS
CIRCULAR DICHROISM SPECTROSCOPY
MAGNETIC CIRCULAR DICHROISM
STRUCTURE BY CIRCULAR DICHROISM
The information that “magnetic circular dichroism” is a variation of a more general method
(IsA “circular dichroism”) and “CD” is an abbreviation (i.e. synonym) of “circular
dichroism” is just not there.
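The missing information amounts to two small tables: one for synonyms and one for IsA links. The Python sketch below shows, under simplifying assumptions, how a query for the general method could then also retrieve entries annotated with the abbreviation or with the more specific method. The tables contain only the two facts stated above; the function names are illustrative, and no attempt is made to parse free-text phrases such as ‘STRUCTURE BY ...’.

```python
# Synonym -> preferred term (lower-case keys).
SYNONYMS = {"cd": "circular dichroism"}

# Child term -> parent term (a single IsA parent is enough for this example).
IS_A = {"magnetic circular dichroism": "circular dichroism"}


def normalise(term):
    """Lower-case, collapse whitespace, and replace a synonym by its preferred term."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def is_kind_of(term, ancestor):
    """True if term equals ancestor or reaches it by following IsA links."""
    current = normalise(term)
    target = normalise(ancestor)
    while current is not None:
        if current == target:
            return True
        current = IS_A.get(current)
    return False


annotations = ["CD", "MAGNETIC CIRCULAR DICHROISM", "CIRCULAR DICHROISM"]
hits = [a for a in annotations if is_kind_of(a, "circular dichroism")]
print(hits)
```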
The alpha-release of FIX (physico-chemical ontology for biology) is available at the OBO
website [ 22]. FIX includes two components: physico-chemical property and
physico-chemical method. In addition to IsA and IsPartOf, the relationship called
inferred_by between ‘property’ and ‘method’ entities is introduced. Of course
inferred_by can be considered merely a shortcut, for methods usually do not yield any
properties directly. Instead, one can design much more complex ontologies, e.g.
method (e.g. circular dichroism spectroscopy) based_on phenomenon (e.g.
circular dichroism) applied_to object (e.g. protein) yields data (e.g.
spectrum) contains feature (e.g. peak) corresponding_to (value_of)
property (e.g. “30% of alpha-helix”)
Currently, no verbal definitions are provided for FIX terms. However, the method can be
defined already via its place in ontology. For instance, “electron-nuclear double resonance
spectroscopy” (ENDOR; FIX:0000024) IsA “combined electron and nuclear magnetic
resonance spectroscopy” (FIX:0000165) which, in turn, IsA both “nuclear magnetic
resonance spectroscopy” (NMR; FIX:0000022) and “electron paramagnetic resonance
spectroscopy” (EPR; FIX:0000023).
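The ENDOR example can be written down directly as a small DAG. The following Python sketch stores only the three IsA links named above and collects every ancestor of a term, which is what ‘definition via place in the ontology’ amounts to in practice. The dictionary layout and the function name are illustrative assumptions, not the FIX file format.

```python
# Term identifier -> list of IsA parents, as stated in the text.
PARENTS = {
    "FIX:0000024": ["FIX:0000165"],                 # ENDOR
    "FIX:0000165": ["FIX:0000022", "FIX:0000023"],  # combined EPR/NMR
    "FIX:0000022": [],                              # NMR
    "FIX:0000023": [],                              # EPR
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable from term via IsA links."""
    seen = set()
    todo = list(parents.get(term, []))
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("FIX:0000024")))
```

Because FIX:0000165 has two parents, ENDOR inherits from both NMR and EPR, which a strict hierarchy could not express.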
FIX terms for physico-chemical properties may be used for annotation of both molecules (at
molecular level) and compounds (at molar level).
Biological function
If physico-chemical properties are not easily derived from chemical structure, then what
about biochemical properties? Biological function is not an immanent feature of the molecule
but a result of a specific interaction, e.g. with proteins or nucleic acids. The same compound
will behave remarkably differently in different organisms, or different cells, or different
metabolic pathways within the same cell. Therefore, the host of possible functions and
relationships one can think of, e.g. ‘a precursor of’, ‘an inhibitor of’, are only meaningful
when the biological context is specified.
Since all ‘functions’ of a molecule could be broadly divided into structural (to form a part of)
and chemical (participate in a reaction), they could be formed as cross-products of chemical
ontology with corresponding components of biological ontologies. At least, it seems that an
extension of Gene Ontology towards ‘small molecules’ is only logical: it is not only gene
products that can have molecular function, participate in biological process and form part
of cellular component! Of course, other ontologies (e.g. for toxicology or ecology) will
result in a different set of biological functions.
Vocabularies and Ontologies for Biochemical Reactions
Ontology of biochemical reactions
%biochemical reaction %binding reaction %biotransformation reaction %non-catalytic reaction %photoinduced reaction %spontaneous reaction %catalytic reaction %enzymatic reaction %deoxyribozymatic reaction %ribozymatic reaction %abzymatic reaction %intramolecular catalysis reaction %conformation change reaction %molecular transport reaction %electron transfer reaction %excitation-energy transfer reaction
Enzymatic reactions
The Enzyme Nomenclature [ 14], published by NC-IUBMB, provides the oldest controlled
vocabulary for biochemical function. Not surprisingly, EC numbers are often (mis)used for
annotation of gene products. It is important to remember that the basis of the Enzyme
Nomenclature is the overall reaction catalysed [ 23] (cf. overall transformation classification
in organic chemistry [ 24]), but not reaction mechanism or any other specific property of an
enzyme. Nevertheless, other biological catalysts such as ribozymes, deoxyribozymes or
catalytic antibodies (abzymes) do not form a part of Enzyme Nomenclature.
EC numbers form a strict hierarchy of IsA relationships. That means, any one EC number
belongs to one and only one sub-subclass, which belongs to one and only one subclass, which
belongs to one and only one class. Historically, the EC number served as both unique
identifier (ID) and descriptor of the enzyme’s place in the hierarchy. This dual function of EC
numbers is fairly limiting because it requires each enzyme to have a unique place in the hierarchy.
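The dual role of the EC number can be seen in a few lines of Python: the identifier alone determines the complete chain of ancestors, so no separate parent table is needed, and equally no second parent can be recorded. The sketch assumes fully specified four-number EC identifiers; the function name is illustrative.

```python
def ec_lineage(ec_number):
    """Return class, subclass, sub-subclass and entry for a four-number EC identifier."""
    parts = ec_number.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {ec_number!r}")
    return [".".join(parts[:i]) for i in range(1, 5)]


print(ec_lineage("1.1.1.91"))
```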
Subclasses in EC 1
%EC 1 oxidoreductases
 %EC 1.1 acting on the CH-OH group of donors
  %EC 1.1.1 with NAD+ or NADP+ as acceptor
  %EC 1.1.2 with a cytochrome as acceptor
  %EC 1.1.3 with oxygen as acceptor
  %EC 1.1.4 with a disulfide as acceptor
  %EC 1.1.5 with a quinone or similar compound as acceptor
  %EC 1.1.99 with other acceptors
 %EC 1.2 acting on the aldehyde or oxo group of donors
  %EC 1.2.1 with NAD+ or NADP+ as acceptor
  ...
 %EC 1.3 acting on the CH-CH group of donors
 %EC 1.4 acting on the CH-NH2 group of donors
 %EC 1.5 acting on the CH-NH group of donors
 %EC 1.6 acting on NADH or NADPH
 %EC 1.7 acting on other nitrogenous compounds as donors
 %EC 1.8 acting on a sulfur group of donors
 %EC 1.9 acting on a heme group of donors
 ...
However, an enzyme can be correctly classified in more than one way. For example, intramolecular
oxidoreductases (EC 5.3) are as much oxidoreductases (EC 1) as isomerases (EC 5). In every
subclass of oxidoreductases, the acceptors form repeating series, therefore an alternative
grouping is feasible, e.g. EC 1.1.1, EC 1.2.1, … EC 1.18.1 can be classified in an ‘EC 1.x.1’
subclass of oxidoreductases with NAD+ or NADP+ as acceptor:
Alternative subclasses in EC 1
%EC 1 oxidoreductases
 %EC 1.x.1 with NAD+ or NADP+ as acceptor
  %EC 1.1.1 acting on the CH-OH group of donors
  %EC 1.2.1 acting on the aldehyde or oxo group of donors
  %EC 1.3.1 acting on the CH-CH group of donors
  %EC 1.4.1 acting on the CH-NH2 group of donors
  ...
  %EC 1.18.1 acting on iron-sulfur proteins as donors
 %EC 1.x.2 with a heme protein as acceptor
 %EC 1.x.3 with oxygen as acceptor
 %EC 1.x.4 with a disulfide as acceptor
 %EC 1.x.5 with a quinone as acceptor
 %EC 1.x.7 with an iron–sulfur protein as acceptor
 %EC 1.x.6 with a nitrogenous group as acceptor
 %EC 1.x.8 with a flavin as acceptor
 %EC 1.x.99 with other acceptors
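The alternative grouping is a simple re-keying of the same identifiers by their third number instead of their second. A minimal Python sketch, using only sub-subclass codes mentioned in this section, is given below; the ‘1.x.N’ label format follows the listing above, and the function name is an illustrative assumption.

```python
from collections import defaultdict

# Sub-subclass codes of EC 1 taken from the listings in this section.
SUB_SUBCLASSES = ["1.1.1", "1.1.2", "1.1.3", "1.2.1", "1.3.1", "1.4.1", "1.18.1"]


def group_by_acceptor(codes):
    """Group sub-subclass codes 'class.donor.acceptor' under 'class.x.acceptor'."""
    groups = defaultdict(list)
    for code in codes:
        enzyme_class, _donor, acceptor = code.split(".")
        groups[f"{enzyme_class}.x.{acceptor}"].append(code)
    return dict(groups)


for label, members in group_by_acceptor(SUB_SUBCLASSES).items():
    print(label, members)
```

In a DAG both groupings can coexist: EC 1.1.1 would have EC 1.1 and ‘EC 1.x.1’ as parents at the same time.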
The limit of four levels does not allow additional hierarchical IsA relationships which
otherwise can be introduced on the basis of natural hierarchy of chemical compound classes,
for example:
%EC 1.1.1.2 alcohol dehydrogenase (NADP+)
 %EC 1.1.1.91 aryl-alcohol dehydrogenase (NADP+)
  %EC 1.1.1.97 3-hydroxybenzyl-alcohol dehydrogenase
The extension of Enzyme Nomenclature beyond the traditional six classes of overall
transformations will include e.g. reactions affecting non-covalent bonds and transport
phenomena [ 25]. Further modification of Enzyme Nomenclature is required to accommodate
reaction mechanisms and enable multiple ancestry for enzymatic reactions. Classification of
reaction mechanisms consists of two orthogonal components: (i) fundamental reaction
mechanism classes, and (ii) catalytic mechanism classes. The catalytic mechanism, substrate
and allosteric effector specificities are examples of orthogonal features relevant to the
enzyme structure and can be inherited independently.
Reversibility
The biochemical reactions appear in most of Enzyme Nomenclature entries as if they were
reversible. (In the other entries, the verbal description of the reaction often does convey the
direction, e.g. EC 3.1.6.7 “Hydrolysis of the 2- and 3-sulfate groups of the polysulfates of
cellulose and charonin”.) This is in contrast with both experimental evidence (it is difficult to
make a peptidase synthesise peptide bonds) and with the higher-order Enzyme
Nomenclature itself. Both class names (EC 3, Hydrolases; EC 6, Ligases) and subclass names
(e.g. EC 6.4 “Forming Carbon–Carbon Bonds”) imply the direction of the reaction. This
poses little problem for irreversible reactions or when the reaction can be catalysed by the
same enzyme in both directions. However, under physiological conditions (far from
equilibrium) the opposite reactions are often catalysed by different enzymes which are
nevertheless given the same EC number!
Succinate dehydrogenase (EC 1.3.5.1): succinate + Q → fumarate + QH2
Fumarate reductase (EC 1.3.5.1): fumarate + QH2 → succinate + Q
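The consequence for a database can be sketched in Python: if reactions are keyed by EC number alone, the two enzymes above collapse into one record, whereas a key that includes the direction keeps them apart. The record layout and function names below are illustrative assumptions; stoichiometry is ignored and each side of the reaction is stored as a set of compound names.

```python
from typing import FrozenSet, NamedTuple


class Reaction(NamedTuple):
    enzyme: str
    ec_number: str
    substrates: FrozenSet[str]
    products: FrozenSet[str]


def is_reverse(a, b):
    """True if reaction b is reaction a written in the opposite direction."""
    return a.substrates == b.products and a.products == b.substrates


def directed_key(reaction):
    """A key that distinguishes the two directions of one overall reaction."""
    return (reaction.ec_number, reaction.substrates, reaction.products)


SDH = Reaction("succinate dehydrogenase", "1.3.5.1",
               frozenset({"succinate", "Q"}), frozenset({"fumarate", "QH2"}))
FRD = Reaction("fumarate reductase", "1.3.5.1",
               frozenset({"fumarate", "QH2"}), frozenset({"succinate", "Q"}))

print(SDH.ec_number == FRD.ec_number)            # same EC number
print(is_reverse(SDH, FRD))                      # opposite directions
print(directed_key(SDH) == directed_key(FRD))    # distinguishable once direction is kept
```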
Electron transfer and excitation-energy transfer reactions
The term “pure electron transferase” was originally introduced as a name for a class of
flavoproteins (exemplified by flavodoxins) where the flavin is reduced and reoxidised in
one-electron steps [ 26, 27]. The meaning can be naturally extended to cover all the proteins
that catalyse electron transfer reactions only, such as cytochromes, ferredoxins and
cupredoxins. Although proteins involved in electron transfer are usually classified as
oxidoreductases, none of the ‘pure electron transferases’ is assigned an EC number.
Similarly, the excitation-energy transfer processes as in the antenna systems of
photosynthetic organisms are not covered by Enzyme Nomenclature.
Analogous to ‘traditionally understood’ metabolic pathways that consist of separate
enzymatic reactions, electron/exciton transfer reactions form electron/exciton transfer
pathways, which form an integral part of metabolic pathways.
Transmembrane transport
A great number of fundamental biochemical reactions can be represented as
X(compartment A) → X(compartment B)
Most of these are not spontaneous and thus have to be facilitated by specific carriers or
transporters [ 28, 29]. Importantly, a distinct class (Energases) has been proposed to cover the
enzymes that catalyse the conversion of chemical energy into mechanical energy [ 25]. Energases include
primary active transporters (directly utilising covalent bond energy to transport solutes
against a concentration gradient) and rotational molecular motors such as ATP synthase.
However, there is no reason to deny the other transporters their place in the enzyme
classification. Some electron transferases are also transmembrane transporters (TC 5 in [ 28]).
Non-enzymatic reactions
Finally, non-enzymatic biochemical reactions occur in vivo. In addition to other naturally
occurring ‘zymes’, the examples include Fenton chemistry [ 30], photoinduced
transformation of ergosterol to previtamin D3 and its subsequent thermal isomerisation to
vitamin D3 [ 31], etc. Again, these reactions form part of metabolic pathways.
Yet other reactions
Some biochemical reactions are neither catalytic nor spontaneous. For example, the comment for
methylated-DNA—[protein]-cysteine S-methyltransferase (EC 2.1.1.63) reads: “Since the
acceptor protein is the ‘enzyme’ itself and the S-methyl-L-cysteine derivative formed is
relatively stable, the reaction is not catalytic.” The reaction proceeds through suicidal alkyl
transfer from guanine O6 to the cysteine residue of the enzyme; therefore the enzyme should
be present in stoichiometric, not catalytic, amounts. The reaction catalysed by EC 2.1.1.63 fits
the definition of intramolecular catalysis [ 32]. On the one hand, the intramolecular catalyst is
a kind of catalyst, since “the catalyst is both a reactant and product of the reaction” [ 33]. But
if the direct result of the reaction is an inactivated catalyst, it makes the whole process
non-catalytic (according to the Gold Book).
The term “autocatalytic reaction” is often used in a meaning not consistent with the Gold
Book definition [ 34], e.g., “autocatalytic quinone-methide mechanism of protein
flavinylation” [ 35] or “autocatalytic formation of a thioether cross-link between the
active-site residues” in galactose oxidase [ 36]. These are in fact intramolecular catalysis
events.
(Bio)chemical Resources at the European Bioinformatics Institute
IntEnz
At the EBI, enzyme classification is collected in the Integrated relational Enzyme database
(IntEnz) [ 37], a joint project with Trinity College Dublin (TCD), the Swiss Institute of
Bioinformatics (SIB) and the University of Cologne, supported by the NC-IUBMB.
Currently, IntEnz contains enzyme data curated and approved by the members of
NC-IUBMB. The goal is to create a single relational database containing all the relevant
enzyme data, including those from ENZYME [ 19] and BRENDA [ 38] databases.
chemPDB
The chemPDB service [ 39] provides access to the ligands and small molecule dictionary of
the Macromolecular Structure Database (MSD) developed at the EBI [ 40]. chemPDB is
described as “consistent and enriched library of ligands, small molecules and monomers that
are referred as residues and ‘HET groups’ in any PDB entry”. Each entry includes the
standard three-letter code, one or more molecule names, RCSB classification of molecules,
formula, stereo and non-stereo SMILES, fingerprint, 2-D diagram, idealised 3-D coordinates
(including calculated hydrogen atom positions) or 3-D coordinates from a pre-selected PDB
entry. In addition, many entries contain automatically generated IUPAC systematic names.
The search facility provides functionality for queries based on chemical equivalence,
similarity, substructure and superstructure.
RESID
The RESID Database of Protein Modifications is created and supported by John Garavelli
[ 41]. The RESID Database is a comprehensive collection of annotations and structures for
protein pre-, co- and post-translational modifications including amino-terminal,
carboxyl-terminal and peptide chain cross-links. RESID includes: systematic and alternate
names, atomic formulae and masses, enzyme activities generating the modifications (with
corresponding cross-references to GO), keywords, literature citations, protein sequence
database feature table annotations, 2-D structure diagrams and 3-D molecular models.
Release 34.01 (15 August 2003) contains 339 entries.
IUPHAR Receptor Database
The IUPHAR Receptor Database [ 42] is created at the EBI and is edited by the members of the
International Union of Pharmacology Committee on Receptor Nomenclature and Drug
Classification (NC-IUPHAR). It is implemented as a relational database containing official
NC-IUPHAR recommendations for receptor nomenclature and classification [ 43]. Future
developments will aim to expand the current database to include non-sensory G
protein-coupled receptors, nuclear receptors, ligand-gated ion channels, and voltage-gated
ion channels [ 44]. Although receptors and ion channels are macromolecular structures, their
classification according to ligand provides a basis for reciprocal NC-IUPHAR classification
(ontology) of ligands according to their receptors. This classification is orthogonal to
biochemical ontologies mentioned before. It has to be noted that compounds of
pharmacological interest, apart from ‘small compounds’, include polypeptide-derived
hormones, toxins and other polypeptides with known pharmacological effects, which can be
cross-referenced to protein sequence databases.
COMe
COMe (Co-Ordination of Metals, etc.) represents the ontology for bioinorganic centres in
complex proteins [ 45]. COMe consists of three types of entities: ‘bioinorganic motif’ (BIM),
‘molecule’ (MOL), and ‘complex proteins’ (PRX); each entity is assigned a unique
identifier. A BIM consists of at least one centre (metal atom, inorganic cluster, organic
molecule) and two or more endogenous and/or exogenous ligands. BIMs are represented as
one-dimensional (1-D) strings and 2-D diagrams. The MOL entity represents a ‘small molecule’
which, in complex with polypeptide(s), forms a functional protein. The PRX entity refers to
the functional protein as well as separate protein domains and subunits. The main groups of
complex proteins in COMe are (i) metalloproteins, (ii) organic prosthetic group proteins and
(iii) modified amino acid proteins. In addition to IsA and IsPartOf relationships, the
IsBoundTo relationship is introduced. It can occur only between MOL (child) and PRX
(parent). It is used because a molecule which IsBoundTo a protein can be changed
chemically and, strictly speaking, become a different entity. The data are currently stored in
both XML format and a relational database and available via the Web [ 46].
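The restriction on IsBoundTo can be made concrete with a small data model. This is a hedged sketch, not the actual COMe schema: the class names, the identifier strings "MOL:0001" and "PRX:0001", and the example names are all hypothetical; only the three entity types, the three relationship types and the MOL (child) to PRX (parent) rule come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    BIM = "bioinorganic motif"
    MOL = "molecule"
    PRX = "complex protein"


class RelationType(Enum):
    IS_A = "IsA"
    IS_PART_OF = "IsPartOf"
    IS_BOUND_TO = "IsBoundTo"


@dataclass(frozen=True)
class Entity:
    identifier: str  # unique identifier; the format used here is an assumption
    type: EntityType
    name: str


@dataclass(frozen=True)
class Relation:
    child: Entity
    type: RelationType
    parent: Entity

    def __post_init__(self):
        # IsBoundTo can occur only between MOL (child) and PRX (parent).
        if self.type is RelationType.IS_BOUND_TO and not (
            self.child.type is EntityType.MOL
            and self.parent.type is EntityType.PRX
        ):
            raise ValueError(
                "IsBoundTo is only allowed between MOL (child) and PRX (parent)"
            )


# Hypothetical example entities.
molecule = Entity("MOL:0001", EntityType.MOL, "example small molecule")
protein = Entity("PRX:0001", EntityType.PRX, "example complex protein")
bound = Relation(molecule, RelationType.IS_BOUND_TO, protein)
```

Constructing the relation in the opposite direction, with the protein as child, raises a ValueError, so the constraint is enforced when the data are entered rather than when they are queried.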
Towards the unified dictionary of biochemical compounds
There is no authoritative database of biochemical compounds in the public domain. This is a
serious lack, as many biomedical databases need to refer to, or use data attached to,
biochemical compounds. Within the EBI alone these include MSD, Swiss-Prot, IntEnz,
IUPHAR Receptor Database and GO, as well as some databases developed in the EBI
research groups. More broadly, many other public biomedical databases have the same need.
As mentioned in the previous sections, several groups at the EBI build their own in-house
controlled vocabularies and/or chemical databases. To avoid unnecessary duplication of
effort, a project was initiated to create a definitive, freely available dictionary of Chemical
compounds of Biological Interest (ChEBI). Data in ChEBI should be definitive in the sense
that terminology would be explicitly endorsed, where applicable, by IUPAC (systematic
names), NC-IUBMB (biochemical nomenclature) or NC-IUPHAR (drug classification).
More specifically, the most immediate goal of ChEBI is to provide a public reference for
biochemical compounds consisting of
• Substrates, products, cofactors, activators and inhibitors of enzymes (IntEnz)
• Ligands of receptors (IUPHAR Receptor Database)
• ‘Small molecules’ bound to macromolecules (chemPDB)
• Amino acid residues and their post-translational modifications (RESID)
• Metals and organic molecules bound to proteins as prosthetic groups (COMe)
• Molecules interacting with proteins (Swiss-Prot)
• Molecules involved in basic biological processes (GO)
Such a collection can be classified as a ‘small’ database. Our most optimistic estimate is that
in the next five years ChEBI will have no more than 50,000 curated entries. Therefore, it will
include only a small fraction of data provided by comprehensive commercial chemical
databases such as Beilstein [ 47] or CAS Registry [ 48] (cf. 4,519 compounds in IntEnz and
~5,000 in chemPDB vs more than 8 million substances in the Beilstein Database, with over
500,000 classified as ‘bioactive’). The same is true of the depth of the database (the number of
properties likely to be documented).
In creating such a database the following principles should be upheld.
• Nothing held in the database must be proprietary or derived from a proprietary source
that would limit its free distribution/availability to anyone.
• Every data item in the database should be fully traceable and explicitly referenced to
the original source/version.
• Although the EBI will provide a web interface, the entirety of the data should be
available to all without constraint as, for example, PostgreSQL or MySQL table
dumps, mmCIF and XML (e.g. DAML+OIL, CML, etc.).
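The traceability and open-format principles above could be combined roughly as follows. This is an illustrative sketch under stated assumptions: the `DataItem` class, the `to_xml` function, the element and attribute names, and the identifier "EXAMPLE:1" are all hypothetical and do not describe an actual ChEBI schema. The only detail borrowed from the text is RESID release 34.01, used here as an example source version.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class DataItem:
    kind: str     # what the item is, e.g. "name" or "formula"
    value: str
    source: str   # database the item was taken from
    version: str  # release of that source


def to_xml(compound_id, items):
    """Serialise a compound, refusing any item without an explicit source."""
    root = ET.Element("compound", id=compound_id)
    for item in items:
        if not item.source or not item.version:
            raise ValueError(f"untraceable data item: {item.kind}")
        element = ET.SubElement(
            root, item.kind, source=item.source, version=item.version
        )
        element.text = item.value
    return ET.tostring(root, encoding="unicode")


# Hypothetical compound with one fully referenced data item.
xml_text = to_xml(
    "EXAMPLE:1",
    [DataItem("name", "example compound", "RESID", "34.01")],
)
```

Attaching the source and version to every item, rather than to the entry as a whole, keeps each value traceable even after entries from several databases have been merged.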
At this early stage our primary objective is to standardise biochemical terminology, but the
next step will necessarily concern structures. In accordance with the principles outlined
above, we plan to adopt open standards for chemical structure representation, such as IUPAC
Chemical Identifier (IChI) [ 49] for 2-D structures and CIF [ 50] for 3-D structures. The
connectivity and stereochemistry (2-D structure) for the majority of small organic molecules in
ChEBI (including isotope-labelled ones) could be stored as IChI. Other molecules will be linked
to corresponding 1-D or 3-D databases.
Acknowledgements
I wish to thank my colleagues at the EBI, Sergio Contrino, Michael Darsow and Paula de
Matos. I thank Gillian Adams for her helpful comments and suggestions on the manuscript. I
am indebted to Michael Ashburner (University of Cambridge) and Steve Stein (NIST), and
this paper is the ultimate result of our e-mail exchanges.
References
1. Carugo, O. and Pongor, S. (2002) The evolution of structural databases. Trends
Biotechnol. 20, 498–501.
2. The Gene Ontology Consortium, http://www.geneontology.org/
3. The EMBL Nucleotide Sequence Database, http://www.ebi.ac.uk/embl/
4. The Swiss-Prot Protein Knowledgebase, http://www.ebi.ac.uk/swissprot/
5. The Protein Data Bank, http://www.pdb.org/
6. NCBI Taxonomy, http://www.ncbi.nlm.nih.gov/Taxonomy/
7. GO File Format Guide, http://www.geneontology.org/doc/GO.format.html
8. Open Biology Ontologies, http://obo.sourceforge.net/
9. The BioCyc Knowledge Library, http://BioCyc.org/
10. http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[RESID:AA0283]
11. Metanomski, W.V., Ed. (1991) Compendium of Macromolecular Nomenclature (“The
Purple Book”). Blackwell Scientific Publications, Oxford.
12. The Cyclotide Webpage, http://www.cyclotide.com/
13. James, C.A., Weininger, D., Delany, J. (2003) Daylight Theory Manual,
http://www.daylight.com/dayhtml/doc/theory/theory.toc.html
14. Enzyme Nomenclature: Recommendations (1992) of the Nomenclature Committee of
the International Union of Biochemistry and Molecular Biology. Academic Press, San
Diego.
15. IUBMB Glossary of Chemical Names,
http://www.chem.qmul.ac.uk/iubmb/enzyme/glossary.html
16. LIGAND database of chemical compounds and reactions in biological pathways,
http://www.genome.ad.jp/ligand/
17. NIST Chemistry WebBook, http://webbook.nist.gov/chemistry/
18. Chemical Ontology,
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/obo/obo/ontology/biochemical/
19. The ENZYME database, http://www.expasy.org/enzyme/
20. The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD),
http://umbbd.ahc.umn.edu/
21. Panico, R., Powell, W.H. and Richer, J.C., Eds. (1993) A Guide to IUPAC
Nomenclature of Organic Compounds, Recommendations 1993 (“The Blue Book”).
Blackwell Scientific Publications, Oxford.
22. Physico-Chemical Ontology (FIX),
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/obo/obo/ontology/physicochemical/
23. Tipton, K. and Boyce, S. (2000) History of the enzyme nomenclature system.
Bioinformatics 16, 34–40.
24. Grossman, R.B. (1999) The Art of Writing Reasonable Organic Reaction Mechanisms.
Springer-Verlag, New York.
25. Purich, D.L. (2001) Enzyme catalysis: a new definition accounting for noncovalent
substrate- and product-like states. Trends Biochem Sci. 26, 417–421.
26. Hemmerich, P., Massey, V. and Fenner, H. (1972) Flavin and 5-deazaflavin: A
chemical evaluation of ‘modified’ flavoproteins with respect to the mechanisms of
redox biocatalysis. FEBS Lett. 84, 5–21.
27. Nomenclature Committee of the International Union of Biochemistry (NC-IUB)
(1991) Nomenclature of electron-transfer proteins. Recommendations 1989. Eur. J.
Biochem. 200, 599–611; http://www.chem.qmul.ac.uk/iubmb/etp/
28. Transport Protein Database, http://tcdb.ucsd.edu/tcdb/
29. Nomenclature Committee of the International Union of Biochemistry and Molecular
Biology (NC-IUBMB) (2002) Membrane transport proteins. Recommendations 2002.
http://www.chem.qmul.ac.uk/iubmb/mtp/
30. Liochev, S.I. (1999) The mechanism of “Fenton-like” reactions and their importance
for biological systems. A biologist’s view. Metal Ions Biol. Syst. 36, 1–39.
31. Holick, M.F. (1995) Defects in the synthesis and metabolism of vitamin D. Exp. Clin.
Endocrinol. Diabetes 103, 219–227.
32. In McNaught, A.D. and Wilkinson, A., Eds. (1997) Compendium of Chemical
Terminology (“The Gold Book”), 2nd Edition. Blackwell Scientific Publications,
Oxford, p. 206.
33. Ibid., p. 58.
34. Ibid., p. 34.
35. Edmondson, D.E. and Newton-Vinson, P. (2001) The covalent FAD of monoamine
oxidase: structural and functional role and mechanism of the flavinylation reaction.
Antioxid. Redox Signal. 3, 789–806.
36. Firbank, S.J., Rogers, M., Hurtado-Guerrero, R., Dooley, D.M., Halcrow, M.A.,
Phillips, S.E.V., Knowles, P.F. and McPherson, M.J. (2003) Cofactor processing in
galactose oxidase. Biochem. Soc. Trans. 31, 506–509.
37. IntEnz: Integrated relational Enzyme database, http://www.ebi.ac.uk/IntEnz/
38. BRENDA: The Comprehensive Enzyme Information System,
http://www.brenda.uni-koeln.de/
39. MSD Ligand Chemistry, http://www.ebi.ac.uk/msd-srv/chempdb/
40. E-MSD: the European Bioinformatics Institute Macromolecular Structure Database,
http://www.ebi.ac.uk/msd/
41. The RESID Database of Protein Modifications,
ftp://ftp.ebi.ac.uk/pub/databases/RESID/
42. IUPHAR Receptor Database, http://www.ebi.ac.uk/iuphar-rd/
43. International Union of Pharmacology (2000) The IUPHAR Compendium of Receptor
Characterization and Classification, 2nd edition. IUPHAR Media, London.
44. Catterall, W.A., Chandy, K.G. and Gutman, G.A., Eds. (2002) The IUPHAR
Compendium of Voltage-gated Ion Channels. IUPHAR Media, Leeds.
45. Degtyarenko, K. and Contrino, S. (2003) COMe: the ontology of bioinorganic proteins.
The Chemistry Preprint Server (CPS: biochem/0307002).
46. COMe, http://www.ebi.ac.uk/come/
47. CrossFire Beilstein, http://www.mdl.com/products/knowledge/crossfire_beilstein/
48. CAS Registry, http://www.cas.org/EO/regsys.html
49. IUPAC Chemical Identifier (IChI) Project,
http://www.iupac.org/projects/2000/2000-025-1-800.html
50. IUCr Crystallographic Information File, http://www.iucr.org/iucr-top/cif/