Advance Publication by J-STAGE Genes & Genetic Systems Received for publication: December 15, 2016 Accepted for publication: January 29, 2017 Published online: March 24, 2017

Advance Publication by J-STAGE

Genes & Genetic Systems

Received for publication: December 15, 2016

Accepted for publication: January 29, 2017

Published online: March 24, 2017

An artificial intelligence approach fit for tRNA gene studies in the era of big sequence data

Yuki Iwasaki1,†, Takashi Abe2,*, Kennosuke Wada1, Yoshiko Wada1, and Toshimichi Ikemura1,*

1 Department of Bioscience, Nagahama Institute of Bio-Science and Technology, 1266 Tamura-Cho, Nagahama, Shiga 526-0829, Japan

2 Department of Information Engineering, Faculty of Engineering, Niigata University, 2-8050, Ikarashi, Nishi-ku, Niigata 950-2181, Japan

* Corresponding authors

Takashi Abe

Department of Information Engineering, Faculty of Engineering, Niigata University, 2-8050, Ikarashi, Nishi-ku, Niigata-ken 950-2181, Japan

Tel: +81-25-262-6745

Fax: +81-25-262-6745

Email: [email protected]

Toshimichi Ikemura

Department of Bioscience, Nagahama Institute of Bio-Science and Technology, 1266 Tamura-Cho, Nagahama, Shiga-ken 526-0829, Japan

Tel: +81-749-64-8100

Fax: +81-749-64-8140

E-mail: [email protected]

†Present Address: National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan

Running head: AI approach for tDNA studies

Key words: tRNA gene, metagenome, oligonucleotide composition, big data, phylogenetic marker

ABSTRACT

Unsupervised data mining capable of extracting a wide range of knowledge from big data

without prior knowledge or particular models is a timely application in the era of big

sequence data accumulation in genome research. By handling oligonucleotide compositions

as high-dimensional data, we have previously modified the conventional self-organizing map

(SOM) for genome informatics and established BLSOM, which can analyze more than ten

million sequences simultaneously. Here, we develop BLSOM specialized for tRNA genes

(tDNAs) that can cluster (self-organize) more than one million microbial tDNAs according to

their cognate amino acid solely depending on tetra- and pentanucleotide compositions. This

unsupervised clustering can reveal combinatorial oligonucleotide motifs that are responsible

for the amino acid-dependent clustering, as well as other functionally and structurally

important consensus motifs, which have been evolutionarily conserved. BLSOM is also

useful for identifying tDNAs as phylogenetic markers for special phylotypes. When we

constructed BLSOM with ‘species-unknown’ tDNAs from metagenomic sequences plus

‘species-known’ microbial tDNAs, a large portion of metagenomic tDNAs self-organized

with species-known tDNAs, yielding information on microbial communities in environmental

samples. BLSOM can also enhance accuracy in the tDNA database obtained from big

sequence data. This unsupervised data mining should become important for studying

numerous functionally unclear RNAs obtained from a wide range of organisms.

INTRODUCTION

Compilation of tRNA sequences and genes was originally established by Sprinzl and

coworkers (Sprinzl et al., 1978; Sprinzl and Vassilenko, 2005) and updated (tRNAdb;

http://trnadb.bioinf.uni-leipzig.de/) (Jühling et al., 2009). Using tRNAscan-SE, the Genomic

tRNA Database (GtRNAdb; http://lowelab.ucsc.edu/GtRNAdb/) has been constructed for

complete and nearly complete genomes (Chan and Lowe, 2009). In addition, genomic

organization of eukaryotic tRNAs has been extensively studied, showing complex lineage-

specific variability (Bermudez-Santana et al., 2010). For both completely and partially

sequenced genomes, as well as vast numbers of metagenomic sequences from wide varieties

of environmental and clinical samples, we have constructed and updated a large-scale

database of tRNA genes (tDNAs) “tRNADB-CE” (http://trna.ie.niigata-u.ac.jp) (Abe et al.,

2011). Metagenomic sequences have attracted broad scientific and industrial interest, and

even short sequences obtained with new-generation sequencers (e.g., Sequence Read Archive

in NCBI, http://www.ncbi.nlm.nih.gov/Traces/sra/) contain numerous full-length tDNAs

because of their short length. The number of tDNAs compiled in tRNADB-CE is already

large (1.7 million genes) and will undoubtedly increase more rapidly in the future. For

efficient knowledge discovery from such big data, new tools are important for promoting

promising research on genes for tRNAs and other RNAs, including a wide variety of

functionally unclear RNAs.

In the era of big sequence data obtained from high-throughput DNA sequencers, it

becomes important to establish an unsupervised data mining method capable of extracting a wide

range of knowledge without prior knowledge, hypotheses, or particular models from

numerous genomic sequences (e.g., tDNA sequences) covering a wide range of species, for

which experimental studies other than DNA sequencing have been scarce. Various

unsupervised data mining methods, such as K-means clustering and Fuzzy ART (Forgy, 1965;

Carpenter et al., 1991; Hastie et al., 2009), have been developed; and we have previously

developed an unsupervised clustering method “BLSOM (Batch-Learning Self-Organizing-

Map)” (Kanaya et al., 2001; Abe et al., 2003, 2005; Kikuchi et al., 2015), which can analyze

more than ten million genomic sequences simultaneously and allows acquisition of a wide

range of knowledge from big sequence data. For example, BLSOM with oligonucleotide (e.g.,

tetranucleotide) composition could cluster genomic sequence fragments (e.g., 1-kb sequences)

according to phylotype, and thus succeeded in phylogenetic classification of a large number

of metagenomic sequences (Uchiyama et al., 2005; Uehara et al., 2011; Nakao et al., 2013).

Oligonucleotides, such as penta- and hexanucleotides, often represent motif sequences

responsible for sequence-specific protein binding such as transcription-factor binding, and

their occurrences should differ from occurrences expected from mononucleotide composition

in each genome and among genomic portions within one genome. An analysis of the human

genome with pentanucleotide BLSOM has unexpectedly found evident enrichment of many

kinds of transcription-factor-binding motifs in pericentric heterochromatin regions (Iwasaki

et al., 2013), showing that BLSOM effectively detects characteristic, combinatorial

occurrences of functional motif oligonucleotides in genomic sequences with no prior

knowledge.

BLSOM is well suited to high-performance parallel computing, and thus to the

analysis of high-dimensional big data. Here, we have tested its usefulness for data mining

from a large number of tDNA sequences. Indeed, BLSOM for tDNAs can reveal

combinatorial oligonucleotide motifs responsible for their amino acid-dependent clustering,

and BLSOM for species-unknown tDNAs obtained from metagenomic sequences plus

species-known microbial tDNAs can give information on microbial communities in

environmental samples.

MATERIALS AND METHODS

tDNA sequences

To enhance the completeness and accuracy of tDNAs compiled in tRNADB-CE, three

computer programs, tRNAscan-SE (Lowe and Eddy, 1997), ARAGORN (Laslett and

Canback, 2004), and tRNAfinder (Kinouchi and Kurokawa, 2006) were used in combination,

since their algorithms partially differed and rendered somewhat different results. tDNAs

found concordantly by all three programs were stored in tRNADB-CE, and discordant cases

among programs were manually checked by experts in tRNA experimental fields (Abe et al.,

2011). The present study has constructed BLSOMs for tri-, tetra-, and pentanucleotide

compositions in tDNAs in tRNADB-CE (Version 7.0: Last update, 2014/01/25). Since a

portion of tDNAs lack the terminal CCA sequence, and one purpose of this study is to

establish a strategy for using tDNAs as phylogenetic markers, the CCA terminus sequence

has been excluded from BLSOM analyses in the present study. However, the inclusion of

CCA sequence for BLSOM analyses of tDNAs may enhance the amino acid-dependent

clustering for tDNAs containing the CCA sequence; a high level of the CCA terminus has been

found in RNA-seq data (Findeiss et al., 2011), and the importance of the CCA terminus in the

genomic tag has been proposed (Weiner and Maizels, 1987). Therefore, the combinatorial use

of CCA-plus and -minus analyses may provide new additional information.
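
As an illustration of how a tDNA can be turned into the oligonucleotide-composition vector described above, the following is a minimal Python sketch. The function name kmer_composition and its parameters are assumptions introduced here and are not part of the BLSOM software. Removing the terminus with a simple suffix test is also a simplification: a gene that lacks the CCA terminus but happens to end in CCA would be trimmed wrongly, so in practice the CCA annotation reported by the gene-finding program should be used.

```python
from itertools import product


def kmer_composition(seq, k=4, strip_cca=True):
    """Return overlapping k-mer counts of one tDNA as a fixed-order vector.

    Hypothetical helper for illustration; the vector has 4**k entries,
    ordered as itertools.product("ACGT", repeat=k).
    """
    seq = seq.upper()
    if strip_cca and seq.endswith("CCA"):
        seq = seq[:-3]  # drop the 3'-terminal CCA (naive suffix test)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with N or other symbols are skipped
        if j is not None:
            vec[j] += 1
    return vec
```

With k = 3, 4, and 5 this gives vectors of 64, 256, and 1,024 variables, matching the Tri, Tetra, and Penta analyses.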

BLSOM

SOM (Self-Organizing Map) is an unsupervised clustering algorithm that nonlinearly maps

high-dimensional vectorial data onto a two-dimensional array of lattice points; i.e., a flexible

net that is spread into the multi-dimensional "data cloud" (Kohonen, 1982; Kohonen et al.,

1996). We previously modified the conventional SOM for genome informatics to make the

learning process and resulting map independent of the order of data input on the basis of

batch-learning SOM; BLSOM (Kanaya et al., 2001). The initial vectors were defined by

principal component analysis (PCA) instead of random values.
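
The order independence of batch learning can be illustrated with a generic batch-map update. The sketch below is not the BLSOM program: it uses a Gaussian neighbourhood and a caller-supplied radius, both of which are assumptions made here, and it takes the initial weights as an argument rather than reproducing the PCA-based initialization. The function name batch_som_epoch is hypothetical.

```python
import math


def batch_som_epoch(weights, data, radius):
    """One batch-learning epoch on a rows x cols lattice (generic sketch).

    weights[r][c] is a weight vector; data is a list of input vectors.
    Every input is first assigned to its best-matching lattice point, and
    the weights are updated only after all inputs have been seen, so the
    order of the inputs cannot change the result.
    """
    rows, cols, dim = len(weights), len(weights[0]), len(data[0])
    num = [[[0.0] * dim for _ in range(cols)] for _ in range(rows)]
    den = [[0.0] * cols for _ in range(rows)]
    for x in data:
        # best-matching unit: lattice point with the smallest squared distance
        br, bc = min(
            ((r, c) for r in range(rows) for c in range(cols)),
            key=lambda rc: sum(
                (w - v) ** 2 for w, v in zip(weights[rc[0]][rc[1]], x)
            ),
        )
        for r in range(rows):
            for c in range(cols):
                # Gaussian neighbourhood on the lattice (an assumption here)
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * radius ** 2))
                den[r][c] += h
                for d in range(dim):
                    num[r][c][d] += h * x[d]
    new = []
    for r in range(rows):
        row = []
        for c in range(cols):
            if den[r][c] > 0:
                row.append([num[r][c][d] / den[r][c] for d in range(dim)])
            else:
                row.append(list(weights[r][c]))  # no influence: keep old weight
        new.append(row)
    return new
```

In a full run, the radius would be reduced over successive epochs; that schedule is omitted here.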

The frequency of each pentanucleotide obtained from vectorial data representing each

lattice point on Penta in Fig. 1A is calculated and normalized with the level expected from

the mononucleotide composition calculated from vectorial data representing the lattice point.

The observed/expected ratio is illustrated in red (overrepresented), blue (underrepresented),

and white (moderately represented) according to Abe et al. (2003) (Fig. 1D). Since there are

1,024 (= 4^5) different types of pentanucleotides, the occurrence of one type of

pentanucleotides in one tDNA is primarily 0 or 1. Thus, red and blue primarily show the

presence and absence of each pentanucleotide.
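
The observed/expected normalization can be sketched as follows. This is one plausible reading of the procedure, stated as an assumption: the mononucleotide composition is derived from the k-mer weights themselves, and the expected frequency of a k-mer is the product of the frequencies of its bases. The function name observed_expected is hypothetical, and the keys are assumed to contain only A, C, G, and T.

```python
def observed_expected(kmer_weights):
    """Map each k-mer to observed frequency / frequency expected from base composition.

    kmer_weights: dict from k-mer string (all of equal length, ACGT only)
    to a non-negative weight, e.g. the vector held by one lattice point.
    """
    total = sum(kmer_weights.values())
    if total == 0:
        raise ValueError("all weights are zero")
    k = len(next(iter(kmer_weights)))
    base = {b: 0.0 for b in "ACGT"}
    for kmer, w in kmer_weights.items():
        for b in kmer:
            base[b] += w  # each k-mer contributes its weight once per base
    base_total = total * k
    ratios = {}
    for kmer, w in kmer_weights.items():
        expected = 1.0
        for b in kmer:
            expected *= base[b] / base_total
        ratios[kmer] = (w / total) / expected if expected > 0 else float("nan")
    return ratios
```

Ratios well above 1 would be drawn in red and ratios well below 1 in blue; the thresholds used for the colour scale are those of Abe et al. (2003) and are not reproduced here.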

Parts-BLSOM

Computer programs used to find tDNAs can divide each tDNA sequence into the following

structural parts: 5’ side of acceptor-stem, D-arm, anticodon-arm, variable-arm (V-arm), T-

arm, 3’ side of acceptor-stem, and CCA-terminus. BLSOM, in which oligonucleotides found

in different structural parts are differentially counted, is designated Parts-BLSOM (for the

correspondence with the secondary cloverleaf structure, see Supplementary Fig. S1). Since

the CCA terminus sequence has been excluded from the present BLSOM analysis, PartsTri

and -Tetra treat 384 (=64 x 6) and 1536 (=256 x 6) variables, respectively. PartsPenta was not

included because of the V-arm’s short length for several amino acids; e.g., shorter than 5-nt

for Glu, Cys, and Gly. BLSOM programs can be obtained from our web site

(http://bioinfo.ie.niigata-u.ac.jp/?BLSOM). Distances of weight vectors between

neighbouring lattice points on BLSOM can be visualized as black levels with a U-matrix

method (Ultsch, 1993), and this provides information about similarity of oligonucleotide

composition in local areas on BLSOM (Iwasaki et al., 2013).
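
The Parts-BLSOM input can be pictured as six per-part composition vectors laid end to end. The sketch below assumes that each tDNA has already been split into its structural parts by the gene-finding program; the part labels in PARTS and the function name parts_composition are hypothetical names chosen for this illustration.

```python
from itertools import product

# Hypothetical labels for the six parts used after excluding the CCA terminus.
PARTS = ["acceptor5", "D-arm", "anticodon-arm", "V-arm", "T-arm", "acceptor3"]


def parts_composition(part_seqs, k=3):
    """Concatenate per-part k-mer counts into one vector of 4**k * 6 variables.

    part_seqs: dict from part label to that part's sequence; parts that are
    missing or shorter than k simply contribute zeros.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0] * (len(kmers) * len(PARTS))
    for p, part in enumerate(PARTS):
        seq = part_seqs.get(part, "").upper()
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:
                vec[p * len(kmers) + j] += 1  # same k-mer, separate slot per part
    return vec
```

With k = 3 and k = 4 the vector lengths are 384 and 1536, as stated for PartsTri and PartsTetra.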

RESULTS

Oligonucleotide BLSOMs for bacterial tDNAs

Each tRNA has characteristic, combinatorial occurrences of various motif oligonucleotides

required to fulfil its function (e.g., binding to proper enzymes and rRNAs) and structural

formation (L-shaped form). To examine the usefulness of BLSOM for efficient knowledge

discovery from massive numbers of tDNAs, we have conducted BLSOMs for tri-, tetra-, and

pentanucleotide compositions in approximately 0.4 million tDNAs from more than 7000

bacterial genomes that are categorized as “Reliable tRNAs” in tRNADB-CE (Tri, Tetra, and

Penta in Fig. 1A, respectively); archaeal and fungal tDNAs will be analyzed later. Lattice

points containing tDNAs belonging to one amino acid are indicated in the color representing

the amino acid and those belonging to multiple amino acids are indicated in black. Most

lattice points, especially on Tetra and Penta, are colored, showing tDNAs to be separated

(self-organized) primarily by amino acid. Table 1 presents percentages of tDNAs located at

colored pure lattice points (i.e. lattice points containing tDNAs of one amino acid), showing

the amino acid-dependent clustering to be higher on Tetra and Penta than on Tri. Importantly,

the high level of amino acid-dependent clustering was obtained with no information other

than oligonucleotide composition. Thus, BLSOM should pick out characteristic combinations

of motif sequences required for proper recognition by various enzymes, such as aminoacyl-

tRNA synthetase (aaRS), and of sequences supporting proper L-shaped structures.
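
The measure reported in Table 1 can be sketched as below, assuming that each tDNA has already been mapped to a lattice point. The function name percent_at_pure_points and the input layout are assumptions introduced for this illustration.

```python
from collections import defaultdict


def percent_at_pure_points(assignments):
    """Percentage of tDNAs of each amino acid located at pure lattice points.

    assignments: iterable of (lattice_point, amino_acid), one entry per tDNA.
    A lattice point is pure when all tDNAs on it belong to one amino acid.
    """
    assignments = list(assignments)
    acids_at = defaultdict(set)
    for point, aa in assignments:
        acids_at[point].add(aa)
    total = defaultdict(int)
    pure = defaultdict(int)
    for point, aa in assignments:
        total[aa] += 1
        if len(acids_at[point]) == 1:
            pure[aa] += 1
    return {aa: 100.0 * pure[aa] / total[aa] for aa in total}
```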

Fig. 1B marks lattice points containing tDNAs of individual amino acids separately on

Penta (for other amino acids, including selenocysteine, see Supplementary Fig. S2). tDNAs

for one amino acid form one or a few major territories, as well as many tiny satellite-type

spots. It should be noted that the major territory located at the bottom of the Met panel is

composed solely of initiator Met tDNAs, but the upper major territory is composed of both

elongator Met tDNAs and Ile tDNAs containing the anticodon CAT, which is enzymatically

modified to read the ATA codon. When considering the biological significance of minor

territories and tiny satellites, the number of tDNAs in each lattice point is important. Hence

the vertical bar in Fig. 1C presents the number of Gly tDNAs (for other amino acids, see

Supplementary Fig. S2). Lattice points in two major Gly territories apparently contain many

tDNAs, and one tiny satellite located away from major territories also has multiple tDNAs

(arrowed in Fig. 1C), which represent 34 Gly-GCC tDNAs that have two base differences and

are derived from six Chlamydiae species listed in the Figure Legend. Similar satellite-type peaks

are observed for other amino acids (Supplementary Fig. S2) and represent isoacceptors of

various species primarily belonging to one phylogenetic family, which often differ in

sequence by only a few bases; examples of multiple alignments of their sequences are presented

along with their phylotypes in Supplementary Fig. S3; Arg-CGC for four Borrelia species,

Asn-GTT for six Thermotogae species. These types of noncanonical tDNAs should be

candidates for molecular phylogenetic markers representing a specific phylotype.

BLSOM clusters tDNAs according to amino acid, solely depending on oligonucleotide

composition, and visualizes major, minor, and noncanonical rare tDNAs. BLSOM is an

unsupervised clustering algorithm and allows us to explore causative factors responsible for

the amino acid-dependent clustering (self-organization) and to compare causative factors

pointed out by BLSOM with known molecular mechanisms experimentally proven for a

limited number of model organisms, such as the mechanisms reviewed by Marck and

Grosjean (2002). Importantly, we can address the following well-timed questions in the era of

big sequence data accumulation: to what range of phylotypes can a certain known molecular

mechanism (e.g., sequence motifs recognized by an aaRS) be applied, and what types of

alternative mechanisms can be expected for other phylotypes? Since few experimental studies

have been conducted for most recently sequenced genomes, this in silico

characterization has become increasingly important, and BLSOM has powerful visualization

functions useful for addressing these questions.

Visualization of combinatorial occurrences of functionally important oligonucleotides

Functionally and structurally important sequences in tRNAs have been stably maintained

throughout evolution. Therefore, a wide range of species share very closely related motifs,

while sequences outside the motifs have differentiated significantly. For example, to identify

cognate isoacceptors from a pool of tRNAs sharing a similar L-shaped structure, aaRS

recognizes a relatively small number of nucleotides as RNA code (Schimmel et al., 1993) and

identity element (Normanly and Abelson, 1989; Ibba and Söll, 2000; Ardell, 2010).

Oligonucleotide sequences maintained stably in a large number of isoacceptors from a wide

range of bacteria should be responsible and diagnostic for their amino acid-dependent

clustering (self-organization) on BLSOM. This prediction has been proven by using the

BLSOM capability to visualize diagnostic oligonucleotides contributing to self-organization,

as explained in MATERIALS AND METHODS. Red and blue in Fig. 1D and E show the

presence and absence of the respective pentanucleotide on Penta in Fig. 1A. Transitions

between red and blue for various pentanucleotides often coincide with borders between

territories of different amino acids, and Fig. 1D shows three examples of pentanucleotides

observed mainly in major territories of one amino acid and most likely related to identity

elements. The major red zone of ATAGA corresponds to the major territory of Arg in Fig. 1B;

this pentanucleotide exists in D-arm in a major portion of bacterial Arg tDNAs. The major

red zones of GATAA and CATAA correspond to major territories of Ile and Met tDNAs

listed in Fig. 1B; these pentanucleotides exist in anticodon-arm in these isoacceptors. In fact,

anticodon- and D(dihydroU)-arms have been reported to contain highly significant identity

elements; e.g., Ardell (2010) has systematically compiled identity determinants in

Proteobacteria. While a major type of identity element for one amino acid has been well

conserved in sequence throughout evolution, mechanisms for aaRS to recognize cognate

tRNAs seem to have diverged at a certain level, even among bacterial species (Marck and

Grosjean, 2002; Ardell, 2010). tDNAs with minor, noncanonical identity elements can be

detected by identifying tDNAs located outside red zones of the pentanucleotide representing

a canonical identity element. Such tDNAs should become phylogenetic markers for a

restricted phylogenetic lineage, and real examples will be mentioned later.

We next examine canonical-type oligonucleotides present in a large portion of bacterial

tDNAs. For example, the functionally important and well-conserved heptanucleotide

GGTTCGA in the TΨ(pseudoU)C-arm (abbreviated as T-arm) is observed for a large

majority of bacterial tDNAs (Marck and Grosjean, 2002; Ardell, 2010) and therefore, three

constituent pentanucleotides of this heptanucleotide are observed in a major portion of lattice

points on Penta, as colored in red in Fig. 1E (TTCGA); the two other pentanucleotides give

similar (but not identical) results. Small blue areas contain tDNAs that lack the consensus

sequence. Indeed, the aforementioned Chlamydophila Gly-GCC tDNAs (arrowed in Fig.

1C) differ from the canonical heptanucleotide in T-arm at two bases and thus are located in a

blue zone (arrowed in Fig. 1E) for all three constituent pentanucleotides. This is one reason

that Chlamydophila Gly-GCC tDNAs form a satellite, odd peak apart from the Gly major

territory in Fig. 1C and shows that tDNAs with noncanonical sequences in T-arm will

become one type of candidate for phylogenetic markers. The occurrence levels of the three

pentanucleotides for each amino acid are presented in Supplementary Fig. S4. These

pentanucleotides are observed in more than 70% of bacterial tDNAs of Ala, Arg, Asn, Ile,

Lys, Met, Phe, Thr, and Val, but in almost no tDNAs of Cys, Gln, Glu, Leu, Ser, and Tyr. For

Asp, Gly, His, Pro, and Trp, a portion of tDNAs have these pentanucleotides. Some tDNAs

belonging to the last five amino acids may become candidates for phylogenetic markers after

clarifying their existence/non-existence according to the phylogenetic group.

BLSOM specialized for tDNA research: Parts-BLSOM

Tri, Tetra, and Penta in Fig. 1 can cluster amino acid-specific tDNAs with no information

other than oligonucleotide composition and predict functionally and/or structurally important

motifs, as exemplified in Fig. 1D and E. This is a favorable feature of unsupervised data

mining. These BLSOMs, however, do not take into consideration the fundamental

characteristic of tRNAs that functionally and structurally important motifs exist in a specific

part of the molecule. Thus, an oligonucleotide that happens to have the same sequence as a

functional motif (e.g., identity element) but is located outside the functional site, cannot be

discriminated from the real functional element. Actually, even outside the characteristic

territories for one amino acid, there are small red zones for the pentanucleotide related to the

identity element of the amino acid (Fig. 1D), and these pentanucleotides have often been

found outside the identity element in noncognate tRNAs. We next add the following

information about tRNA molecules other than oligonucleotide composition and examine the

usefulness of this change.

Here, we construct a new BLSOM, in which oligonucleotides found in six different

structural parts are differentially counted. Tri- and tetranucleotide BLSOMs with 384 (= 64 x

6) and 1536 (= 256 x 6) variables have been constructed for bacterial tDNAs, as described in

Materials and Methods; PartsTri and PartsTetra, respectively. Table 1 shows that the level of

amino acid-dependent clustering on PartsTetra (Fig. 2A) and PartsTri (data not shown) is

slightly higher than on the previous Penta for most amino acids. Fig. 2B shows that amino

acid-dependent clustering is simpler and minor territories and satellite spots are less evident

than in Fig. 1B and Supplementary Fig. S2B, supporting the view that the level of amino

acid-dependent self-organization has increased in Parts-BLSOMs (Table 1). Similar results have

been obtained for PartsTri (Supplementary Fig. S5). Fig. 2C presents the 3D view of the

number of Arg or Met tDNAs in each lattice point on PartsTri. The merit of Parts-BLSOM is

not only the slight increase in amino acid-dependent clustering; it also provides the

following strategies for instructive knowledge discovery. The U-matrix result shown in Fig.

2D will be explained later.

Strategies for revealing functionally and/or structurally important domains

Functionally important sequences, such as identity elements, have been experimentally

proven to differ often in sequence location among tRNAs belonging to one isoacceptor group

and additionally among phylogenetic groups (Hou and Schimmel, 1988; McClain and Foss,

1988; Marck and Grosjean, 2002; Ardell, 2010). For an in silico prediction of their locations,

we have constructed PartsTri and PartsTetra, in which one of six functional parts is omitted.

The percentages of pure lattice points (i.e., lattice points containing tDNAs only of one amino

acid) on PartsTetra are listed (Fig. 3A). When omitting the anticodon-arm, the separation

level decreases significantly (< 75%) for His, Lys, and Trp, and less significantly for Arg, Asn,

Gln, Ile, Met, Phe, Pro, Thr, and Val. This reduction should reflect the contribution level of

each part for amino acid-dependent clustering and is consistent with locations of identity

elements experimentally proven for various model organisms (Normanly and Abelson, 1989;

Ibba and Söll, 2000; Marck and Grosjean, 2002; Ardell, 2010).
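
Omitting one structural part amounts to deleting that part's block of variables from each Parts vector before the map is constructed. A minimal sketch follows, assuming the six blocks are stored contiguously and in a fixed order; the function name omit_part is hypothetical.

```python
def omit_part(vec, part_index, n_parts=6):
    """Remove the block of variables for one structural part from a Parts vector.

    Assumes the vector is n_parts equal-sized blocks laid end to end.
    """
    block = len(vec) // n_parts
    if len(vec) != block * n_parts:
        raise ValueError("vector length is not a multiple of n_parts")
    if not 0 <= part_index < n_parts:
        raise ValueError("part_index out of range")
    return vec[:part_index * block] + vec[(part_index + 1) * block:]
```

For PartsTetra this reduces each vector from 1536 to 1280 variables, and the drop in the percentage of pure lattice points then indicates how much the omitted part contributed.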

An alternative analysis is tri- and tetranucleotide BLSOM (Tri and Tetra) constructed

separately for each part. Fig. 3B lists the percentages of pure lattice points for each amino

acid on Tetra for each part. The anticodon-arm gives a good separation for all amino acids (>

90%), but D-arm gives a good separation (> 90%) only for Leu, Ser, and Tyr, and a

significant level of separation (> 40%) for Arg, Asp, Cys, Glu, and Pro; T-arm gives a

significant level (> 40%) for Asn, His, and Pro. These contribution levels are again consistent

with the results for identity elements experimentally proven for model bacteria. V-arm gives

a good separation (> 80%) only for Leu, Ser, and Tyr, showing that not only the sequence,

but also the size of each part should significantly affect their amino acid-dependent clustering.

The reason why both the D- and V-arms contribute strongly to this separation for the class

II bacterial tRNAs (Leu, Ser, and Tyr) should relate to the finding that the D-loop plays a key

role in recognizing cognate tRNAs among the class II tRNAs (Asahara et al., 1993, 1998).

BLSOM for species-known plus species-unknown tDNAs

A major source for finding tDNAs is the massive number of metagenomic sequences derived

from a wide range of environmental and clinical samples, which have been compiled in

the International Nucleotide Sequence Database Collaboration (INSDC: DDBJ/ENA/NCBI). Since metagenomic sequences

have attracted broad scientific, industrial, and medical interest, tRNADB-CE has included

tDNAs obtained from metagenomic sequences (abbreviated as metagenomic tDNAs); and

these metagenomic tDNAs are analyzed here with BLSOM. Since metagenomic sequences

should be derived not only from bacteria, but also archaea and fungi, we have constructed

PartsTetra with species-unknown metagenomic tDNAs plus species-known bacterial,

archaeal, and fungal tDNAs; 0.6 million tDNAs in total (Both in Fig. 4A).

Species-unknown metagenomic and species-known microbial tDNAs are visualized

separately in Metagenome and Known in Fig. 4A. Amino acid-dependent clustering is

apparent, but separation patterns are more complex than those for species-known bacterial

tDNAs shown in Fig. 1A, and more black lattice points appear in Fig. 4A than in Fig. 1A. A major

portion of black lattice points are observed for metagenomic tDNAs (Metagenome in Fig. 4A)

while a minor portion is observed also for species-known tDNAs (Known in Fig. 4A). Fig.

4B separately marks lattice points containing metagenomic and species-known tDNAs of Ala

and Asn (for other amino acids, see Supplementary Fig. S5). Detailed inspection of the

species-known tDNAs belonging to black lattice points in Fig. 4A has revealed these to be

primarily archaeal and fungal tDNAs, showing that BLSOM has separated archaeal and

fungal tDNAs from bacterial tDNAs and that a significant fraction of metagenomic tDNAs should be

derived from archaea and fungi. The observation that a large portion of archaeal and fungal

tDNAs are located in black lattice points indicates that their self-organization depends largely

on their sequence characteristics distinct from bacterial tDNAs, rather than distinctions

between amino acids. To study amino acid-dependent clustering of archaeal and fungal

tDNAs, BLSOMs have to be constructed only for archaeal and fungal tDNAs.

The vertical bar in Fig. 4C presents the number of metagenomic and species-known

tDNAs of Ala and Asn. Fig. 4C shows, more clearly than Fig. 4B, that the metagenomic

tDNAs located apart from the major territories and primarily representing archaeal and fungal

tDNAs are more abundant than species-known tDNAs. Furthermore, locations of very high

peaks in the major territories differ between metagenomic and species-known tDNAs. This

can be more clearly shown by the following analysis on tDNAs of each amino acid.

BLSOM for each amino acid

Dick et al. (2009) have successfully applied the U-matrix method (Ultsch, 1993) of an

oligonucleotide-SOM to the phylogenetic clustering of environmental metagenomic

sequences. The U-matrix shown in Fig. 2D visualizes the dissimilarity level of oligonucleotide

composition between neighbouring lattice points as a blackness level; dark black lines turn

out to correspond primarily to borders between different amino-acid territories, showing clear

dissimilarity of oligonucleotide compositions in tDNAs corresponding to different amino acids.

In addition, even within a major territory of one amino acid, many partitions surrounded by

pale black lines are observed and appear to be primarily attributable to phylogenetic

differences. To investigate the phylotype-dependent separation in more detail, we have

constructed a BLSOM for each amino acid for species-known plus species-unknown tDNAs

(Fig. 5). On the All panel in Fig. 5A, lattice points containing Leu tDNAs from only bacterial,

archaeal, fungal, or metagenomic sequences are colored in blue, red, green, or gray,

respectively; those containing tDNAs from more than one category are marked in black. On

the Bacteria, Archaea, and Fungi panels, lattice points containing metagenomic tDNAs (gray)

plus bacterial, archaeal, or fungal tDNAs are separately colored, as described for the All

panel. A large portion of lattice points on the Bacteria panel are marked in black, showing

that many metagenomic tDNAs are clustered (self-organized) with known bacterial tDNAs,

predicting phylogenetic attribution of metagenomic tDNAs. On the Bacteria panel, some

clear gray contiguous areas contain metagenomic (but not bacterial) tDNAs, and some gray

areas contain archaeal and fungal tDNAs (black on the Archaea or Fungi panel), providing

phylogenetic attribution of these metagenomic tDNAs. On the U-matrix panel in Fig. 5A,

many white or pale black areas are surrounded by dark black circles. White and pale black on

the U-matrix show similar oligonucleotide compositions between neighbouring lattice points,

i.e., between tDNAs located in neighbouring lattice points. Therefore, metagenomic tDNAs

within a white and pale black zone surrounded by a dark black circle can be phylogenetically

assigned by referring to species-known tDNAs colocalizing in this zone, as described by Dick

et al. (2009). A significant portion of metagenomic tDNAs is also located apart from species-

known tDNAs (gray contiguous areas in the All panel), showing these tDNAs to be derived

mainly from poorly studied genomes and thus pointing out the environmental sources

harbouring novel genomes.
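
The blackness level of a U-matrix can be computed from the weight vectors alone. The sketch below uses the mean Euclidean distance to the four directly adjacent lattice points; the choice of neighbourhood and of the mean are assumptions made for this illustration, and the function name u_matrix is hypothetical.

```python
import math


def u_matrix(weights):
    """Mean Euclidean distance from each lattice point to its adjacent points.

    weights[r][c] is the weight vector of lattice point (r, c). Large values
    (dark on the map) mark borders between dissimilar regions.
    """
    rows, cols = len(weights), len(weights[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            out[r][c] = sum(dists) / len(dists) if dists else 0.0
    return out
```

Zones of low values enclosed by high values correspond to the white or pale black areas surrounded by dark black circles described above.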

The vertical bar in Fig. 5B presents the number of species-known and metagenomic

tDNAs of three amino acids. Locations of even very high peaks differ between species-

known and metagenomic tDNAs. Then, we focus on very high peaks observed for

metagenomic tDNAs. In the case of Ala, the highest peak (marked as No. 1) contains 1205

metagenomic tDNAs primarily obtained from marine samples plus two tDNAs of Candidatus

Pelagibacter (an oceanic carbon-recycling bacterium); peak No. 2 contains 171 tDNAs

primarily obtained from hot spring samples plus 2 tDNAs of Synechococcus sp. JA-3-3Ab

(Cyanobacteria bacterium Yellowstone A-Prime); and peak No. 3 contains 159 marine

metagenomic tDNAs but no species-known tDNAs. In the case of Pro, No. 1 contains 244

marine metagenomic tDNAs plus 1 Candidatus tDNA; No. 2 contains 152 marine

metagenomic tDNAs plus 9 Chlorobi tDNAs; and No. 3 contains 287 marine tDNAs but no

species-known tDNAs. In the case of Leu, No. 1 contains 196 marine metagenomic tDNAs

plus 4 Chlorobi tDNAs; Nos. 2 and 3 contain 316 and 309 marine metagenomic tDNAs, respectively, but

no species-known tDNAs. Results for each amino acid can provide phylogenetic information

of metagenomic sequences and assign novel tDNAs with new types of sequence

characteristics. Importantly, tDNAs that form peaks composed of multiple tDNAs should be

reliable tDNAs even though these have noncanonical characteristics. This can specify a large

number of new types of tDNAs, which have been poorly characterized. The analysis of big

sequence data, such as those obtained from metagenomic samples, can provide this type of

novel information.
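The peak inspection described above can be sketched as a simple count per lattice point. The input format (one `(lattice_point, source)` pair per tDNA, with `source` being `"meta"` or `"known"`), the function name `high_metagenomic_peaks`, and the threshold `min_count` are hypothetical choices made for illustration only.

```python
from collections import Counter

def high_metagenomic_peaks(assignments, min_count=100):
    """List lattice points holding at least min_count metagenomic tDNAs.

    assignments: iterable of (lattice_point, source) pairs, where source is
                 "meta" for metagenomic tDNAs and "known" for species-known ones.
    Returns (lattice_point, n_metagenomic, n_species_known) tuples,
    highest metagenomic peak first.
    """
    meta, known = Counter(), Counter()
    for point, source in assignments:
        if source == "meta":
            meta[point] += 1
        else:
            known[point] += 1
    peaks = [(p, n, known[p]) for p, n in meta.items() if n >= min_count]
    return sorted(peaks, key=lambda t: -t[1])
```

Peaks whose third element is zero correspond to cases such as peak No. 3 for Ala, which contains metagenomic tDNAs but no species-known tDNAs.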

DISCUSSION

Characteristics of unsupervised data mining

The present in silico findings obtained from more than 7000 bacterial genomes, for most of

which molecular studies other than DNA sequencing have been poorly done, can be

connected with molecular mechanisms that have been experimentally proven for a limited

number of model organisms; e.g., those reviewed by Marck and Grosjean (2002). In their

review, 50 genomes were selected to avoid an overrepresentation of phylogenetically too

closely related organisms and to span the widest range of living areas; over 4000 tDNAs were

extracted, analyzed, and compared. In our study, as is typical for big data analyses, we have

used all available bacterial data without particular filtration processes and thus included

genomes of closely related bacteria, including those of different strains of one species; in

total, 0.4 million tDNAs from more than 7000 bacterial genomes. These two distinct analyses

are complementary to each other, and BLSOM can clarify what range of species the

experimentally proven mechanisms are applicable to and can point out inapplicable

phylotypes.

Phylogenetic markers useful for short metagenomic sequences

When searching for a certain genome of particular interest (e.g., for industrial usability) by

surveying a massive number of short metagenomic sequences, phylogenetic marker tDNAs

should become very useful because of their short length. Our group has started to search for

tDNAs that are useful as phylogenetic markers, especially for rare genomes, and has planned

to publish such markers in tRNADB-CE. The present study shows that the BLSOM with

species-known tDNAs plus species-unknown metagenomic tDNAs (Fig. 4) can provide a tool

for studying a microbial community in an ecosystem. When analysing a dataset composed

mainly of sequences shorter than 100 bp, this strategy is useful since conventional

phylogenetic tree methods cannot be properly applied to most short sequences; i.e., it is

impossible to construct reliable phylogenetic trees for most of these short sequences. If the

dataset is composed mainly of sequences longer than 500 bp, BLSOMs with tri- and

tetranucleotide compositions in all genomic fragments should be more suitable than tDNA-

BLSOM, because all genomic sequences are informative (Abe et al., 2005; Nakao et al., 2013).

It should also be mentioned that horizontal gene transfers between different species are a

general characteristic of microbial genomes. Therefore, we may not find phylogenetic

markers with 100% accuracy, because informatics methods, including sequence homology

searches, most likely assign the horizontally transferred genes to the donor, but not the

recipient genome. When noncanonical tDNAs are found concurrently in restricted members

of phylogenetically distant groups (e.g., different classes and families), the genes may

represent horizontally transferred genes or products of convergent evolution. Phylogenetic

marker tDNAs must be used, taking these into consideration.
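The oligonucleotide compositions on which the BLSOMs in this section rely can be sketched as a vector of k-mer frequencies. The sketch below assumes single-strand counting over overlapping windows and skips windows containing ambiguous bases; the function name `oligo_composition` and these details are illustrative assumptions rather than the procedure used for the published maps.

```python
from itertools import product

def oligo_composition(seq, k=4):
    """Return the frequency vector of all 4**k oligonucleotides in seq.

    The vector is ordered alphabetically (AAAA, AAAC, ..., TTTT for k=4).
    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]
```

With `k=4` a sequence is represented by 256 values, and with `k=3` by 64 values, which is why even a tDNA of fewer than 100 bp can be placed on a map without constructing a phylogenetic tree.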

Strategies for enhancing the accuracy of a large-scale tRNA database

As described in Materials and Methods, to enhance accuracy in compiling tDNAs in

tRNADB-CE, three computer programs have been used in combination, since their

algorithms partially differ and render somewhat different results (Abe et al., 2011). For tDNA

candidates predicted by only one or two computer programs, experts in experimental tRNA

research have checked them manually. Searching for the minimum anticodon set for a

completely sequenced genome (Osawa, 1995; Marck and Grosjean, 2002) is an important

check process (Abe et al., 2011). Another manual check applicable even to partially

sequenced genomes is to examine whether the candidates have been found iteratively in

closely related species; this process has become increasingly useful because the genomes of

many closely related species and even of different strains belonging to one species have been

sequenced. When the same or almost the same noncanonical sequences have been found

repeatedly, the tDNAs are included in the Reliable tRNA category (Abe et al., 2011), based

on the knowledge that functionally important genes have been stably maintained throughout

evolution. Furthermore, accumulation of a large number of metagenomic sequences has

progressively increased the reliability of this verification strategy, which points to a promising

characteristic of big data.
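The compilation rule described above can be summarised in a small sketch. The function name `triage_candidate`, the category labels, and the recurrence threshold `min_recurrence` are hypothetical; the actual curation in tRNADB-CE involves expert judgement that this sketch does not attempt to reproduce.

```python
def triage_candidate(predicted_by, recurrences=0, min_recurrence=2):
    """Suggest how a tDNA candidate might be handled.

    predicted_by: names of the prediction programs that reported the candidate.
    recurrences:  number of closely related genomes in which the same or almost
                  the same sequence was found (assumed to be known in advance).
    """
    n_programs = len(set(predicted_by))
    if n_programs == 0:
        return "not a candidate"
    if n_programs >= 3:
        # Concordant prediction: stored after a brief anticodon check.
        return "store after anticodon check"
    if recurrences >= min_recurrence:
        # Repeatedly found in related genomes: a point in favour during review.
        return "manual check (recurrent in related genomes)"
    return "manual check"
```

The placeholder program names used when calling this function are arbitrary; only the number of distinct programs matters in the sketch.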

For creating a large-scale and high-quality database, it becomes important to find errors

that have slipped into the database, including those caused by DNA sequencing errors. As

mentioned above, tDNAs found concordantly by all three programs have been stored in

tRNADB-CE after brief anticodon checking. While this automatic compilation is

indispensable for surveying a huge number of genomic sequences accumulated in INSDC, a

new strategy for identifying erroneous cases is required for quality enhancement. Orphan

tDNAs located apart from correspondent amino acid territories on BLSOM may be

candidates for erroneous cases, because the present data mining method has pointed out their

sequence irregularity. Such cases should be manually checked by experts, even though three

computer programs have concordantly assigned them. In contrast, if tiny spots apart from

their correspondent major territories harbour multiple tDNAs (e.g., peaks in Fig. 1C, 2C, 4C,

and 5B), especially derived from phylogenetically related species, they should represent real

tDNAs, even though they have noncanonical sequence characteristics. These tDNAs will

become phylogenetic markers with high specificity for the respective phylotypes. In

tRNADB-CE, such tDNAs will be noted in a column that has been provided for comments on

each tDNA (Abe et al., 2011).
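The distinction drawn above between orphan tDNAs and satellite spots harbouring multiple tDNAs can be sketched as follows. Here "apart from the major territory" is approximated, as an assumption, by a lattice point none of whose eight surrounding lattice points contains a tDNA of the same amino acid; the function name `isolated_points` and the threshold `min_multi` are likewise hypothetical.

```python
def isolated_points(counts, amino_acid, min_multi=2):
    """Classify isolated lattice points for one amino acid.

    counts: dict mapping (row, col) -> dict of amino acid -> number of tDNAs.
    Returns {"orphan": [...], "satellite": [...]} with (point, n_tDNAs) entries:
    orphans are candidates for erroneous cases, satellites for real
    noncanonical tDNAs.
    """
    occupied = {p for p, c in counts.items() if c.get(amino_acid, 0) > 0}
    result = {"orphan": [], "satellite": []}
    for r, c in sorted(occupied):
        neighbours = {(r + dr, c + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if (dr, dc) != (0, 0)}
        if neighbours & occupied:
            continue  # part of a contiguous territory
        n = counts[(r, c)][amino_acid]
        key = "satellite" if n >= min_multi else "orphan"
        result[key].append(((r, c), n))
    return result
```

Points returned under `"orphan"` would be the ones to pass to experts for manual checking, whereas those under `"satellite"` would be candidates for phylotype-specific markers.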

When constructing tRNADB-CE, we have encountered a significant number of cases where

the three programs predicted different functional segmentation for one tDNA candidate although

the programs concordantly assign it as tDNA. This may indicate a demerit of Parts-BLSOM

because it requires the information concerning the functional segmentation. When analysing

metagenomic sequences, which should include various novel genomes, the ordinary BLSOM

may be useful because this does not require this prior information. Undoubtedly, their

combinatorial use should be a better choice, and this will predict the proper functional

segmentation of the tDNA.

CONCLUSION

Unsupervised data mining, which can extract a wide range of knowledge from big data

without prior knowledge or particular models, is well-timed in the era of big sequence data

accumulation in genome research. Importantly, the unsupervised data mining such as

BLSOM can provide the least expected knowledge. In addition, to gain a wide range of

knowledge efficiently from big data, it is important to view all data simultaneously on one

map and to focus on a specific data category by using strong visualization powers.

Oligonucleotide BLSOM, which can analyze more than ten million sequences at once, is

suitable for unveiling novel knowledge hidden within big sequence data, providing a timely

tool for a wide range of genome research, which has been prompted by the remarkable

progress of high-throughput sequencing technology.

CONFLICT OF INTERESTS

The authors declare that there is no conflict of interests regarding the publication of this paper.

ACKNOWLEDGEMENT

We thank Drs. Hachiro Inokuchi and Yuko Yamada in Nagahama Institute of Bio-Science

and Technology and Dr. Akira Muto in Hirosaki University for valuable discussions. We also

thank Mr. Yudai Miyake in Nagahama Institute of Bio-Science and Technology for

preliminary BLSOM analyses of tDNAs. This work was supported by a Grant-in-Aid for

Scientific Research (C: no. 23500371 and 26330334) and a Grant-in-Aid for Publication of

Scientific Research Results (no. 15HP7006), from the Ministry of Education, Culture, Sports,

Science and Technology, Japan, and by a NIG Collaborative Research Program (A).

REFERENCES

Abe, T., Ikemura, T., Sugahara, J., Kanai, A., Ohara, Y., Uehara, H., Kinouchi, M., Kanaya, S., Yamada, Y., Muto, A., et al. (2011) tRNADB-CE 2011: tRNA gene database curated manually by experts. Nucleic Acids Res. 39, D210-D213.

Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T., and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res. 13, 693-702.

Abe, T., Sugawara, H., Kinouchi, M., Kanaya, S., and Ikemura, T. (2005) Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 12, 281-290.

Ardell, D.H. (2010) Computational analysis of tRNA identity. FEBS Lett. 584, 325-333.

Asahara, H., Himeno, H., Tamura, K., Hasegawa, T., Watanabe, K., and Shimizu, M. (1993) Recognition nucleotides of Escherichia coli tRNALeu and its elements facilitating discrimination from tRNASer and tRNATyr. J. Mol. Biol. 231, 219-229.

Asahara, H., Nameki, N., and Hasegawa, T. (1998) In vitro selection of RNAs aminoacylated by Escherichia coli leucyl-tRNA synthetase. J. Mol. Biol. 283, 605-618.

Bermudez-Santana, C., Attolini, C.S., Kirsten, T., Engelhardt, J., Prohaska, S.J., Steigele, S., and Stadler, P.F. (2010) Genomic organization of eukaryotic tRNAs. BMC Genomics 11, 270.

Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991) Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Netw. 4, 759-771.

Chan, P.P., and Lowe, T.M. (2009) GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res. 37, D93-D97.

Dick, G.J., Andersson, A.F., Baker, B.J., Simmons, S.L., Thomas, B.C., Yelton, A.P., and Banfield, J.F. (2009) Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85.

Findeiss, S., Langenberger, D., Stadler, P.F., and Hoffmann, S. (2011) Traces of post-transcriptional RNA modifications in deep sequencing data. Biol. Chem. 392, 305-313.

Forgy, E.W. (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768-769.

Hastie, T., Tibshirani, R., and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). Springer-Verlag, New York.

Hou, Y.M., and Schimmel, P. (1988) A simple structural feature is a major determinant of the identity of a transfer RNA. Nature 333, 140-145.

Ibba, M., and Söll, D. (2000) Aminoacyl-tRNA synthesis. Annu. Rev. Biochem. 69, 617-650.

Iwasaki, Y., Wada, K., Wada, Y., Abe, T., and Ikemura, T. (2013) Notable clustering of transcription-factor-binding motifs in human pericentric regions and its biological significance. Chromosome Res. 21, 461-474.

Jühling, F., Mörl, M., Hartmann, R.K., Sprinzl, M., Stadler, P.F., and Pütz, J. (2009) tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 37, D159-D162.

Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H., and Ikemura, T. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM) - characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 276, 89-99.

Kikuchi, A., Ikemura, T., and Abe, T. (2015) Development of self-compressing BLSOM for comprehensive analysis of big sequence data. Biomed Res. Int. 2015, 506052.

Kinouchi, M., and Kurokawa, K. (2006) tRNAfinder: A software system to find all tRNA genes in the DNA sequence based on the cloverleaf secondary structure. J. Comput. Aided Chem. 7, 116-126.

Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.

Kohonen, T., Oja, E., Simula, O., Visa, A., and Kangas, J. (1996) Engineering applications of the self-organizing map. Proc. IEEE 84, 1358-1384.

Laslett, D., and Canback, B. (2004) ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32, 11-16.

Lowe, T.M., and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964.

Marck, C., and Grosjean, H. (2002) tRNomics: Analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and Bacteria reveals anticodon-sparing strategies and domain-specific features. RNA 8, 1189-1232.

McClain, W.H., and Foss, F. (1988) Changing the identity of a tRNA by introducing a G-U wobble pair near the 3' acceptor end. Science 240, 793-796.

Nakao, R., Abe, T., Nijhof, A.M., Yamamoto, S., Jongejan, F., Ikemura, T., and Sugimoto, C. (2013) A novel approach, based on BLSOMs (Batch Learning Self-Organizing Maps), to the microbiome analysis of ticks. ISME J. 7, 1003-1015.

Normanly, J., and Abelson, J. (1989) tRNA identity. Annu. Rev. Biochem. 58, 1029-1049.

Osawa, S. (1995) Evolution of the Genetic Code. Oxford University Press, New York.

Schimmel, P., Giegé, R., Moras, D., and Yokoyama, S. (1993) An operational RNA code for amino acids and possible relationship to genetic code. Proc. Natl. Acad. Sci. USA 90, 8763-8768.

Sprinzl, M., Grüter, F., and Gauss, D.H. (1978) Collection of published tRNA sequences. Nucleic Acids Res. 5, r15-r27.

Sprinzl, M., and Vassilenko, K.S. (2005) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 33, D139-D140.

Uchiyama, T., Abe, T., Ikemura, T., and Watanabe, K. (2005) Substrate-induced gene-expression screening of environmental metagenome libraries for isolation of catabolic genes. Nat. Biotechnol. 23, 88-93.

Uehara, H., Iwasaki, Y., Wada, C., Ikemura, T., and Abe, T. (2011) A novel bioinformatics strategy for searching industrially useful genome resources from metagenomic sequence libraries. Genes Genet. Syst. 86, 53-66.

Ultsch, A. (1993) Self organized feature maps for monitoring and knowledge acquisition of a chemical process. In Proc. ICANN'93 Int. Conf. on Artificial Neural Networks (eds.: S. Gielen and B. Kappen) pp. 864-867. Springer, London.

Weiner, A.M., and Maizels, N. (1987) tRNA-like structures tag the 3' ends of genomic RNA molecules for replication: implications for the origin of protein synthesis. Proc. Natl. Acad. Sci. USA 84, 7383-7387.

Table 1. The percentages of tDNAs located at colored pure lattice points on BLSOMs

       Tri     Tetra   Penta   PartsTri  PartsTetra
Ala    0.900   0.965   0.958   0.991     0.984
Arg    0.769   0.930   0.956   0.969     0.971
Asn    0.831   0.938   0.965   0.980     0.994
Asp    0.943   0.985   0.988   0.986     0.990
Cys    0.813   0.945   0.951   0.967     0.982
Gln    0.863   0.946   0.945   0.912     0.967
Glu    0.945   0.979   0.975   0.983     0.983
Gly    0.892   0.967   0.981   0.983     0.988
His    0.788   0.939   0.910   0.955     0.957
Ile    0.888   0.975   0.938   0.951     0.975
Leu    0.884   0.973   0.984   0.987     0.983
Lys    0.773   0.929   0.959   0.958     0.971
Met    0.864   0.958   0.956   0.966     0.981
Phe    0.833   0.983   0.983   0.993     0.989
Pro    0.818   0.935   0.958   0.979     0.989
Ser    0.925   0.952   0.963   0.983     0.987
Thr    0.782   0.918   0.901   0.969     0.976
Trp    0.827   0.896   0.917   0.908     0.919
Tyr    0.907   0.979   0.988   0.988     0.988
Val    0.864   0.949   0.941   0.974     0.972
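The quantity tabulated in Table 1 can be illustrated with a short sketch. It assumes that a "pure" lattice point is one containing tDNAs of a single amino acid (as in the legend to Fig. 1A) and that the tabulated value for an amino acid is the share of its tDNAs located at such points; the function name `pure_fraction` and the input format are hypothetical.

```python
from collections import defaultdict

def pure_fraction(assignments):
    """Fraction of tDNAs of each amino acid located at pure lattice points.

    assignments: iterable of (lattice_point, amino_acid) pairs, one per tDNA.
    A lattice point is pure when all tDNAs assigned to it share one amino acid.
    """
    assignments = list(assignments)
    acids_at = defaultdict(set)
    for point, aa in assignments:
        acids_at[point].add(aa)
    total = defaultdict(int)
    pure = defaultdict(int)
    for point, aa in assignments:
        total[aa] += 1
        if len(acids_at[point]) == 1:
            pure[aa] += 1
    return {aa: pure[aa] / total[aa] for aa in total}
```

Applied to the lattice assignments of a trained map, a function of this kind would yield one value per amino acid, comparable in form to one column of the table.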

LEGENDS TO FIGURES

Fig. 1. Oligonucleotide-BLSOM for bacterial tDNAs. (A) BLSOM for tri-, tetra-, and

pentanucleotide compositions (Tri-, Tetra-, and Penta). Lattice points containing tDNAs of

multiple amino acids are indicated in black, and those containing tDNAs of a single amino

acid are colored as follows: Ala (■), Arg (■), Asn (■), Asp (■), Cys (■), Gln (■), Glu (■), Gly

(■), His (■), Ile (■), Leu (■), Lys (■), Met (■), Phe (■), Pro (■), Ser (■), Thr (■), Trp (■),

Tyr (■), and Val (■). (B) Lattice points containing tDNAs of individual amino acids on Penta

in Fig. 1A are visualized separately with the color used there. (C) Number of Gly tDNAs in

each lattice point on Penta is presented by the height of the vertical bar. Lattice points

containing multiple tDNAs, but not one or a few tDNAs, turn out to be detectable. A satellite

single bar (arrowed) located between two major Gly territories is composed of 34 Gly-GCC

tDNAs belonging to 6 Chlamydophila species: C. caviae, C. felis, C. muridarum, C.

pneumoniae, and C. trachomatis. (D) Examples of diagnostic pentanucleotides responsible for

amino-acid dependent clustering on Penta in Fig. 1A. The occurrence of each pentanucleotide

for each lattice point was calculated and normalized with occurrence expected from the

mononucleotide composition for the respective lattice point (Abe et al., 2005). This

observed/expected ratio is indicated in color; red (overrepresented), blue (underrepresented),

and blank (intermediate). (E) An example of pentanucleotides, TTCGA, observed for most

bacterial tDNAs. Lattice points are marked as described in B.
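The observed/expected ratio mentioned in the legend to Fig. 1D can be sketched as follows. The sketch assumes that the expected occurrence of a word is the product of the mononucleotide frequencies of its bases multiplied by the number of windows examined, computed over the sequences assigned to one lattice point; the exact normalization of Abe et al. (2005) may differ, and the function name `observed_expected_ratio` is hypothetical.

```python
def observed_expected_ratio(sequences, word):
    """Observed/expected occurrence of `word` in a set of sequences.

    Expected occurrence is derived from the mononucleotide composition of the
    same sequences (an assumption made for this illustration).
    Returns None when the ratio is undefined.
    """
    word = word.upper()
    k = len(word)
    base_counts = {b: 0 for b in "ACGT"}
    observed = 0
    windows = 0
    for seq in sequences:
        seq = seq.upper()
        for b in seq:
            if b in base_counts:
                base_counts[b] += 1
        for i in range(len(seq) - k + 1):
            windows += 1
            if seq[i:i + k] == word:
                observed += 1
    n_bases = sum(base_counts.values())
    if n_bases == 0 or windows == 0:
        return None
    probability = 1.0
    for b in word:
        probability *= base_counts[b] / n_bases
    expected = probability * windows
    return observed / expected if expected else None
```

Ratios well above 1 would be drawn in red (overrepresented) and ratios well below 1 in blue (underrepresented) under the colour scheme of the legend; the cut-off values are not specified here.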

Fig. 2. Parts-BLSOM for bacterial tDNAs. (A) PartsTetra. Lattice points are marked as

described in Fig. 1A. (B) Lattice points containing tDNAs of individual amino acids on

PartsTetra are marked as described in Fig. 1B. On the Met panel, initiator Met tDNAs are

mostly located at the lower left part; elongator Met tDNAs and Ile tDNAs containing

anticodon CAT in DNA sequence are not separated from each other. (C) 3D-view. Number

of Arg and Met tDNAs in each lattice point is presented by the height of the vertical bar. (D)

The distances of vectorial data between neighbouring lattice points on PartsTetra are

visualized as blackness levels with a U-matrix method as described by Iwasaki et al. (2013).

Fig. 3. Contribution level of each functional part for amino-acid dependent clustering. (A)

The percentage of pure lattice points for each amino acid on PartsTetra, in which one

functional part is omitted, is presented by a vertical bar colored as follows: no omission (■),

omission of 5’ acceptor (■), D-arm (■), anticodon-arm (■), V-arm (■), T-arm (■), and 3’

acceptor (■). (B) The percentage of pure lattice points on Tetra for all parts and each part is

presented for each amino acid by a vertical bar colored as follows: 5’ acceptor (■), D-arm (■),

anticodon-arm (■), V-arm (■), T-arm (■), and 3’ acceptor (■).

Fig. 4. PartsTetra for metagenomic plus species-known microbial tDNAs. (A) Both; Lattice

points are marked for both types of tDNAs as described in Fig. 1A. Metagenome or Known;

lattice points containing only metagenomic or species-known microbial tDNAs are marked as

described in Fig. 1A. (B) Lattice points containing metagenomic or species-known microbial

tDNAs of Ala and Asn are marked as described in Fig. 1B. (C) 3D-view. Number of

metagenomic and species-known tDNAs of Ala and Asn in each lattice point on PartsTetra is

presented by the height of the vertical bar.

Fig. 5. PartsTetra for metagenomic plus species-known microbial tDNAs of one amino acid.

(A) 2D-view. All; lattice points containing tDNAs derived from only bacterial, archaeal,

fungal, or metagenomic sequences are colored in blue, red, green, or gray, respectively, and

those containing tDNAs from more than one category are marked in black. Bacteria, Archaea,

or Fungi panel; lattice points containing metagenomic tDNAs plus bacterial, archaeal, or

fungal tDNAs are separately colored as described for the All panel. (B) 3D-view. Number of

metagenomic and species-known tDNAs in each lattice point on PartsTetra is presented by

the height of the vertical bar.

Fig. 1. [Image not reproduced. Panel titles: A, Tri, Tetra, and Penta; B to E, Penta. Labels appearing in the panels: Gly, Arg, Ile, Met; ATAGA, CATAA, GATAA; TTCGA (panel E).]

Fig. 2. [Image not reproduced. Panel titles: A, PartsTetra; B, Ala, Asn, Arg, Met, and Pro; C, Arg and Met; D, U-matrix.]

Fig. 3. [Image not reproduced. Panels A and B are bar charts with a vertical axis from 0 to 100 (%) and the 20 amino acids (Ala to Val) on the horizontal axis. Series in A: all parts, and omission of 5' acceptor, D-arm, anticodon-arm, V-arm, T-arm, or 3' acceptor. Series in B: 5' acceptor, D-arm, anticodon-arm, V-arm, T-arm, and 3' acceptor.]

Fig. 4. [Image not reproduced. Panel titles: A, Both, Metagenome, and Known; B and C, Ala and Asn, each shown for Metagenome and Known.]

Fig. 5. [Image not reproduced. Panel A (Leu): All, Bacteria, Archaea, Fungi, and U-matrix. Panel B: Ala, Pro, and Leu, each shown for Meta and Known, with peaks numbered 1 to 3.]

LEGENDS TO SUPPLEMENTARY FIGURES

Supplementary Fig. S1. Secondary cloverleaf structure of tRNAAla (Escherichia coli K-12).

Supplementary Fig. S2. Visualization of lattice points containing tDNAs of one amino acid on Penta. (A) Penta listed in Figure 1A. (B) Lattice points containing tDNAs of one amino acid on Penta are visualized as described in Fig. 2A. (C) Number of tDNAs in each lattice point on Penta is presented for individual amino acids by the height of the vertical bar.

Supplementary Fig. S3. Multiple sequence alignment of tDNAs belonging to a sharp peak in a satellite spot located away from the correspondent major territories. Multiple sequence alignment was conducted with ClustalW supported by DDBJ. (A) An example of a one-base difference. (B) An example of a few base differences. Class/family and species for each tRNA are listed; for details, see tRNADB-CE.

Supplementary Fig. S4. Occurrence level of pentanucleotide of interest in bacterial tDNAs of one amino acid on Penta. Pentanucleotides located in oligonucleotides specified in Fig. 3C are analysed.

Supplementary Fig. S5. Visualization of lattice points containing tDNAs of individual amino acids on PartsTri. Lattice points containing tDNAs of individual amino acids on PartsTri are visualized separately with the colour used in Fig. 1A.

Fig. S1. [Image not reproduced. Labelled parts, from the 5' end to the 3' end: 5' side of acceptor-stem, D-arm, Anticodon-arm, Variable-arm (V-arm), T-arm, and 3' side of acceptor-stem.]

Fig. S2. [Image not reproduced. Panels A to C on Penta. Labels appearing in the panels: Cys, Gln, Glu, Gly, and Sec; Arg, Gly, and Glu (panel C).]

Fig. S3

A

C11102389-Arg-GCG  GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
C11102523-Arg-GCG  GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
C11102556-Arg-GCG  GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
C002902-Arg-GCG    GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
C09101517-Arg-GCG  GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
C11101558-Arg-GCG  GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
C001499-Arg-GCG    GTGTCCATAGCTCAGTTGGATAGAGCGTTAGATTGCGATTCTTAAGGTCGGAGGTTCAAG
                   ************************************************************

C11102389-Arg-GCG  TCCTCTTGGACACGAAA  Spirochaetes, Borrelia bissettii DN127
C11102523-Arg-GCG  TCCTCTTGGACACGAAA  Spirochaetes, Borrelia burgdorferi JD1
C11102556-Arg-GCG  TCCTCTTGGACACGAAA  Spirochaetes, Borrelia burgdorferi N40
C002902-Arg-GCG    TCCTCTTGGACACGAAA  Spirochaetes, Borrelia garinii PBi
C09101517-Arg-GCG  TCCTCTTGGACACGAAA  Spirochaetes, Borrelia burgdorferi ZS7
C11101558-Arg-GCG  TCCTCTTGGACACGGAA  Spirochaetes, Borrelia afzelii PKo
C001499-Arg-GCG    TCCTCTTGGACACGGAA  Spirochaetes, Borrelia afzelii PKo
                   ************** **

B

C08004967-Asn-GTT  TCCGGCGTAGCTCAACCGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C11123141-Asn-GTT  TCCGGCGTAGCTCAACCGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C08004966-Asn-GTT  TCCGGCGTAGCTCAACCGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C08011006-Asn-GTT  TCCGGCGTAGCTCAACCGGTAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C025681-Asn-GTT    TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C09110479-Asn-GTT  TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C09110480-Asn-GTT  TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
C025683-Asn-GTT    TCCGGCGTAGCTCAATAGGCAGAGCGGGTGGCTGTTAACCACTAGGTTGGGGGTTCGAGT
                   ***************  ** ****************************************

C08004967-Asn-GTT  CCCTCCGCCGGAGCCA  Thermotogae, Fervidobacterium nodosum Rt17
C11123141-Asn-GTT  CCCTCCGCCGGAGCCA  Thermotogae, Thermotoga thermarum DSM 5069
C08004966-Asn-GTT  CCCTCCGCCGGAGCCA  Thermotogae, Fervidobacterium nodosum Rt17
C08011006-Asn-GTT  CCCTCCGCCGGAGCCA  Thermotogae, Thermosipho lettingae TMO
C025681-Asn-GTT    CCCTCCGCCGGAGCCA  Thermotogae, Thermosipho melanesiensis BI429
C09110479-Asn-GTT  CCCTCCGCCGGAGCCA  Thermotogae, Thermosipho africanus TCF52B
C09110480-Asn-GTT  CCCTCCGCCGGAGCCA  Thermotogae, Thermosipho africanus TCF52B
C025683-Asn-GTT    CCCTCCGCCGGAGCCA  Thermotogae, Thermosipho melanesiensis BI429
                   ****************

Fig. S4. [Image not reproduced. Bar chart with a vertical axis from 0% to 100% and the 20 amino acids (Ala to Val) on the horizontal axis. Series: GGTTC, GTTCG, TTCGA, TAGCT, AGCTC, and GCTCA.]

Fig. S5. [Image not reproduced. One panel for each of the 20 amino acids: Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, and Val.]