the inherent occurrence of complex intron-rich ... · 1 summary this study shows that an abundance...

12
1 Summary This study shows that an abundance of authentic eukaryotic split genes, with a full complement of complex structures including exons, introns, regulatory and splicing elements, could have occurred by chance within pre-biotic random genetic sequences. Introduction Although the split structure of protein coding genes is fundamental to eukaryotic biology, its ultimate origin is not fully understood (1-3). Recent comparative studies of split-gene structure and the splicing process across phyla have led to several unexpected discoveries, contradicting prevailing assumptions. For example, it is currently thought that split genes existed in a primitive form in the eukaryotic ancestor, and increased in complexity through evolution towards higher animals and plants. In contrast, new findings from interkingdom analyses have shown that the common ancestor of all eukaryotes had complex split gene structures and a highly intron-rich genome as those of modern higher animals and plants (1, 4-8). In addition, the idea that splicing may have been a much simpler mechanism in the eukaryotic ancestor processed by a simple spliceosome now appears incorrect. Interkingdom analyses of the genes of the spliceosome revealed that the earliest eukaryote must have contained all of the >200 proteins and five non-coding RNAs as those present in higher eukaryotes, exhibiting a highly complex structure and function (9-12). In addition, in contrast to the current belief that the very first life form was a primitive organism with contiguous genes (13-18), comparative genomics and proteomics studies have shown the following: 1) the genome of the earliest life-form must have had a complex, eukaryotic-like, and gene-rich organization, and a proteome containing sophisticated proteins (19-35), 2) the complex eukaryotic cell could not have evolved from such a primitive organism (9-12, 36-40), and 3) a eukaryote, rather than a prokaryote was at the base of the evolutionary tree of life (41-53). These findings indicate that the genome of the very first life form could have contained eukaryote-like split genes encoding complex proteins, and question how these split genes could have evolved in the prebiotic system. In a companion study, we found that split coding sequences for complex proteins of modern eukaryotes are found at a high probability in computer generated random DNA sequences (P. Senapathy, et al, accompanying paper). In this study, our objective was to analyze if the other informational sequences for gene-regulation, splicing, and split-gene structures are found in random DNA, and whether they are similar to those of extant eukaryotic genes. The study centered on the concept that split genes, rather than contiguous genes, must have formed within pre-biotic random genetic sequences (DNA or RNA). This idea is based on probability. Informational sequences occur in a random sequence at a far higher probability when they are split into short pieces Origin of gene regulatory and splicing structures 1 Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA The inherent occurrence of complex intron-rich spliceosomal split genes, including regulatory and splicing elements, within pre-biotic random genetic sequences Growing evidence indicates that complex intron-rich split genes and an advanced spliceosome existed in the earliest eukaryote, and possibly the first life form. We sought to examine how these split genes could have originated in the prebiotic system. We previously found that split coding sequences for complex proteins occur in abundance in random DNA sequences (P. Senapathy, et al, accompanying paper). This study demonstrates that a full complement of exons, introns and regulatory and splicing elements could have also occurred inherently within pre-biotic chemistry by chance. By comparing the characteristics of split genes found in computer-generated random genetic sequences with those of several extant eukaryotes, we show that an abundance of intron-rich split genes akin to those present in modern eukaryotes could have existed in the prebiotic system. These findings answer the post-genomic question of why the earliest life form contained highly complex intron-rich split genes, and, in conjunction with our companion study, show how they could encode a complex spliceosome. Ashwini Bhasi 1 , Bipin Balan 2 , Brajendra Kumar 2 , Sudar Senapathy 1 , and Periannan Senapathy *,1,2 1 Department of Human Genetics, Genome International Corporation, 8000 Excelsior Drive, Madison, WI 53717, United States of America 2 Department of Bioinformatics, International Center for Advanced Genomics and Proteomics, Chennai, India

Upload: phamphuc

Post on 27-Jan-2019

214 views

Category:

Documents


0 download

TRANSCRIPT

1

Summary

This study shows that an abundance of authentic eukaryotic split genes, with a full complement of complex structures including exons, introns, regulatory and splicing elements, could have occurred by chance within pre-biotic random genetic sequences.

Introduction

Although the split structure of protein coding genes is fundamental to eukaryotic biology, its ultimate origin is not fully understood (1-3). Recent comparative studies of split-gene structure and the splicing process across phyla have led to several unexpected discoveries, contradicting prevailing assumptions. For example, it is currently thought that split genes existed in a primitive form in the eukaryotic ancestor, and increased in complexity through evolution towards higher animals and plants. In contrast, new findings from interkingdom analyses have shown that the common ancestor of all eukaryotes had complex split gene structures and a highly intron-rich genome as those of modern higher animals and plants (1, 4-8). In addition, the idea that splicing may have been a much simpler mechanism in the eukaryotic ancestor processed by a simple spliceosome now appears incorrect. Interkingdom analyses of the genes of the spliceosome revealed that the earliest eukaryote must have contained all of the >200 proteins and five non-coding RNAs as those present in higher eukaryotes, exhibiting a highly complex structure and function (9-12).

In addition, in contrast to the current belief that the very first life form was a primitive organism with contiguous genes (13-18), comparative genomics and proteomics studies have shown the following:

1) the genome of the earliest life-form must have had a complex, eukaryotic-like, and gene-rich organization, and a proteome containing sophisticated proteins (19-35),

2) the complex eukaryotic cell could not have evolved from such a primitive organism (9-12, 36-40), and

3) a eukaryote, rather than a prokaryote was at the base of the evolutionary tree of life (41-53).

These findings indicate that the genome of the very first life form could have contained eukaryote-like split genes encoding complex proteins, and question how these split genes could have evolved in the prebiotic system.

In a companion study, we found that split coding sequences for complex proteins of modern eukaryotes are found at a high probability in computer generated random DNA sequences (P. Senapathy, et al, accompanying paper). In this study, our objective was to analyze if the other informational sequences for gene-regulation, splicing, and split-gene structures are found in random DNA, and whether they are similar to those of extant eukaryotic genes.

The study centered on the concept that split genes, rather than contiguous genes, must have formed within pre-biotic random genetic sequences (DNA or RNA). This idea is based on probability. Informational sequences occur in a random sequence at a far higher probability when they are split into short pieces

Origin of gene regulatory and splicing structures 1

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

The inherent occurrence of complex intron-rich spliceosomal split genes, including regulatory and splicing elements, within pre-biotic random genetic sequences

Growing evidence indicates that complex intron-rich split genes and an advanced spliceosome existed in the earliest eukaryote, and possibly the first life form. We sought to examine how these split genes could have originated in the prebiotic system. We previously found that split coding sequences for complex proteins occur in abundance in random DNA sequences (P. Senapathy, et al, accompanying paper). This study demonstrates that a full complement of exons, introns and regulatory and splicing elements could have also occurred inherently within pre-biotic chemistry by chance. By comparing the characteristics of split genes found in computer-generated random genetic sequences with those of several extant eukaryotes, we show that an abundance of intron-rich split genes akin to those present in modern eukaryotes could have existed in the prebiotic system. These findings answer the post-genomic question of why the earliest life form contained highly complex intron-rich split genes, and, in conjunction with our companion study, show how they could encode a complex spliceosome.

Ashwini Bhasi1, Bipin Balan2, Brajendra Kumar2, Sudar Senapathy1, and Periannan Senapathy*,1,2

1 Department of Human Genetics, Genome International Corporation, 8000 Excelsior Drive, Madison, WI 53717, United States of America2 Department of Bioinformatics, International Center for Advanced Genomics and Proteomics, Chennai, India

2

than when they are contiguous. Furthermore, the regulatory and splicing sequences also should exist at a high probability within random genetic sequences, as they are relatively short and have a high tolerance to sequence variations. This model for the random formation of split genes in prebiotic genetic sequences is termed the random sequence origin of split genes (ROSG) (54-58). We compared the genes of nine eukaryotes from distinct phyla with split genes found in computer generated random DNA. The results show that split genes with authentic structural elements including regulatory sequences do occur in random DNA at comparable frequency and complexity as found in intron-rich genomes of modern higher eukaryotes. Our findings may help explain why the eukaryotic ancestor, and possibly the earliest life form, contained complex split genes as well as an advanced spliceosome for splicing them.

Rationale for the ROSG model

Two basic principles uphold the ROSG model: 1) The frequent occurrence of the three stop codons would have severely limited the length of open-reading frames (which in this paper is defined as the sequence between consecutive stop codons in the same reading-frame, ORFs); therefore, a sequence that coded for a complete protein, which is normally far longer than the limited length of randomly created ORFs, was forced to split into many short pieces (54-58), with intervening non-coding sequences. 2) The rarity of even these short informational segments would have forced the lengths of the intervening sequences (i.e., the introns) to be very long, and this is indeed found in most genes (P. Senapathy, et al, accompanying paper). Additionally, if short exons fully occupied the ORFs in which they occurred, their ends should inherently contain stop codons, which could have evolved into splice signals (54-58). Thus, fully functioning split genes coding for complex proteins must have occurred within long pre-biotic random genetic sequences, without having to evolve from contiguous genes encoding primitive proteins as is currently believed under the linear branching evolution (LBE) model.

RESULTS

Complex, intron-rich split genes in random DNA

We used established gene prediction software used in genome projects to determine if split genes would be found in random DNA sequences, and, if so, whether their structural features would resemble those of modern eukaryotic genes. Four organism categories were used to compare with the random genes: large genomes with intron-rich genes (sea urchin and human), small genomes with intron-rich genes (Trichoplax, sea anemone and Arabidopsis), small genomes with intron-poor genes (Drosophila, and Caenorhibditis) and very small genomes with highly intron-

poor genes (Plasmodium and yeast) (Supplementary Information Table 1). The genes from these groups vary mainly in the density of introns, and the length of exons, introns, and genes. They use essentially the same complex spliceosomal machinery and splicing and regulatory sequences (9-12), and encode equally complex proteins (59). Random DNA sequence data sets of lengths 5 KB to 300 KB (the approximate length range of eukaryotic genes) were generated computationally (Table 1). Split genes were found in multiple sequences for each data set using ab initio gene-prediction software such as Genscan (60), Augustus (61), and GeneID (62), each trained on the genomes of different model organisms. The length of the gene, exons and introns, and number of introns in random genes were compared with those of the nine eukaryotic genomes.

The results from Genscan compared with the human gene model are discussed below. Most of the random gene features were very similar to those in the human (Table 1 and Figure 1). The average length of exons was constant (~150 bases), though the length and the number of introns in genes increased with the length of the DNA sequence data set (5 KB to 300 KB). The density of introns was high (7-8 per gene) in random and human genes in longer datasets, as observed in intron-rich genomes (63, 64).

Genes predicted by different programs

Realizing that gene prediction programs such as GenScan are based on the hidden Markov model that were trained on known genes, we analyzed genes predicted by other gene-prediction programs (GeneID, Augustus) trained on genomes containing genes that exhibited widely varying split gene features. Irrespective of the program used and the organism trained, the results were remarkably consistent in two respects: random genes were intron-rich (average 7-8 introns per gene), and exons were very short (~150 bases; Supplementary Information Table 2). Gene prediction programs trained on intron-poor organism models (Plasmodium and yeast) produced few or no random genes, likely due to the complete lack of long ORFs in random DNA. This observation fits the ROSG model (see below).

Origin of gene regulatory and splicing structures 2

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

!"#$%&'()'*+,'

,-"./$"'+012".'()''3"#"4'

,-"./$"'3"#"'!"#$%&'

,-"./$"'56(#'!"#$%&'

,-"./$"'7#%.(#'!"#$%&'

,-"./$"'7#%".8$"#9:'*94%/#:"'

,-"./$"'#012".'()'7#%.(#4';".'$"#"'

! <4' =/#> <4' =/#>' <4' =/#> <4' =/#> <4' =/#>' <4' =/#>?@A' BCD' ECE' EDFB' GDFH' BDD' E?D' DIB' EEBJ K?E' BKG' HCK EC?'EH@A' ?C?' ECD' GFHB' DJ?B' GKE' E??' IHE' EDBG KHI' EHJJ' ECI GCI'G?@A' JCB' GCG' ?DJH' JDB?' GEI' E?G' EEHK' EIHD EJBE' BIGB' BC? DCF'?H@A' EGCK BCD' JDBD' EGJIJ' EJG' E?G' E?FI' EIKH BD?D' IJIK' ?CH FCG'EHH@A EFCG FCG' EDJED' EDFID' EIJ' E?E' GEDE' EKEB ?JHJ' EBIJ?' FC? FCJ'E?H@A EICJ KCK' EKJIB' E?FFK' EII' E?G' G?BH' EKEK I??J' E?D?E' ICG ICB'GHH@A EJCH EECF' GEJIG' E?JH?' EIG' E?G' GIFI' EKBE JEEB' EFDJJ' KCH ICB'G?H@A EJCK EDCB' GD?HI' EFGKD' EIE' E?E' GJK?' EKEK EHGBI' EIBFK' KCB ICF'BHH@A' GHCB' GFDKI' EFDFB' EIE' E?G' BED?' EKBE' EEH?E' EIFBK' KCF' ICF'EICH

Table 1. | The average values of the features of random split-genes (predicted from 1000 random DNA sequences of varying lengths shown) compared to those of human genes (occurring in genomic DNA of corresponding lengths). Hs - Homo sapiens; Rand - Random genes

3

The probability of split genes in random DNA

Our previous studies have shown that the lengths of ORFs are restricted in a random DNA sequence. Their mean lengths are 20 codons (due to the average occurrence of three stop codons for every 64 codons). The probability, W, that a waiting-interval, t, will occur after a sequence element with probability p, is: W(t) = p * (1-p)t (58, 65). It follows then that the probability of longer ORFs reduces exponentially. The probabilities of increasing ORF lengths plotted using this equation (Supplementary Information Figure 1) shows that the vast majority of ORFs are extremely short (95% are shorter than 180 bases; 99.9% are shorter than 600 bases, which on average, is the maximum exon length (54-58, 65).

We tabulated the average lengths of protein-coding sequences and their corresponding ORFs from the nine eukaryotes and two prokaryotes (Supplementary Information Table 3). They were equally long (average coding-sequence length of 3000-4000 bases). The EML for a 3000 base long ORF occurring in a random DNA sequence is exceptionally large (7 x 1020 bases). The very long contiguous ORFs required for the occurrence of the coding sequences of any of these extant proteins could not have existed in any practical length of random DNA sequence. Taking into account the codon degeneracy (CD) in genes and amino acid variability (AAVAR) in proteins (P. Senapathy, et al, accompanying paper), the EML of the random DNA sequence for the occurrence of a contiguously coding gene for a protein sequence of 400 amino acids (AA) with an average AAVAR of 16 AA per AA position is:

EML400 AA sequence = 1/(CD*AAVAR/64)400

= 1/(3.2*16/64)400 = 1039 bases, which is an impractical amount (1011 tons) of random DNA .

We proposed that nature circumvented this problem by splitting the long coding sequence of a protein into short sequence segments available from the short ORFs, and splicing them together into a contiguous coding sequence. This model requires that the length of the coding sequence of a split gene in

extant organisms should be proportional to the length of the split gene and the number of exons in the gene. We found this to be true (Supplementary Information Figure 2) (54), conforming to the following equations derived from the average lengths of exons (~150 bases) and introns (~2000 bases) in random split genes:

Number of exons in a gene = coding sequence length of the gene/150 bases

Length of the split gene = (Number of exons x 150) + (Number of introns x 2000) bases

The average length of introns in a split gene is also predictable. Effectively, the intron represents the waiting interval in a random sequence until the occurrence of the subsequent exon. The expected length of an intron, therefore, is approximately the EML of the subsequent exon - the length of that exon. Given that EML of intron >> length of the exon, the predicted intron length is approximately the EML of the subsequent exon. The reason that certain introns are very long (~104-105 bases) may be due to the lower probability of the exon located after that intron (due to the length of exon or the AAVAR of the protein sequence it encodes). Our analysis also shows evidence that the long introns in intron-rich random genes were genetic waste; however, they could have contained elements such as splicing enhancers that aided in their removal, as well as other regulatory elements such as miRNA, by chance. Nonetheless, even these elements would constitute a minute (~1%) fraction of these introns.

Extremely short exons in the first life forms

To further determine if the ROSG predictions are valid, we analyzed the lengths of exons and their corresponding ORFs within random genes and extant intron-rich genes by an exon-ORF length scatter plot. The length of every exon from a given set of genes was plotted against its corresponding ORF length.

The lengths of most exons and ORFs were very short in random genes (97% were <200 bases, Figure 2A). The spliced exons (i.e., complete coding sequence) and their corresponding

Origin of gene regulatory and splicing structures 3

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

!"##$%&'(%)"*+,%&)-.))/%0%1

#$%&'(%))/%0%)2%0(34

#$%&'(%)56-0)2%0(34

#$%&'(%)703&-0)2%0(34

#$%&'(%)703%& 8(%09:)

!913'0:%

#$%&'(%)0*+,%&)-.)

703&-01;'0<-+ =>? @ABCB @DA @CEF CBCE G>A!"#$%&'()$ @A>E B?=? @BA @DGC =?D? D

*"#+,%-'%)% AD>= A@G= A=@ @DC @G?D ?>E."#(-(/%)$ @E>E ACA@ AFA AEB @G=C D>C0"#1(-%)2/%$+(3 @A>= ===E =EG ?E? @CDB =>B

4"#5%-6'&%371 D>@ AAC= BAD @CE @CGF @>=B8"#6(3(9'$'%( D>C @?CB @=CE =A= DD@ F>FG

Table 2. | Features of split-genes from random genes (predicted by Genscan from 1000 random DNA sequences each 50 KB in length) compared to those of genes from different eukaryotes (occuring in 50 KB datasets of corresponding genomic DNA).

!"#$#%&" '(#) *)%"#) !#+,-.-/0%&

1231324142515261621

7

2

*

8

9

:

'

;

<

=

.

Figure 1. | The structure of random genes (predicted by Genscan software) and human genes from 50 KB datasets. The positions of regulatory sequences, exons and introns are shown for five random genes (A, B, C, D and E) and five human genes (F, G, H, I, and J), each containing a minimum of 10 exons (scale 0-50 KB).

4

ORF lengths from each gene were much longer (with a peak around 2,250 bases, Figure 2A). This finding supports the ROSG premise that such long ORFs arrived by splicing short coding sequences together.

Similar features were observed for the large intron-rich genomes (human, Figure 2B; and sea urchin, not shown) and the small intron-rich genomes (arabidopsis, Figure 2C; and trichoplax and sea anemone, not shown). It is likely that intron rich genes with long introns correspond to the random genes, and those with short introns were derived by shortening the introns (54).

Smaller, intron-poor genomes (drosophila, caenorhibditis, plasmodium, and yeast; Figures 2D-2G,) revealed a higher fraction of exons >600 bases, and correspondingly longer ORFs. These non-conforming long exons likely result from splicing two or more shorter exons, with associated loss of introns. The lengths of their genes’ spliced coding sequences and corresponding ORFs (Figures 2D-2G) were as long as those of intron-rich genes and random genes (See also Supplementary Information Table 3).

The origin of splice signals

The ROSG model states that an exon can either completely occupy an ORF or be contained within it. If the majority of exons fill their respective ORFs, the pre-biotic molecular evolutionary system could have used the ends of ORFs as the splicing points. The stop codons bordering the ORFs could have evolved into splicing sequences, as splicing would have essentially joined the ends of many different ORFs. If this is true, the ends of exons of random genes and extant genes should contain stop codons that are part of the splicing signals (54-58).

We tested this in random and extant genes using the Exon-ORF (ExORF) Plot (P. Senapathy, Unpublished, (54-57)) that reveals the structures of exons within their corresponding ORFs. The majority of exons in random genes fully occupied their corresponding ORFs (see Figure 3 for a representative ExORF, and http://66.170.16.155/cgi-bin/EOP/eop_new.pl for ExORFs of more random genes and extant genes). The plot also shows that the ends of these exon-containing ORFs are parts of splice signals (using splice sequence data from the EuSplice database (66)).

The ExORF patterns of the human genes were extremely similar to those of random genes (Figure 3). Most of the ORFs in intron-poor genes of small genomes were short, consistent with the expected random ORF lengths (<600 bases), with the exception of the rare, long ORFs that contained long exons (which were far longer than the random ORF lengths). This exception also fits the ROSG view that the intron-poor, long-exon containing genes of small genomes must have been derived

Origin of gene regulatory and splicing structures 4

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

Figure 2. | Confinement of exons within short, random ORFs. The lengths of exons from split genes in 100 KB datasets of the seven genomes shown were analyzed by a scatter plot as described in Methods. The lengths of exons (blue dots) and CDS (spliced exons; red dots) of 200 genes above the median CDS length in the dataset were plotted against the length of their corresponding ORFs. The frequencies of exon lengths (green bars; bin size 100 bases) and CDS lengths (purple bars; bin size 100 bases) were plotted against the lengths of their corresponding ORFs.

!"

#!!!"

$!!!"

%!!!"

&!!!"

'!!!"

!"

$!"

&!"

(!"

)!"

#!!"*+,"-./0.1234.5671"-./0.1234.*+,"8.14295671"8.1429""""

!"":31;7<

!"

#!!!"

$!!!"

%!!!"

&!!!"

!"

$!"

&!"

(!"

)!"

!"

#!!!"

$!!!"

%!!!"

&!!!"

!"

$!"

&!"

(!"

)!"

!"

#!!!"

$!!!"

%!!!"

&!!!"

!"

$!"

&!"

(!"

)!"

8.1429"7="5671"7/"*

+,

!"

#!!!"

$!!!"

%!!!"

&!!!"

!"

$!"

&!"

(!"

)!"

!"

#!!!"

$!!!"

$!!!"

%!!!"

&!!!"

!"

$!"

&!"

(!"

)!"

!"

#!!!"

%!!!"

&!!!"

!"

$!"

&!"

(!"

)!"

>:?"@.1429"

?/.AB.10C"7=".671D"7/">:?D

!'!!

#!!!

#'!!

$!!!

$'!!

%!!!

%'!!

&!!!

&'!!

'!!!

#""EB<31

$""F/3GH;7IDHD

%""+/7D7I9H@H3

&""*3.17/9HG;H2HD

'""-@3D<7;HB<

("",30093/7<C0.D

5

from intron-rich, short-exon containing random genes in pre-biotic chemistry (54). In addition, exon-containing ORFs were extremely rare (Figure 3), indicating that very few biologically useful sequence segments were available for forming the split gene.

The majority of exons are bordered by stop codons in random genes and the genes of all the nine eukaryotes at the same position of splice signals, demonstrating that these stop codons are key integral parts of the splice signals (Figure 3), as predicted by ROSG (54-58). The frequencies of stop codons at exon ends in both random and extant genes were far higher than those expected by chance for any given codon (Supplementary Information Table 4). Additionally, the stop codons occur as key parts of the AG:GT canonical splice signal sequence on the side of introns. These findings substantiate the possibility that splice signals evolved from the stop codons at ORF ends (54-58) during the pre-biotic molecular evolution of split genes.

Authentic regulatory sequences in random genes

If authentic split-gene structures were to occur in random DNA, then the different regulatory sequences should exist in the correct order, distance, and combinations as are found in extant eukaryotic genes. We computed the probabilities and EMLs of these regulatory (promoter and PolyA site) and splicing (donor and acceptor) signals based on their variable sequences in eukaryotic genes, and found that they should indeed occur at a high frequency and proper sequential order (data not shown). As predicted, the structures of the regulatory and splicing signals of the random genes were found to resemble those of extant eukaryotic genes (Figure 1). To verify that the regulatory and splicing sequences in random genes exhibited authentic sequence characteristics, we compared their position weight matrices

(PWMs) with those of extant genes (67-69). The PWM scores and consensus sequences of random genes and eukaryotic genes were remarkably similar (Table 3).

We had previously found that the splicing elements and branch point sites occurred at a nearly random frequency in human genes, and that the real elements of these genes were embedded among them (67-69). These findings support the idea that all of these elements occurred abundantly in random genetic sequences, in which split gene structures occurred, without the need for evolving them from contiguous genes. Our studies also show that the pre-biotic stochastic selection of the 4-character DNA alphabet and 20-character protein alphabet enabled the abundant occurrence of these structural elements and the complete split genes within random genetic sequences.

Origin of gene regulatory and splicing structures 5

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

Figure 3. | The Exon-ORF Plots (ExORFs) of a representative (A) random gene and (B) human gene (coding for ATP-dependent RNA helicase DDX42 protein). The ORFs in the three reading frames of the split (i.e., unspliced) genes are shown on the left. The ORF patterns of the complete spliced coding sequence are shown on the right. Introns with several stop codons occupy the majority of unspliced RFs. All exons are confined within short ORFs and most exons are bordered by stop codons in the unspliced sequence, indicating that they are part of the splice signals. The ORF pattern upstream of the spliced coding sequence (long yellow box), which may contain regulatory sequences, is similar to those of introns.

Table 3. | The consensus sequences of regulatory sequences (Position Weight Matrices (PWMs)) from random genes (average from 1000 random DNA sequences), and human genes (EuSplice database for splice signals; Eukaryotic promoter database, for promoters, and PolyA DB for PolyA signal) (see Methods).

!"#$%&'(%() *$%+,#&-(%()

.//(01,2 345346347348347348344347386389:379 .;<<';<< =*4;

35>35>35635435435835535434;34<:384.;<<';<< =*5>

?,%,2 @57.68'8;=&';<< A;<< *7; .5B'48: @59.9> '58=&';<< A;<<*48.6;'4< :C2,#,1(2 A9> D95 D66 D5B D4B .5>.65.>9 A9; D9; D97 D59 D65.56.6;.99C,EF.&)G1( .88.48A79.75.7;.;<< .77.79A47.88.77.78

1 6 12 18 24 30 36 42

45539 1

7496

!""#$%&'(")*%*"*+'%",#-"./'0!!!!!!"#$%&#!'()*+*,-""

1""23($%")*%*"-456789:"*+'%",#-"./'0!!!!!!"#$%&#!'()*+*,-"" .$/#&!'()*+*,-""

.$/#&!'()*+*,-""012

2 3

4

24

015

255

6 27 28

9 2: 29

01::

8

";'#' ";'#'

22 26

<

012

015

01:

1 6 12 18 24 30 36 42

42701 1

5363

8

4

";'#' ";'#'

2:

27 28 29395

26<:222

6 25012

015

01:

012

015

01:

6

Discussion

Complex split genes in the first life form

Under ROSG, sequence probability is particularly important. The probability of regulatory and splicing signals of different genes are nearly the same, and the probabilities of the different exons of genes (restricted to an average length of 150 bases in intron-rich genes) coding for entirely distinct proteins are also essentially equal (P. Senapathy et al, accompanying paper). The EML of the random DNA sequence required for the chance occurrence of a split gene coding for any given protein, exhibiting codon degeneracy and amino acid variability, has been computed to be ~1016 bases (1 µg) (P. Senapathy, et al, accompanying paper). The same 1 µg of random DNA would have contained split genes for one copy of virtually any given protein sequence of the biota. To have more copies for the genomes of more distinct life forms, this value needs only to be multiplied (e.g., one million times more is only one gram of DNA). The split genes could have then combined into eukaryotic genomes by the same self-assembly mechanisms that have been proposed for the first primitive life form by the LBE model. It follows then that the origin of life need not have required a large amount of pre-biotic genetic sequences.

The pre-requisite for any gene to originate and evolve in the pre-biotic system under LBE or ROSG is that genetic sequences must have existed in the pre-biotic system. For LBE, 1011 tons of random DNA is required. For ROSG, 1 µg - 1 mg of random DNA is required. On this basis, this study shows that intron-rich split genes with short exons and long introns, and regulatory and splicing sequences, as those found in modern intron-rich large genomes could have occurred in abundance in a small amount of pre-biotic genetic sequences. These genes could have provided the foundation for the pre-biotic evolution of the first eukaryotic genomes. The post-genomic findings that the earliest eukaryote contained highly intron-rich genes as well as a fully formed spliceosome (1, 4-12) are consistent with our findings.

The validity of random split genes

The essentially consistent results from different gene prediction tools and gene models surmount the potential objection that these gene prediction programs are generative HMM which were trained on known genes. Our elemental finding is that complex intron-rich protein-coding split genes with short exons and long introns, and with genuine regulatory and splicing sequences, must have occurred at high frequency in pre-biotic random genetic sequences.

Regulatory sequences in random genes

We have shown here that the “core” TATA box promoter elements occur in random genes as frequently as in eukaryotic genes (Table 3). Recent findings show that eukaryotic genes may have multiple promoters with multiple transcription start

sites (TSSs), and that alternative promoter usage generates diversity and complexity of gene expression (70). These promoters contain several recognition sequences such as the TATA box and CpG islands. The un-translated regions (UTRs) of genes also contain multiple regulatory sequences. These studies reveal that the regulatory elements are short and occur at variable distances from the TSS with sequence variations in different genes. We have previously shown that given sequence elements could occur at very short distances in a random DNA sequence due to the negative exponential distribution of the distances (waiting intervals) between them (65). Thus, under ROSG, the probability for the chance occurrence of multiple promoter and UTR elements that tolerate sequence variations around transcription initiation or termination sites of genes that occur within random sequences is very high. In contrast, it is extremely improbable for the evolution of the highly complex eukaryotic promoters from the simple TATA or CAT box promoters of bacterial genes through random genetic mutations.

Split ncRNA genes

Apart from protein-coding genes, non-coding RNA (ncRNA) genes should have also formed randomly in prebiotic chemistry. It turns out that most of the ncRNA genes are also split, which increases their probability of occurring in random DNA. It has now been established that tRNA and rRNA genes are split (71-75), and possibly tolerate sequence variations. Furthermore, recent studies have shown that tRNA genes must have originated as split genes in the earliest life form (72, 74-76). A typical miRNA (18-22 bases) is expected to occur in approximately 1013 (422) bases, which is far shorter than the random DNA sequence of 1016 bases (1 µg) required for the chance occurrence of any protein coding gene (P. Senapathy, et al, accompanying paper). Thus, all the necessary protein coding genes, regulatory sequences, and ncRNA genes that are required for the origin of a eukaryotic genome could have occurred within a relatively short random genetic sequence.

Concluding Remarks

The linear branching evolution model’s assumption that a primitive set of contiguously coding genes originated pre-biotically in a bacterium-like life form and evolved into the entire set of complex split genes of all eukaryotes is no longer sound in light of several post-genomic findings. However, there has been no alternative solution in the literature to address this incorrect assumption. Furthermore, although the possibility that split genes were the very first genes has existed for thirty years (1-3), there has been no attempt in the literature other than the ROSG model (54-58) to explain how these split genes originated and served as the basis for the first life form.

ROSG shares the same pre-biotic chemistry and self-assembly mechanisms as those of LBE, and assumes that pre-biotic genetic sequences must have existed, but it splits from LBE on the nature of the first genes. While impractically huge

Origin of gene regulatory and splicing structures 6

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

7

amounts of random DNA would have been required for the chance occurrence of a contiguously coding gene, our study shows that one microgram to a milligram of pre-biotic random DNA would have been sufficient to contain virtually all of the split genes of the biota. This study provides a non-teleological, chance based approach for explaining the pre-biotic origin of complex split genes. Thus, ROSG provides a viable model for the origin of biological information and the origin and diversity of life. The conclusions of this analysis may serve as a platform for further studies on the origin of complex split genes and their evolution into genomes and organisms.

Methods

Data source

The annotated Genbank data and the nuclear sequence data (FASTA) for the seven genomes H. sapiens, A. thaliana, D. melanogaster, C. elegans, P. falciperum, S. cervisiae and E. coli respectively were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/) and parsed to extract the relevant features. Additionally the S. purpuratus, N. vectensis and T. adherens genomic information were downloaded from the links ftp://ftp.ncbi.nih.gov/genomes/Strongylocentrotus_purpuratus, http://genome.jgi-psf.org/Nemve1/Nemve1.download.ftp.html, and http://genome.jgi-psf.org/Triad1/Triad1.download.ftp.html, respectively. The human promoter data set was downloaded from the Eukaryotic Promoter database (EPD) http://www.epd.isb-sib.ch/, and the PolyA signal data set for human was downloaded from the PolyADB (http://polya.umdnj.edu/polyadb/). We used various gene prediction software based on Markov model and trained on different model genomes. Genscan, Augustus, and GeneID were downloaded from http://genes.mit.edu/license.html, http://augustus.gobics.de/binaries/, and ftp://genome.crg.es/pub/software/geneid/, respectively. The human splice signal data was used from the EuSplice database (http://eusplice.genome.com/). Random DNA sequences of different lengths were generated using the Perl program.

Comparison of gene structures in random and extant genomes

We computationally generated different sets of 1000 random DNA fragments of lengths 5, 10, 25, 50, 100, 150, 200, 250 and 300 KB, respectively. Each set of 1000 sequences of varying lengths was separately provided as input to the Genscan or other tools. The output file was parsed and only the authentic genes containing promoter, exons and polyA site were taken into consideration. Additionally, non-overlapping contiguous fragments of lengths corresponding to the random DNA data set lengths were extracted from the genomes of the nine organisms. Complete genes present within these lengths of fragments were isolated based on their Genbank annotation, and the average number of genes, exons and introns per gene, and average gene length, exon length, intron length and the intergenic distances, respectively, were identified from these data sets. Comparative

analysis of these gene features were conducted between the random genes and the genes from the nine organisms. We used the data set length of 50 KB for comparing the random gene features with those of extant genomes. The average length of introns in large intron-rich genomes is approximately 2000 to 3000 bases; so the average length of a 10-exon gene is ~25KB. As there are genes with >100 exons in such genomes, we wanted to use a reasonable length of a gene (e.g., 50 KB) for comparison. Furthermore, we used 50 KB as reasonable representative length of a eukaryotic gene sequence, as the lengths of genes in eukaryotic genomes range between 1000 bases up to two million bases (1).

Exon-ORF (ExORF) analysis of random genes

The ExORF (http://66.170.16.155/cgi-bin/EOP/eop_new.pl) was developed in Perl/CGI. ExORF plots were generated using Perl GD graphics module and populated into a MySQL 4.1 database. ExORF user interface facilitates the user to display the comparative view of un-spliced and spliced gene sequences in three reading-frames (RFs) with exons overlaid in corresponding ORFs in the respective RFs where they occurred. This site contains the ExORFs for genes in six extant genomes, and 200 random genes predicted by Genscan gene prediction tool.

Computing the Position Weight Martices (PWMs) and consensus sequence of regulatory sequences

Only authentic genes containing both a promoter and a polyA site, that were predicted in the random DNA datasets using a gene prediction software (Genescan, Augustus or GeneID), were chosen. Coordinates of exons, introns and regulatory sequences (promoters, splice signals, and polyA sites) of the predicted genes from random sequences (termed random genes) were obtained from the given software’s result files using a parser script. We determined the PWMs and the consensus sequences of the promoter, polyA sites, and donor and acceptor splice signals from 1000 random genes. Promoter, polyA and splice signal PWMs for human genes were determined from their respective databases (see above).

Analysis of stop codons within the splice signals of extant and random genes

The frequency of codons at the -3 position of the acceptor splice signal and the +2 position of the donor splice signal were computed from 9,309 known human genes (with "Reviewed" and "Validated" tags) extracted from Eusplice dataset, and 1,778 random genes predicted from 1000 random DNA sequences each of 50 KB length. The frequencies of the three stop codons, CAG, and the other codons with a single base difference from the stop codons (SBD) were determined.

Origin of gene regulatory and splicing structures 7

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

8

Acknowledgements

We wish to thank Kanika Arora, Vinu Manikandan, Raj Kuppuswami, Phillip Simon, Chandan Kumar Singh, Kavin Senapathy, Jeffrey Mattox and H. Adam Steinberg for their assistance in improving the manuscript, and Tomasz Wojciechowski for assistance with statistics and probability. We also wish to thank ArtforScience.com for help in preparing the figures.

Correspondence should be addressed to P. S. ([email protected])

References

1. E. V. Koonin, J Hered 100, 618 (Sep-Oct, 2009).2. S. W. Roy, W. Gilbert, Nat Rev Genet 7, 211 (Mar, 2006).3. F. Rodriguez-Trelles, R. Tarrio, F. J. Ayala, Annu Rev Genet

40, 47 (2006).4. I. B. Rogozin, Y. I. Wolf, A. V. Sorokin, B. G. Mirkin, E. V.

Koonin, Curr Biol 13, 1512 (Sep 2, 2003).5. J. C. Sullivan, A. M. Reitzel, J. R. Finnerty, Genome Inform

17, 219 (2006).6. E. Pennisi, Science 317, 27 (Jul 6, 2007).7. N. H. Putnam et al., Science 317, 86 (Jul 6, 2007).8. M. Srivastava et al., Nature 454, 955 (Aug 21, 2008).9. L. Collins, D. Penny, Mol Biol Evol 22, 1053 (Apr, 2005).10. E. V. Koonin et al., Genome Biol 5, R7 (2004).11. V. Anantharaman, E. V. Koonin, L. Aravind, Nucleic Acids

Res 30, 1427 (Apr 1, 2002).12. M. Lynch, A. O. Richardson, Curr Opin Genet Dev 12, 701

(Dec, 2002).13. R. F. Doolittle, Science 214, 149 (Oct 9, 1981).14. R. F. Doolittle, Trends Biochem Sci 14, 244 (Jul, 1989).15. Kuppers. (The MIT Press, Massachusetts, 1989), pp. 59-61.16. F. B. Salisbury, Nature 224, 342 (1969).17. J. Monod. (Vintage Books, New York, 1972).18. H. P. Yockey. (Cambridge University Press, New York,

2005).19. C. P. Ponting, R. R. Russell, Annu Rev Biophys Biomol

Struct 31, 45 (2002).20. C. A. Orengo, J. M. Thornton, Annu Rev Biochem 74, 867

(2005).21. R. L. Marsden et al., Philos Trans R Soc Lond B Biol Sci

361, 425 (Mar 29, 2006).22. G. Caetano-Anolles, M. Wang, D. Caetano-Anolles, J. E.

Mittenthal, Biochem J 417, 621 (Feb 1, 2009).23. C. G. Kurland, B. Canback, O. G. Berg, Biochimie 89, 1454

(Dec, 2007).24. N. Glansdorff, Mol Microbiol 38, 177 (Oct, 2000).25. M. Wang, L. S. Yafremava, D. Caetano-Anolles, J. E.

Mittenthal, G. Caetano-Anolles, Genome Res 17, 1572 (Nov, 2007).

26. S. Yang, R. F. Doolittle, P. E. Bourne, Proc Natl Acad Sci U S A 102, 373 (Jan 11, 2005).

27. B. Labedan et al., J Mol Evol 49, 461 (Oct, 1999).28. G. Caetano-Anolles, D. Caetano-Anolles, Genome Res 13,

1563 (Jul, 2003).

29. P. Forterre et al., Biosystems 28, 15 (1992).30. R. L. Sherrer, P. O'Donoghue, D. Soll, Nucleic Acids Res 36,

1247 (Mar, 2008).31. P. Alifano et al., Microbiol Rev 60, 44 (Mar, 1996).32. A. Habenicht, U. Hellman, R. Cerff, J Mol Biol 237, 165

(Mar 18, 1994).33. N. Benachenhou-Lahfa, P. Forterre, B. Labedan, J Mol Evol

36, 335 (Apr, 1993).34. B. Labedan, Y. Xu, D. G. Naumoff, N. Glansdorff, Mol Biol

Evol 21, 364 (Feb, 2004).35. O. Kandler, in Early Life on Earth. (Columbia University

Press, New York, 1994), vol. 8, pp. 152-160.36. C. Rotte, W. Martin, Nat Cell Biol 3, E173 (Aug, 2001).37. W. Martin, Proc. R. Soc. Lond. B Biol. Sci 266, 1387 (1999).38. W. Martin, Curr Opin Microbiol 8, 630 (Dec, 2005).39. T. M. Embley, W. Martin, Nature 440, 623 (Mar 30, 2006).40. C. G. Kurland, L. J. Collins, D. Penny, Science 312, 1011

(2006).41. A. Poole, D. Jeffares, D. Penny, Bioessays 21, 880 (Oct,

1999).42. A. M. Poole, D. C. Jeffares, D. Penny, J Mol Evol 46, 1 (Jan,

1998).43. D. Penny, A. Poole, Curr Opin Genet Dev 9, 672 (Dec,

1999).44. N. Glansdorff, Y. Xu, B. Labedan, Biol Direct 3, 29 (2008).45. L. A. Katz, Int J Syst Evol Microbiol 52, 1893 (Sep, 2002).46. P. Lopez, P. Forterre, H. Philippe, J Mol Evol 49, 496 (Oct,

1999).47. E. Bapteste, C. Brochier, Trends Microbiol 12, 9 (Jan, 2004).48. P. Forterre, N. Benachenhou-Lafha, B. Labedan, Nature 362,

795 (Apr 29, 1993).49. P. Forterre, Nature 335, 305 (1992).50. P. Forterre, H. Philippe, Bioessays 21, 871 (1999).51. H. Philippe, P. Forterre, J. Mol. Evol 49, 509 (1999).52. P. Forterre, H. Philippe, Biol Bull 196, 373 (Jun, 1999).53. P. Forterre, ASM News 63, 89 (1997).54. R. Regulapati, A. Bhasi, C. K. Singh, P. Senapathy, PLoS

One 3, e3456 (2008).55. P. Senapathy, Science 268, 1366 (Jun 2, 1995).56. P. Senapathy, Proc Natl Acad Sci U S A 85, 1129 (Feb,

1988).57. P. Senapathy, Proc Natl Acad Sci U S A 83, 2133 (Apr,

1986).58. P. Senapathy, Independent Birth of Organisms. Genome

Press, Madison, Wisconsin, 1994.59. E. Szathmary, F. Jordan, C. Pal, Science 292, 1315 (May 18,

2001).60. C. Burge, S. Karlin, J Mol Biol 268, 78 (Apr 25, 1997).61. M. Stanke, R. Steinkamp, S. Waack, B. Morgenstern,

Nucleic Acids Res 32, W309 (Jul 1, 2004).62. G. Parra, E. Blanco, R. Guigo, Genome Res 10, 511 (Apr,

2000).63. J. C. Venter et al., Science 291, 1304 (Feb 16, 2001).64. E. Sodergren et al., Science 314, 941 (Nov 10, 2006).65. P. Senapathy, Molecular Genetics 7, 53 (1988).66. A. Bhasi, R. V. Pandey, S. P. Utharasamy, P. Senapathy,

Bioinformatics 23, 1815 (Jul 15, 2007).

Origin of gene regulatory and splicing structures 8

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

9

67. P. Senapathy, M. B. Shapiro, N. L. Harris, Methods Enzymol 183, 252 (1990).

68. M. B. Shapiro, P. Senapathy, Nucleic Acids Res 15, 7155 (Sep 11, 1987).

69. N. L. Harris, P. Senapathy, Nucleic Acids Res 18, 3015 (May 25, 1990).

70. A. Sandelin, Nat Rev Genet 8, 424 (2007).71. T. Shimada, Insect Mol Biol 1, 45 (1992).72. M. Di Giulio, J Theor Biol 253, 587 (Aug 7, 2008).73. A. Soma et al., Science 318, 450 (Oct 19, 2007).74. L. Randau, D. Soll, EMBO Rep 9, 623 (Jul, 2008).75. M. Di Giulio, EMBO Rep 9, 820; author reply 820 (Sep,

2008).76. K. Fujishima et al., Proc Natl Acad Sci U S A 106, 2683

(Feb 24, 2009).

Origin of gene regulatory and splicing structures 9

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

10

Supplementary Figures and Tables

Origin of gene regulatory and splicing structures 10

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

Supplementary Figure 1. | (A) The probability and EML of increasing lengths of protein-coding sequences (ORFs) in random DNA. The probability and the EML of increasing coding-sequence lengths were computed using the formula described in Methods. The plot shows that the probability decreases and the EML increases exponentially as the ORF length increases. (B) A log plot of the probability and EML for longer coding sequences.

Supplementary Figure 2. | The length of the coding sequence and the split gene as a function of increasing number of exons per gene. The number of exons, length of coding sequence, and length of the complete split gene for each gene in the human genome were tabulated from the EuSplice database (1). The number of genes with specified number of exons was recorded, and the average lengths of the coding sequence and the split gene for this dataset were computed. The average length of the coding sequence and the split gene were plotted as a function of increasing number of exons (bin size 10 exons) per gene (up to 80 exons per gene).

!"!

#"!$%$#!&

'"!$%$#!&$

("!$%$#!&

)"!$%$#!&$

*+,$-./0102$

!$

#!$

'!$

(!$

)!$

3!$

4!$

!$ '!!!$ )!!!$ 4!!!$ 5!!!$

,67$-*+,2$./010$

!

"

!"!$

!"'$

!")$

!"4$

!"5$

#"!$

!$ #!!$ '!!$ (!!$ )!!$ 3!!$ 4!!$ &!!$ 5!!$

8-9:; <2$

9:;-./0102$

=4!$

=3!$

=)!$

=(!$

='!$

=#!$

!$

,67$8$-9:; <2$

9:;-./0102$

!""#$%&'(")*"+),-%&".$/0$%+$

1""#$%&'(")*"&$%$

!

"

#

$

%&

%!

%"

&'

(&

%&&

%(&

!&&

!(&

)&&

)(&

"&&

&

%*%&' %%*!&' !%*)&' )%*"&' "%*(&' (%*#&' #%*+&' +%*$&'

,-./01.'2.3145'67'869:31

;.<=.38.'>%&)'?@A'

,-./01.'2.3145'67';@2:4'1.3.

;.<=.38.'>%&)'?@A'

B=C?./'67'DE63;'F./'G.3.'

11

Origin of gene regulatory and splicing structures 11

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

Supplementary Table 2. | Features of the random split-genes predicted by three different gene-prediction tools trained on different genomes (Augustus, GeneID and GenScan). Average values of genes predicted from 1000 different random DNA sequences (each 50KB in length) are shown.

Gene prediction program

Training genome

(Organism)

Average Number of

Genes

Average Gene

Length

Average Exon

Length

Average Intron Length

Average number of

introns per gene

Augustus

Human 2.8 10642 116 3622 2.2

Augustus

Arabidopsis 6.3 7227 145 250 17.2

Augustus C.Elegans 0.8 6505 164 850 5.8Augustus Drosophila 6.3 7224 145 250 17.2

Augustus

S.cerevisiae 0.02 370 219 109 0.4

GeneID

Human 0.2 24239 178 3303 11.5

GeneIDArabidopsis 0.1 21547 139 1688 24.8

GeneID C.Elegans 0.2 21908 164 3550 10.6GeneIDDrosophila 0.06 19447 87 7804 3.5

GeneID

Plasmodium 0 0 0 0 0

GenScanHuman 3.4 12979 152 1780 6.2

GenScan Arabidopsis 5.5 8550 119 649 11.9GenScanMaize 6.3 6010 60 2513 1.1

Supplementary Table 1. | Details of the four genome (organism) categories used in the study.

Genome Category Organism

Genome Size

(million bases)

Number of Genes

Average Gene

Length

Average Exon

Length (bases)

Average Intron Length (bases)

Average Number

of introns

Large genomes with intron-rich

genes

Homo sapiens 3300 ~25,000 10-15 kb 170 5419 7.4Large genomes with intron-rich

genes Sea Urchin 814 ~23,300 7.7 kb 115 750 7.3Small genomes with intron-rich

genesArabidopsis

thaliana 115 ~28,000 2 kb 254 413 6.5

Small genomes with intron-poor

genes

Drosophila melanogaster 123 13,379 11.3 kb 750 1411 3Small genomes

with intron-poor genes Caenorhabditis

elegans 100.3 19,427 4 kb 221 466 5

Highly intron-poor genes

Plasmodium falciparum 22.8 5,268 2515 bp 940 176 1Highly intron-

poor genes Saccharomyces cerevisiae 12.5 5,770 1.6 kb 1450

Supplementary Table 3. | The average lengths of protein-coding sequences and the corresponding ORFs that contain them in the genomes of different organisms (from 50 KB datasets).

OrganismAverage length of coding

sequence

Average length of

ORFs containingthe coding-sequence

Random 3115 3163

H. sapiens # 1825 1915

A. thaliana 5134 5161

C. elegans 4141 4197D. melanogaster 3163 3248

P. falciparum 6926 6955S. cerevisiae 5316 5316

# The values in 100 KB data sets are relatively longer

12

Origin of gene regulatory and splicing structures 12

Copyright © 2010 by Periannan Senapathy, Genome International Corporation, Madison, WI 53717, USA

Codons

Acceptor Splice Signal(-3 position)

Count (percent)

Acceptor Splice Signal(-3 position)

Count (percent)

Donor Splice Signal(+2 position)

Count (percent)

Donor Splice Signal(+2 position)

Count (percent)Codons

Human Random Human Random

TGA 3 (0) 0 (0) 43574 (28) 1543 (15)TAA 4 (0) 0 (0) 60416 (38) 3012 (30)TAG 45597 (29) 4106 (41) 11673 (7) 952 (10)

CAG 102260 (65) 4568 (46) 106 (0.06) 0 (0)

SBD 9362 (6) 1287 (13) 38950 (25) 4085 (41)OTHERS 433 (0.3) 26 (0.3) 2916 (2) 391 (4)

Supplementary Table 4. | The frequencies of stop codons at exon borders within splice signals in random genes and human genes.

A total of 157,659 acceptor and 157,635 donor sequences from the human and 9,987 acceptor and 9,983 donor sequences from the random genes were used. Acceptor and donor sequences for human were taken from 9,309 genes (reviewed and validated status) from EuSplice database (1), and from 1,778 random genes predicted from 1000 random sequences each of 50 KB length.