Post on 08-Aug-2020
HCG: a database for hierarchical classification of functionally equivalent genes in prokaryotes
Fenglou Mao*, Hongwei Wu*, Victor Olman, Ying Xu1
Computational Systems Biology Laboratory Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics
University of Georgia, Athens, GA 30602, USA *These authors contributed equally to this paper
1 Corresponding author
Abstract
Background: The existing gene annotation schemes generally classify genes into two
levels of parallel and unrelated homologous and/or orthologous gene groups, limiting our
capabilities for gene function prediction at higher resolution. While homology and
orthology are useful concepts for evolutionary studies of genes, they may not be the most
appropriate ones for functional classification of genes, especially at a high-resolution
level.
Results: We present a new gene annotation database: the hierarchical classification
system of genes (HCG), which provides functional annotation of prokaryotic genes in
general at higher resolution than the existing functional classification schemes. The HCG
database consists of clusters, hierarchically organized, of functionally equivalent genes at
varying levels of resolution. Gene clusters at the top of the HCG hierarchy represent
homologous gene groups, and descendant gene clusters represent functionally
equivalent genes at increasingly higher resolution going down from the top to the leaf-level
clusters along the classification hierarchy. We also provide several examples to
demonstrate how HCG can be used to make specific gene function annotations. For each
HCG cluster, we provide a p-value assessing the statistical significance in grouping its
genes together, based on the functional relationship among its genes and their
relationship with genes outside of the cluster.
Conclusion: The HCG database, implemented using MySQL, currently consists of
658,174 genes, 51,205 clusters organized into 21,109 trees, from 224 prokaryotic
genomes. The on-line database supports four search capabilities, namely (1) browsing
HCG classification by trees, (2) browsing HCG classification by organisms, (3) querying
a gene against the HCG database to find its gene cluster at the highest resolution possible
and its parent clusters, if any, and (4) annotating sequences provided by a user.
1. Background
With the rapid accumulation of genome sequences along with their genes accurately
predicted, numerous efforts have been devoted to the computer-aided functional
annotation of genes, which have led to the development of a number of functional
classification schemes and associated databases such as Clusters of Orthologous Groups
(COG) [1], Pfam [2], and InterPro [3]. There are also other databases that integrate gene
annotation information with pathway information, such as Kyoto Encyclopedia of Genes
and Genomes (KEGG) [4], BioCyc[5] and the subsystem annotation environment SEED
[6]. While these and other functional classification schemes and databases provide highly
useful information for functional annotation of genomes, they are generally limited to
classifying genes into homologous and/or orthologous gene groups, although
homology and orthology are defined in evolutionary terms and do not necessarily indicate
a functional relationship between genes. The classification result of such schemes is generally represented as
a collection of parallel and unrelated functionally “equivalent” gene groups, providing a
two-level classification of functionally equivalent genes. We believe that the functional
relationship between genes can be better represented using a hierarchical system, a view
supported by the recent development of the Gene Ontology (GO) [7], which employs a DAG
(Directed Acyclic Graph) structure that is more general than a hierarchical structure. Generally,
gene function classifications can be grouped into two classes: two-level classifications
such as COG, KEGG orthologs and Pfam, and multi-level classifications such as GOA and
our classification scheme HCG.
The Gene Ontology Annotation (GOA) Database [8] is the only database to date that
employs a multi-level classification of gene functions. GOA annotates
genes using GO terms, so it stands on solid ground for function classification. However,
most annotations in GOA are extracted from UniProt and InterPro using three scripts
(ec2go, spkw2go and InterPro2go), and the others are assigned manually with the help of
annotation tools such as GOAnnotator; thus it is hard to evaluate the annotation quality.
There are other genome databases with gene annotation information, such as the
integrated microbial genomes (IMG) system [9] and Integr8 [10]. While useful, the gene
annotation in IMG is created using rather simple methods, namely RPS-BLAST
(reverse position specific BLAST) and bidirectional best hits; the latter is widely thought to
be inaccurate [11], to have low sensitivity [12] and to yield high false positive rates [13]. IMG
also adopts two-level classification schemes such as Pfam and COG. Integr8 likewise
uses annotation from other databases such as InterPro and Pfam.
We have developed a functional classification scheme for prokaryotic genes,
based on both sequence similarity information and genomic neighborhood information
[14]. A key unique feature of this classification scheme is that it classifies genes into
functionally equivalent clusters at multiple resolution levels, and these clusters are either
parallel to each other or nested inside one another, hence giving rise to a multi-level
hierarchical structure, under which genes could have “equivalent” functions measured at
varying resolution. For example, genes in any root-level cluster, in this functional
hierarchy, are functionally equivalent in the sense that they are homologous, and genes in
any lower-level cluster represent a group of functionally equivalent genes with higher
specificity (or higher resolution). The functional equivalence relationships among genes
at different resolutions are derived using a two-level classification scheme [14]. The
algorithm first derives the functional relationships among individual gene pairs based on
their sequence similarity and their co-location information in genomes, and then derives
the functional relationships among a group of genes by detecting the groups of genes with
high densities of pair-wise functional relationships within each group versus the
(relatively) lower densities of relationships between each gene group and genes outside of
the group. For each predicted gene cluster (group), we also provide a p-value to measure
how strongly the cluster stands out against the background in which its genes sit. In some sense, this
value also reflects the consistency of annotation within a gene group, i.e., the annotation
quality.
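The contrast between within-group and outside-group relationship densities can be illustrated with a small sketch. This is not the clustering algorithm of [14]; it only computes the two densities for one candidate group in an unweighted graph, and the function name, the gene labels and the use of unweighted edges are all assumptions made for illustration.

```python
from itertools import combinations


def densities(edges, group):
    """Return (internal, external) edge densities for a candidate gene group.

    edges: iterable of (gene, gene) pairs standing for pairwise functional
           relationships (hypothetical, unweighted).
    group: iterable of genes forming the candidate cluster.
    """
    group = set(group)
    edge_set = {frozenset(e) for e in edges}
    all_genes = {g for e in edge_set for g in e} | group
    outside = all_genes - group

    # Fraction of gene pairs inside the group that are connected.
    internal_pairs = len(group) * (len(group) - 1) // 2
    internal_edges = sum(
        1 for a, b in combinations(sorted(group), 2)
        if frozenset((a, b)) in edge_set
    )

    # Fraction of (inside, outside) gene pairs that are connected.
    external_pairs = len(group) * len(outside)
    external_edges = sum(
        1 for a in group for b in outside
        if frozenset((a, b)) in edge_set
    )

    internal = internal_edges / internal_pairs if internal_pairs else 0.0
    external = external_edges / external_pairs if external_pairs else 0.0
    return internal, external


# Hypothetical toy graph: g1, g2, g3 are tightly linked; g4 and g5 are not.
toy_edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")]
inside, across = densities(toy_edges, ["g1", "g2", "g3"])
```

A group whose internal density is much higher than its external density is the kind of dense sub-graph the text describes; how the actual method scores and tests such groups is specified in [14].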
By applying this classification scheme to genes of 224 prokaryotic genomes, we
have established a database, HCG, of functionally equivalent gene clusters. Intuitively,
the HCG system can be viewed as a “forest” of trees, where each tree consists of a root-
level cluster and its descendent clusters, possibly at different levels. For each cluster in
the HCG system, we have provided an annotation to characterize the common biological
function of the cluster, based on the Gene Ontology (GO) annotation (GOA Proteome
Sets) and NCBI gene-product description. Other information such as Pfam and COG
annotation is also provided for cross-reference purposes.
2. Construction and Content
2.1 The Construction of the Database
The HCG database currently consists of the classification result for 224 complete
prokaryotic genomes (NCBI release of 03/05/2005). While a detailed description of
the clustering algorithm and an analysis of the data are given elsewhere [14],
we outline here the procedure for database construction and application. The HCG
system has been created using the following steps:
(a) All homologous gene pairs are identified using reciprocal BLASTP [15] with e-
values < 1 for both directions of the search against all the 658,174 genes.
(b) The Smith-Waterman algorithm [16] is performed on all homologous gene pairs
selected from (a) to obtain a multi-value feature vector for each homologous
gene pair, representing the quality of their sequence alignment.
(c) A positive training set consisting of orthologous gene pairs as well as a negative
training set consisting of homologous but non-orthologous gene pairs is created
for the purpose of training a classifier (see [14] for details).
(d) A parameterized linear classification function is employed to discriminate
orthologous genes from homologous but non-orthologous genes, whose
parameters are selected so that the classification function optimally
discriminates the positive from the negative training data.
(e) A scoring scheme is developed to measure the functional equivalence between
two genes based on the sequence similarity information derived from (d) and
genomic neighborhood information derived based on three operon prediction
programs, namely (i) VIMSS [17], (ii) JPOP [18, 19], and (iii) GeneChords [20].
(f) A graph representation is constructed to represent all the 658,174 genes from
224 prokaryotic genomes and their functional equivalence relationship defined
in (e).
(g) A graph-partition algorithm is applied to the representing graph of these genes
and their functional relationships to generate a collection of dense sub-graphs
(and sub-sub-graphs, etc), each of which represents a gene cluster. These gene
clusters form a hierarchical structure. For each cluster, a p-value is calculated to
assess its statistical significance.
(h) Each gene cluster is annotated using a set of keywords and GO terms, based on
common features of the NCBI and GO annotations [10] of individual genes of
the cluster, where the keywords are extracted from the NCBI description of each
gene product, and the GO terms for each cluster are selected based on a
majority-rule vote among GO assignments to individual genes in the cluster.
(i) All gene-classification data is integrated into a MySQL database; and a web
server is created at http://csbl.bmb.uga.edu/HCG to facilitate searching and
accessing the database.
The validity of the predicted gene clusters is checked by comparing the HCG
classification against the genome taxonomy, COG classification [1] and Pfam
classification [2] of genes. The detailed validation procedure and results are given in [14].
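Step (a) above can be sketched as a simple filter over BLASTP results. The sketch assumes the hits have already been parsed into a dictionary keyed by (query, subject); that data layout, the function name and the example gene labels are hypothetical, and only the reciprocity requirement and the e-value cutoff of 1 come from the text.

```python
def reciprocal_pairs(hits, max_evalue=1.0):
    """Keep gene pairs that hit each other in both search directions.

    hits: dict mapping (query, subject) -> e-value (assumed, pre-parsed layout).
    Returns a set of unordered pairs, each stored as a frozenset.
    """
    pairs = set()
    for (query, subject), evalue in hits.items():
        if query == subject or evalue >= max_evalue:
            continue  # skip self-hits and hits failing the cutoff
        back = hits.get((subject, query))
        if back is not None and back < max_evalue:
            pairs.add(frozenset((query, subject)))
    return pairs


# Hypothetical e-values: only geneA/geneB pass the cutoff in both directions.
example_hits = {
    ("geneA", "geneB"): 1e-5,
    ("geneB", "geneA"): 1e-3,
    ("geneA", "geneC"): 0.5,
    ("geneC", "geneA"): 2.0,   # reverse direction fails the cutoff
    ("geneB", "geneC"): 1e-10,  # no reverse hit recorded
}
homologous_pairs = reciprocal_pairs(example_hits)
```

The pairs that survive this filter would then be passed to the Smith-Waterman alignment in step (b).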
2.2 Database Tables
To store the tree structure of the HCG system in a MySQL relational database, we have
designed two tables, “Node” and “Edge” (Figure 1), to represent the HCG clusters
and their parent-child relationships. Other information such as gene attributes, cluster
annotation, and the p-value of each cluster is also stored in MySQL tables. Figure 1
shows the relationship among the tables. The table “Gene” is used to store the
information of individual genes, such as gene attributes. The tables “GO”, “Gene_GO”
and “Node_GO” are used to store GO terms, GO annotation for individual genes and GO
term-based annotation for individual clusters, respectively. The table “Gene_Node” is
used to store the genes in each cluster, and the table “Species” is used to store species
information of a genome. There are several additional internal tables that are not
shown in Figure 1 and are omitted from further discussion.
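A node table plus an edge table is a common way to store a forest in a relational database. The following minimal sketch shows the idea using Python's built-in sqlite3 module as a stand-in for MySQL. Only the table names “Node” and “Edge” come from the text; the column names, the helper functions and the recursive query are assumptions, not the actual HCG schema. The cluster codes are those of the HCG-21 tree discussed in Section 3.2.

```python
import sqlite3


def build_demo():
    """Create an in-memory database holding a small part of one HCG tree."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Node (
            node_id  INTEGER PRIMARY KEY,
            hcg_code TEXT UNIQUE,   -- hypothetical column names
            p_value  REAL           -- left NULL in this sketch
        );
        CREATE TABLE Edge (
            parent_id INTEGER REFERENCES Node(node_id),
            child_id  INTEGER REFERENCES Node(node_id)
        );
    """)
    codes = ["21", "21.3", "21.3.0", "21.3.1", "21.2"]
    con.executemany("INSERT INTO Node (hcg_code) VALUES (?)", [(c,) for c in codes])
    links = [("21", "21.3"), ("21.3", "21.3.0"), ("21.3", "21.3.1"), ("21", "21.2")]
    con.executemany(
        "INSERT INTO Edge (parent_id, child_id) "
        "SELECT p.node_id, c.node_id FROM Node p, Node c "
        "WHERE p.hcg_code = ? AND c.hcg_code = ?",
        links,
    )
    return con


def descendants(con, code):
    """Return the codes of all clusters below the given cluster, sorted as text."""
    rows = con.execute("""
        WITH RECURSIVE sub(node_id) AS (
            SELECT node_id FROM Node WHERE hcg_code = ?
            UNION ALL
            SELECT e.child_id FROM Edge e JOIN sub s ON e.parent_id = s.node_id
        )
        SELECT n.hcg_code FROM sub JOIN Node n ON n.node_id = sub.node_id
        WHERE n.hcg_code <> ? ORDER BY n.hcg_code
    """, (code, code)).fetchall()
    return [row[0] for row in rows]


con = build_demo()
below_root = descendants(con, "21")
```

Storing the parent-child links separately from the clusters keeps the tree depth unbounded, at the cost of a recursive query (or repeated lookups) whenever a whole subtree is needed.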
2.3 Information Available at HCG
HCG stores and provides access to the basic information about each gene in its database,
including a gene’s position in a genome, PID, locus tag, chain ID, COG number, gene
product description, gene name, sequence, etc., all extracted from the NCBI database. In
addition, we have run COGNITOR [21] to generate COG numbers for all genes,
including both genes that are functionally assigned by the NCBI database and those that are not. So for the
vast majority of the genes in HCG, we have COG numbers. We have also integrated the
GO annotations and Pfam accession IDs into the HCG database in a similar fashion.
In addition to the information extracted from other data sources, HCG has a large
quantity of its own data. At the highest level, HCG is a forest of trees, each being a
collection of gene clusters that are either parallel to or part of each other. At the top
level of each tree is a cluster containing all genes in the tree, which are homologous to
each other. Each lower-level cluster consists of genes that are functionally more
equivalent than the genes in the parent cluster. For each cluster, we have calculated a
p-value to estimate the statistical significance of the genes in this cluster standing out
as a cluster against the background of other genes [14].
For each gene cluster, we assign its functional annotation using two methods.
First, we assign GO terms to each cluster based on a majority-rule vote using the GO
annotations of individual genes in the cluster [14]. For each HCG cluster in which some
individual genes have been annotated by GOA, one or more consensus GO terms are
generated and used to annotate the cluster. A probability
value is calculated for each of the consensus GO terms, which can be used to assess the
reliability of each function assignment: the higher the probability, the higher the
prediction reliability. Second, we assign text descriptions to each gene cluster, which
are derived from the NCBI gene product descriptions of individual genes and are used to
describe the overall function of the cluster. For each cluster, we calculate a consistency
score between 0 and 1, measuring the consistency among the NCBI descriptions for the
individual genes of the cluster, with 1 representing the most consistent and 0 representing
the least consistent. A detailed description of the algorithm is given in [14]. A user can
use both the cluster GO annotation and the text description to infer the function of genes
assigned to each cluster.
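A majority-rule vote over per-gene GO annotations can be sketched as follows. The exact voting rule and the definition of the probability value are given in [14]; this sketch assumes, for illustration only, that a term is kept when more than half of the annotated genes carry it, and that the reported value is simply that fraction. The function name and the toy gene labels are hypothetical.

```python
from collections import Counter


def consensus_go_terms(gene_annotations, min_fraction=0.5):
    """Return {GO term: fraction of annotated genes carrying it} for majority terms.

    gene_annotations: dict mapping gene -> list of GO term IDs; genes with an
    empty list are treated as unannotated and do not take part in the vote.
    min_fraction: assumed threshold; a term must exceed it to be kept.
    """
    annotated = [set(terms) for terms in gene_annotations.values() if terms]
    if not annotated:
        return {}
    counts = Counter(term for terms in annotated for term in terms)
    total = len(annotated)
    return {
        term: count / total
        for term, count in counts.items()
        if count / total > min_fraction
    }


# Hypothetical cluster of five genes, one of them unannotated.
cluster = {
    "gene1": ["GO:0000155", "GO:0005524"],
    "gene2": ["GO:0000155"],
    "gene3": ["GO:0000155", "GO:0005524"],
    "gene4": [],
    "gene5": ["GO:0003677"],
}
consensus = consensus_go_terms(cluster)
```

In this toy cluster only GO:0000155 is carried by more than half of the annotated genes, so it alone would be used to annotate the cluster, including the unannotated gene4.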
2.4 HCG Data Statistics
The HCG database consists of 658,174 genes from 224 genomes, comprising 376 DNA
chains (both chromosomes and plasmids) from NCBI (release of 03/05/2005). Among the
658,174 genes, 609,887 are assigned HCG codes; 139,495 genes have COG
numbers extracted from the NCBI database, and 459,955 genes are assigned COG
numbers by running COGNITOR [21]. When comparing the COGNITOR-calculated
COG numbers with the NCBI-assigned COG numbers, we noticed that only
108,620 of the 139,495 genes have the same COG numbers, and the other 30,875 genes have different COG
numbers. This inconsistency most likely comes from the multiple COG numbers returned
by COGNITOR. In addition, 318,326 genes have been assigned GO terms in [10].
HCG has 51,205 clusters of genes (clusters, sub-clusters, sub-sub-clusters, etc. are each
numbered consecutively in an arbitrary manner), organized into 21,109 HCG
trees. Among these trees, 2,092 have more than 50 genes, totaling 518,703 genes;
10,716 trees are annotated with text descriptions, covering 568,717 genes; 4,877 trees are
annotated with cluster GO terms, covering 500,996 genes; and 4,330 trees have both cluster
GO terms and text descriptions, covering 497,350 genes. In total, 182,670 genes that are not
annotated in Integr8 [10] are successfully annotated by HCG; and for those genes that are
annotated by both HCG and Integr8, most are annotated with more specific GO
terms in HCG than in Integr8. By combining the text description and the cluster GO
annotation, a clear function description of each gene can be inferred.
The HCG database is implemented using MySQL 4.0.18, running on a SuSE 9.0
Linux computer with 4GB memory and two 2.8GHz Xeon processors. A web interface,
hosted by an Apache 2.0.40 web server, has been developed to facilitate access to the
database through the Internet. The PHP server-side scripting language is used to create dynamic
web pages. The response time for browsing most pages of the HCG database server is
less than one second, while the response time of the “query” page depends on the
complexity of the query and is typically within a couple of seconds.
3. Utility and Discussion
3.1 Web Access
The HCG database can be accessed at http://csbl.bmb.uga.edu/HCG. A user can retrieve
data using one of the following four methods. The first is to browse HCG in a
hierarchical way. The user can start from the virtual root of the “forest” to list all the trees.
From this list, the user can select a tree to browse, and then go to its
descendant clusters. The second method is to browse the gene annotation for each species. The
user can select a specific species and a chain, and browse the HCG annotation page by
page. The third method is to search the HCG database for genes using keywords selected
from a pre-prepared list of fields. The user can specify the value of any gene attribute,
such as the words in the product description, the HCG number of the genes, or a species
name. The user can also combine these conditions using “AND”
and “OR”. In the fourth method, the user can submit his/her own protein sequence to the
server to find the related HCG IDs, and then annotate the sequence using the GO numbers and
text descriptions associated with the returned HCG IDs. Figure 2 shows a workflow for
page browsing and a few screenshots of HCG in use.
3.2 Gene Annotation at Multiple Resolutions by HCG
As discussed in [14], the multi-level classification scheme provides substantially more
information than the one- or two-level classification schemes such as COG [1] and
Pfam [2].
Figure 3 shows the structure of the HCG tree rooted at cluster “HCG-21” and its
descendent clusters. Among the 1,294 genes included in cluster HCG-21, 1,089 genes are
assigned with GO terms; and 98.3% and 97.6% of the 1,089 genes are annotated as
GO:0000155 (two-component sensor activity) and GO:0005524 (ATP binding activity),
respectively. Hence the biological functions of the HCG-21 genes can be summarized
using GO:0000155 and GO:0005524; and those HCG-21 genes without an identified
biological function are predicted to have the biological functions defined by the cluster,
i.e., GO:0000155 and GO:0005524.
Compared with these GO annotations assigned to the root-level cluster, the hierarchical
structure of HCG-21 provides much richer functional information for genes in the
lower-level sub-clusters of this cluster. For example, a large portion of genes in HCG-21 are
further partitioned into 38 child-level clusters labeled “HCG-21.0” to “HCG-21.37”.
The numbers of genes in these clusters range from 3 to 91. Almost all of these child-level
sub-clusters are annotated with more specific functions, using GO terms and NCBI-based
text descriptions, than their parent cluster “HCG-21”.
As we demonstrate using the following examples, genes in the same child cluster do
have a stronger functional relationship than genes in the parent
cluster. Cluster “HCG-21.0” contains 91 kdpD genes, all of which are the sensor genes
for the high-affinity potassium transport system; and cluster “HCG-21.4” contains 46 phoR
genes, which are all sensor genes in the phosphate regulons. Some of the other child-level
clusters each contain genes of similar but distinct biological functions, which are
then further divided into a group of grandchild-level sub-clusters containing genes with
equivalent functions at higher resolution. For example, the cluster “HCG-21.3”
contains 49 genes annotated as either “cpxA” (the envelope stress sensor genes) or “envZ”
(the osmolarity sensor genes). At its child level, the genes of “HCG-21.3” are further
grouped into two smaller sub-clusters, “HCG-21.3.0” and “HCG-21.3.1”, which contain
“cpxA” and “envZ” genes, respectively. The fact that these “cpxA” and “envZ” genes are
grouped in the same cluster “HCG-21.3” suggests that the cpxA and envZ genes are more
equivalent to each other than they are to other genes, which is supported by their NCBI
annotation, where both “cpxA” and “envZ” genes are annotated as sensing extracellular
pressure, with “envZ” genes sensing the pressure from water (i.e., osmolarity). The same
can be said about another child-level cluster “HCG-21.2”, which contains 52 genes
annotated as either “vanS” or “resE”. At the grandchild level, these “vanS” and “resE”
genes are further grouped into two smaller clusters, “HCG-21.2.0” and “HCG-21.2.1”,
which contain “vanS” and “resE” genes, respectively. Among the 1,294 HCG-21 genes,
689 cannot be further grouped into lower-level clusters, suggesting that these genes can
only be annotated at low resolution, i.e., “two-component sensor activity” and “ATP
binding activity”, because of the high functional diversity of these genes.
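The dotted cluster labels used above (HCG-21, HCG-21.3, HCG-21.3.0) encode the path from the root of a tree to a cluster, so the parent clusters of any cluster can be read directly from its code. The helpers below are a small sketch of that reading; the function names are hypothetical, and the handling of an optional “HCG-” prefix and of a trailing dot is an assumption about how the codes may be written.

```python
def normalize(code):
    """Strip an optional 'HCG-' prefix and any trailing dot from a cluster code."""
    code = code.strip()
    if code.upper().startswith("HCG-"):
        code = code[4:]
    return code.rstrip(".")


def ancestors(code):
    """Return the codes of all parent clusters, root first.

    For '21.3.0' this gives ['21', '21.3'].
    """
    parts = normalize(code).split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(code, other):
    """True if cluster `code` lies below cluster `other` in the same tree."""
    return normalize(other) in ancestors(code)


cpxa_lineage = ancestors("HCG-21.3.0")
same_branch = is_descendant("HCG-21.3.1", "HCG-21.3")
```

Under this reading, the cpxA cluster HCG-21.3.0 and the envZ cluster HCG-21.3.1 share the parent HCG-21.3, while the vanS and resE clusters share HCG-21.2, and all four share the root HCG-21.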
Interestingly, while the annotations derived from NCBI descriptions match well with
our gene clusters, the GO annotations we derived from the GOA database are not as
specific. For example, most genes in cluster “HCG-21” are assigned two GO terms,
GO:0000155 (two-component sensor activity) and GO:0005524 (ATP binding activity),
so we cannot make any more specific GO assignment for any of the offspring clusters of
“HCG-21”. However, since we have used different information sources in our
gene/cluster annotation, we have achieved annotations with higher specificity. This also
indicates that to get more specific gene function annotation, one should look at more
information sources. It should be noted that although our GO-based and NCBI-based
annotations do not conflict, in general the GOA-based annotations are not as specific
as the NCBI-based ones.
3.3 Application Examples
We now illustrate how to use the HCG database and demonstrate the power of the HCG
system for functional prediction of genes, using the following examples.
Example 1: find the function of a gene. Suppose we want to find out the function of
gene “GI-16801886” of Listeria innocua Clip11262. The gene product is labeled as a
“hypothetical protein” in the NCBI database. The COG number of this gene is COG0745,
which represents the gene class of “response regulators consisting of a CheY-like
receiver domain and a winged-helix DNA-binding domain”. Clearly, this annotation is
not particularly useful as there are 3,119 genes assigned with this COG number across the
224 genomes covered by HCG. The GO annotation of this gene is GO:0000156 (two-
component response regulator activity) and GO:0003677 (DNA binding), which is not
very specific either as 3,866 genes in HCG are annotated with both GO terms. To use
the HCG system to derive more specific functional information of this gene, a user can
use the following steps.
1) Go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link
“Search” to bring up the “Query Builder” page.
2) Fill the query information with “GI == 16801886”, and leave the other entries
blank. Then click “Submit” to query the database.
3) The search will return the HCG code of gene 16801886 as “10.3.0” in the result
page. Then click the link “10.3.0” to bring up the annotation page for this HCG
cluster.
4) In the annotation page of “10.3.0”, the gene name for “10.3.0” is “kdpE”, and
there are also four descriptions of the specific function of “10.3.0”: i) “kdp
operon transcriptional regulatory protein kdpE”; ii) “two-component regulatory
protein response regulator kdpE”; iii) “putative turgor pressure regulator”; iv)
“probable transcriptional regulator”. The first two annotations indicate a more specific
function while the other two indicate a general function. HCG has extracted all
four descriptions because the scores for all of them are above our threshold.
Clearly HCG provides much more specific functional information about this gene than
the other functional classification databases.
Example 2: find a gene that carries a specific function. Suppose we want to find out
which gene in Vibrio fischeri ES114 encodes the protein “bioA”, which is important in
biotin synthesis. Another name for bioA is “7,8-diaminopelargonic acid
synthetase”. To find “bioA” in “Vibrio fischeri ES114”, the user needs to do the following.
1) First we need to find out which HCG cluster represents “bioA” genes. To do this,
go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link
“Search” to bring up the “Query Builder” page.
2) Set the query information to “(Gene == bioA) OR (Product include 7,8-diaminopelargonic)”,
and leave the other entries blank. To construct the query, the user needs to click the checkboxes
corresponding to “(” and “)” in condition1 and condition2, and to select the
radio button corresponding to “Or” in condition2. Then click “Submit” to
query the database.
3) The user should now see the returned gene cluster labeled as cluster “69”, and
some of its genes further clustered into “69.1”, “69.4”, “69.5” and “69.8”, etc.
Many of these genes are annotated as “adenosylmethionine-8-amino-7-
oxononanoate (KAPA) aminotransferase”. It should be noted that the reactant of
“7,8-diaminopelargonic acid”(DAPA) synthesis reaction is “7-keto-8-
aminopelargonic acid” (another name of KAPA). Since these genes are from
several different bacterial genomes, one needs to find the gene in the right
genome. The user should click the link “69” to bring up the annotation page of this
cluster.
4) By checking the annotation page of HCG cluster “69” and the annotation
pages of some of its children, such as “69.1”, “69.4”, “69.5” and “69.8”, the user can see that
the child clusters are annotated as “bioA”. By checking the genes in the
child clusters, the user should be able to see why they are further clustered:
the genes in the same cluster come from more closely related species.
5) Therefore one can determine that some child clusters of “69” are related to
“bioA”, and their parent cluster “69” might include “bioA” homologs. Now the
user should go back to the “Query Builder” page (via the “Search” link on the HCG
main page at http://csbl.bmb.uga.edu/HCG), enter the query “(HCG
Begin_With 69.) AND (Species_Name include Vibrio fischeri ES114)”, and
submit.
6) In the result page, the user should be able to see three genes, NCBI:59712891,
NCBI:59713931 and NCBI:59714306, in cluster “69”. Their HCG codes are
“69.2”, “69.1.0.0” and “69.6”, respectively. Among
these three HCG clusters, only “69.1.0.0” is annotated as “bioA”, so the user should be
able to confidently conclude that gene NCBI:59713931 encodes “bioA” in
Vibrio fischeri ES114, and that its enzyme name is either “7,8-diaminopelargonic
acid (DAPA) synthetase” or “adenosylmethionine-8-amino-7-oxononanoate
(KAPA) aminotransferase”.
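The combined condition used in step 5, “(HCG Begin_With 69.) AND (Species_Name include Vibrio fischeri ES114)”, amounts to a prefix match on the HCG code together with a substring match on the species name. The sketch below applies the same logic to a small in-memory list. The record layout and function name are hypothetical and this is not the server's implementation; the gene identifiers, HCG codes and species names are the ones quoted in Examples 1 and 2.

```python
# Records taken from Examples 1 and 2; the dictionary keys are assumed names.
GENES = [
    {"gene_id": 59712891, "hcg": "69.2", "species": "Vibrio fischeri ES114"},
    {"gene_id": 59713931, "hcg": "69.1.0.0", "species": "Vibrio fischeri ES114"},
    {"gene_id": 59714306, "hcg": "69.6", "species": "Vibrio fischeri ES114"},
    {"gene_id": 16801886, "hcg": "10.3.0", "species": "Listeria innocua Clip11262"},
]


def query(genes, hcg_begins_with, species_includes):
    """Return genes whose HCG code starts with a prefix AND whose species matches."""
    return [
        gene for gene in genes
        if gene["hcg"].startswith(hcg_begins_with)
        and species_includes in gene["species"]
    ]


hits = query(GENES, "69.", "Vibrio fischeri ES114")
hit_ids = [gene["gene_id"] for gene in hits]
```

Including the trailing dot in the prefix “69.” matters: it restricts the match to clusters below cluster 69 and would not match a different tree whose number merely starts with the digits 69.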
Example 3: annotate the function of new genes from a newly sequenced genome. Two new
cyanobacterial genomes have been recently sequenced by Grossman’s lab (personal
communication), and these genomes are not included in the current release of HCG. Here we
use gene NCBI:86604767 as an example to illustrate how to use HCG to annotate the
function of a new gene.
1) Go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link
“MyHCG” to bring up the sequence input page;
2) Enter the sequence of gene NCBI:86604767, and click “submit”;
3) HCG returns 10 genes in the database as hits with cluster “6514.” ranked as the
No. 1 hit.
4) Click the link “6514.” to open the annotation page for this cluster, where
the description is “photosystem I subunit XI” and the gene name is “psaL”;
5) The user can also click the link “Display Hit Genes” to display all the hit genes;
and the descriptions for these genes are “photosystem I subunit XI” or
“photosystem I reaction center subunit XI”;
6) The function information obtained from both 4) and 5) can be used to annotate the
gene NCBI:86604767.
A user can also send the sequence to COGNITOR. For this example, it returned “NO
related COG”, suggesting that COG does not have an annotation for this gene. We have also sent the
sequence to the Pfam server, which returned “PF02605”, representing “Photosystem I
reaction centre subunit XI”, which is consistent with the HCG annotation. We noted that
KEGG does not allow such data retrieval.
4. Conclusion
We have developed a database, HCG, for hierarchical classification of functionally
equivalent genes, which can be used to annotate genes at multiple resolutions, depending
on the availability of related data. The HCG system is based on a new method for
predicting functional relationships by combining information on sequence
similarity and genomic context. The hierarchical organization of genes, grouped together
with other functionally equivalent genes, facilitates functional annotation of new genes
with higher accuracy compared to other functional classification schemes. We plan to
extend this system to include all complete prokaryotic genomes in the very near future,
and to update it on a regular (monthly) basis. We expect that this new system for gene
annotation will provide a powerful tool for genome analysis and annotation to the
biological community.
Availability and requirements
The database can be accessed at http://csbl.bmb.uga.edu/HCG. Users who want to
analyze the whole database can download the classification data at
http://csbl.bmb.uga.edu/HCG/HCG.tar.gz. The database is freely available to academic
users; non-academic users should contact the corresponding author to obtain a license.
Any modern Internet browser should be capable of using the online database server.
Authors' contributions
Fenglou Mao designed the database and implemented the online server; Fenglou Mao and
Hongwei Wu worked together to generate the data of HCG; Victor Olman designed the
hierarchical clustering program; Ying Xu coordinated the whole procedure and provided
the financial support.
Acknowledgement
This work was supported in part by National Science Foundation (NSF/DBI-0354771,
NSF/ITR-IIS-0407204, NSF/DBI-0542119) and by a “Distinguished Scholar” grant from
the Georgia Cancer Coalition.
References
1. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631-637.
2. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-251.
3. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L et al: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database issue):D201-205.
4. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277-280.
5. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33(Database issue):D334-337.
6. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691-5702.
7. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database issue):D258-261.
8. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262-266.
9. Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I et al: The integrated microbial genomes (IMG) system. Nucleic Acids Res 2006, 34(Database issue):D344-348.
10. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I et al: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33(Database issue):D297-302.
11. Fulton DL, Li YY, Laird MR, Horsman BG, Roche FM, Brinkman FS: Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics 2006, 7:270.
12. Wall DP, Fraser HB, Hirsh AE: Detecting putative orthologs. Bioinformatics 2003, 19(13):1710-1711.
13. Mao F, Su Z, Olman V, Dam P, Liu Z, Xu Y: Mapping of orthologous genes in the context of biological pathways: An application of integer programming. Proc Natl Acad Sci U S A 2006, 103(1):129-134.
14. Wu H, Mao F, Olman V, Xu Y: Hierarchical Classification of Functionally Equivalent Genes of Prokaryotes. Nucleic Acids Res 2007, accepted.
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
16. Smith TF, Waterman MS: Comparison of biosequences. Advances in Applied Mathematics 1981, 2(4):482-489.
17. Price MN, Huang KH, Alm EJ, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 2005, 33(3):880-892.
18. Chen X, Su Z, Dam P, Palenik B, Xu Y, Jiang T: Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res 2004, 32(7):2147-2157.
19. Chen X, Su Z, Xu Y, Jiang T: Computational Prediction of Operons in Synechococcus sp WH8102. Proceedings of 15th International Conference on Genome Informatics 2004:211-222.
20. Zheng Y, Anton BP, Roberts RJ, Kasif S: Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005, 6:243.
21. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28(1):33-36.
Figure 1: HCG database table relationship
Figure 2: A screenshot of the HCG browser
Figure 3: The tree structure of cluster HCG-21, consisting of a group of two-component sensors. A circle represents a cluster which cannot be further divided; a rectangle represents a cluster containing only genes from the same genome; a triangle represents a cluster that does not have genes from the same genome. Colors do not have any particular meaning here.