

Vol. 12 no. 2 1996
Pages 145-150

An extensible network query unification system for biological databases

D.Curtis Jamison1,3, Brad Mills1 and Bruce Schatz2

Abstract

Database federation enables biological researchers to utilize resources more effectively, creating an environment in which the researcher can query multiple data sources without spending time learning new query mechanisms or issuing redundant queries which need to be integrated. Several mechanisms exist to federate databases. The ENQUire system is a network database federation system which uses a World-Wide-Web (WWW) interface to connect the users to various databases. Generic queries entered via a query generator form are sent in parallel to multiple databases, and the results are presented to the user in a unified format. All forms building, query generation, and results translation are done on the fly, and individual database translation modules can be added dynamically. ENQUire is a flexible answer to the problems of database federation on the WWW.

Introduction

A revolution is occurring in the biological sciences. Genome sequencing projects are under way for a variety of organisms, including the nematode (Wilson, 1994), yeast (Levy, 1994), and human genomes (Olson, 1993), and the first complete genome (Haemophilus influenzae) (Fleischmann et al., 1995) has recently been published. Sequence data is growing exponentially, and as sequencing projects proceed, the nature of bench work will change, becoming increasingly dependent upon information retrieval.

Currently, biological researchers rely heavily upon sequence databases like GenBank, PIR, Swiss-Prot and PDB which contain sequence information for all species; literature databases like MEDLINE and Biosys; and species-specific genetic databases like GDB, FlyBASE, and ACeDB. While most of these sources are easily accessible by ordinary computers, they are accessed by disparate query mechanisms, forcing the researcher to understand several different query languages and making the integration of results from different databases difficult. The federation of biological databases is an important step toward providing a single biological dataspace.

1 National Center for Supercomputing Applications and 2 Community Systems Laboratory, National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Urbana, IL 60801, USA
3 To whom correspondence should be addressed. Present address: National Agriculture Library, USDA, 10301 Baltimore Blvd., Beltsville, MD 20705, USA. E-mail: [email protected]

A large obstacle to database federation is the variety of data models and schemas used by the various databases. Data models range from flat-file formats left over from the days of card decks to the latest in object-oriented technology. Schemas differ greatly both in data representation and in data emphasis. The challenge to database federation becomes one of forging links or combining databases without losing information.

There are two general approaches to database federation. The warehouse approach brings the databases to one location, usually producing a tight federation by producing one or more secondary databases. Examples of the warehouse approach are systems like IGD, ENTREZ and OWL. These are non-redundant federations which transform heterogeneous data models and schemas into a single format (Bleasby and Wooten, 1990; Boguski, 1994; Ritter, 1994).

Alternatively, a virtual federation approach produces a loose federation by creating links between related database entries. Examples of this approach are systems like DBGET and SRS. DBGET relies upon a secondary database of similarity links (Akiyama and Kanehisa, 1995), while SRS creates indices to provide cross-references (Etzold and Argos, 1993).

The Worm Community System (WCS), a community system for C.elegans researchers, used the warehouse approach to federate information from ACeDB, literature sources, and informal (uncurated) user submissions (Shoman et al., 1995). WCS was a pioneer federation system, using a very tight federation to produce a uniform user environment and dataspace. Despite the success of the user interface, there were several drawbacks in the database used to support the front end. The primary problem involved updating the federation, especially as the ACeDB data schema changed through multiple iterations. When the decision was made to migrate from the custom, single-use WCS user interface to a more generic WWW-based interface, a virtual federation using remote data sources was pursued.

© Oxford University Press

D.C.Jamison, B.Mills and B.Schatz

[Figure 1: flow diagram. Client; ENQUire gateway (Query Translator, Results Translator); Communications; Database 1, Database 2.]

Fig. 1. ENQUire Query Flow. The client submits a generic query to the ENQUire gateway (center box), which translates the query into database-specific queries. These queries are then routed to the communications module which transmits them to the target databases. The results returned to the gateway are translated and returned to the client. Each box represents a program or a database server residing on a different computer system.

The resulting technology proved to be a relatively simple answer to database query federation on the WWW. The system was extended beyond the data sources originally found in WCS, and the result is the ENQUire (Extensible Network Query Unifier) system (URL: http://csl.ncsa.uiuc.edu/ENQUire), a scalable HTTP gateway for querying distributed databases. ENQUire provides database query federation at the syntactic level, and creates a browsable dataspace in which databases are linked at the semantic level. The federation is done on the fly, and the method of adding databases into the system is simple enough to allow new databases to be added quickly (two to three days, on average).

System and methods

ENQUire is a CGI application written in perl (Wall and Schwartz, 1991). It requires a forms-capable WWW browser such as Mosaic or Netscape as a client, and it relies upon translation libraries to translate from a generic query language into database-specific query languages. Translated queries are submitted to the appropriate database, then the responses are collected and collated prior to being interpreted into HTML format and returned to the user. This flow is illustrated in Figure 1.
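The flow described above (translate, submit, collect, render) can be sketched in a few lines. ENQUire itself is written in perl; the Python below is only an illustration, and every name in it (`run_gateway`, the dictionary keys, the `ToyDB` stand-in and its canned reply) is an assumption made for this example rather than part of the ENQUire code.

```python
from html import escape
from typing import Callable, Dict, List

# Hypothetical per-database bundle: the key names below are illustrative only.
Module = Dict[str, Callable]


def run_gateway(generic_query: dict, modules: Dict[str, Module]) -> str:
    """Translate, submit, collect, and render results for each selected database."""
    sections: List[str] = []
    for db_name, module in modules.items():
        native_query = module["translate_query"](generic_query)  # generic -> native
        raw_reply = module["communicate"](native_query)          # submit to database
        names = module["translate_results"](raw_reply)           # reply -> object names
        items = "".join(f"<li>{escape(name)}</li>" for name in names)
        sections.append(f"<h2>{escape(db_name)} Results</h2>\n<ol>{items}</ol>")
    return "\n".join(sections)


# Toy stand-in for a real database server (no network access, canned reply).
def toy_translate_query(query: dict) -> str:
    return f"find {query['target']} where {query['field']} = {query['term']}"


def toy_communicate(native_query: str) -> str:
    return "C05E2.1\nEM:CEDAF4LAR" if "daf-4" in native_query else ""


def toy_translate_results(raw_reply: str) -> List[str]:
    return [line for line in raw_reply.splitlines() if line]


TOY_MODULES = {
    "ToyDB": {
        "translate_query": toy_translate_query,
        "communicate": toy_communicate,
        "translate_results": toy_translate_results,
    }
}

page = run_gateway({"target": "Sequence", "field": "Gene", "term": "daf-4"}, TOY_MODULES)
print(page)
```

Keeping the three steps behind separate callables mirrors the separation between query translator, communications module, and results translator shown in Figure 1.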

Perl5 was chosen as the implementation language based on the strength of the string intrinsics, which are better than those found in many languages. Perl5 also offers object-oriented programming. A third factor in favor of perl5 is the interpretive nature of the language which, at a minor speed cost, allows for the dynamic addition of translation modules without the need for recompilation. Finally, perl is platform independent, allowing the gateway to be installed on any UNIX system without recompilation.

Table I. Required subroutines

Subroutine, action, and parameters:

<DB>_qt
    Action: Translates from generic query to database-specific query
    INPUT: search list, class list, term list
    OUTPUT: query list

<DB>_rt
    Action: Translates and summarizes results into HTML anchors pointing at object server
    INPUT: result string
    OUTPUT: result list

<DB>_os
    Action: Translates a link to a specific database object into a corresponding query
    INPUT: class string, term string
    OUTPUT: query string

<DB>_ot
    Action: Translates a single item into perl5 object
    INPUT: result string
    OUTPUT: HTML string

<DB>_return_class
    Action: Returns a comma separated list of all possible return types
    INPUT: none
    OUTPUT: class list

<DB>_query_class
    Action: Returns a comma separated list of all possible search classes or types
    INPUT: none
    OUTPUT: class list

<DB>_cm
    Action: Optional communications module
    INPUT: query list
    OUTPUT: result string

These subroutines are required to create a translation protocol library. The action performed by the subroutine is indicated, along with the required parameters and the required output. The reference '<DB>' indicates the name of the database, e.g. the ACeDB translator file 'ACeDB_bpl.pl' contains the subroutines 'ACeDB_qt' etc.

[Figure 2: three NCSA Mosaic screenshots. Left: ENQUire database selection form. Center: ENQUire Query Builder form. Right: results page for Class Sequence with query Gene = daf-4, listing ACeDB Results (2 records) and Entrez Results (1 record).]

Fig. 2. An ENQUire Query Session. The left screen shows the database selection form. Checkboxes are used to select databases. The center screen shows the query generation form. The search types are calculated as the intersection of the search types available from all the databases. The right screen shows the results page, which presents the names of the objects found by the search, grouped by database. Each entry is a hyperlink to the object server program, which creates the query needed to fetch the desired object.

Each translation module defines five required subroutines and one optional subroutine. These are briefly defined in Table I. Within the confines of the defined interfaces, any programming approach for the routines is supported. Information regarding the authoring of translation modules can be found at http://csl.ncsa.uiuc.edu/ENQUire.

Implementation

A sample ENQUire query session is shown in Figure 2. A database selection screen is presented to the user (left screen), in which check boxes are selected. The intersections of the databases' return types and query types are computed, and a query-builder form is returned to the user (center screen). The user can build a complex query of up to seven terms, nested by parentheses and linked by either AND or OR. The search results are presented to the user in a unified format (right screen), with the source database and the links clearly defined.
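The intersection step can be illustrated with a short sketch. It assumes, following Table I, that each database reports its classes as a comma separated string; the function name `common_classes` and the sample class lists are invented for the example.

```python
from typing import List


def common_classes(class_lists: List[str]) -> List[str]:
    """Intersect comma separated class lists, one string per selected database."""
    sets = [
        {name.strip() for name in text.split(",") if name.strip()}
        for text in class_lists
    ]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Invented class lists for two selected databases.
selected = ["Gene,Person,Sequence", "Sequence, Person, Paper"]
print(common_classes(selected))
```

Only classes offered by every selected database survive, which is why selecting many disparate databases narrows the choices on the query-builder form.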

[Figure 3: NCSA Mosaic screenshot of an ACeDB entry, with a WBG article page ("Cloning the daf-4 gene") overlapping it.]

Fig. 3. ACeDB Entry. The reference portion of the ACeDB entry for the gene daf-4, as retrieved by ENQUire from the WWW ACeDB server at USDA/NAL. The server delivers a perl5 object, which is translated into HTML with the links pointing at the ENQUire object server script.

The query federation process illustrated in Figure 2 is at the syntactic (field names) level. Basic biological types are given common names, and the common names are translated into the proper terminology for a specific database. The field name mapping is specific by database, and sometimes specific by position within the database. For example, within ACeDB, a person object is referred to by several field names depending upon the context (e.g. 'author' within document objects, 'mapper' within sequence objects).
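The context-dependent naming described above amounts to a lookup keyed on database, generic field, and target class. The sketch below is one possible shape for such a table; the two ACeDB entries come from the example in the text, while the table layout, the class labels `Document` and `Sequence`, and the function `native_field` are assumptions.

```python
from typing import Dict, Tuple

# (generic field, target class) -> native field name, per database.
# The two ACeDB rows follow the example in the text; the layout is hypothetical.
FIELD_MAP: Dict[str, Dict[Tuple[str, str], str]] = {
    "ACeDB": {
        ("Person", "Document"): "author",
        ("Person", "Sequence"): "mapper",
    },
}


def native_field(database: str, generic_field: str, target_class: str) -> str:
    """Translate a generic field name into the name used in a given context."""
    try:
        return FIELD_MAP[database][(generic_field, target_class)]
    except KeyError:
        raise LookupError(
            f"no mapping for {generic_field!r} in {target_class!r} objects of {database!r}"
        ) from None


print(native_field("ACeDB", "Person", "Document"))
print(native_field("ACeDB", "Person", "Sequence"))
```

Raising an error for an unmapped combination, rather than passing the generic name through, avoids sending a field name the target database does not understand.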

Following links from the results summary will connect the user with pages drawn from the original databases. A sample set of results pages is shown in Figures 3 and 4. In Figure 3, the daf-4 gene entry from ACeDB has been retrieved and scrolled down to view the references. Each link (underlined text) in the entry points at an object server specific for that database, which creates the proper query to retrieve and translate objects from that database. The object server can redirect requests, retrieving an object from a more canonical source, creating a semantic (field value) level federation.
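A redirecting object server can be pictured as a small routing table consulted before the fetch. The sketch below assumes a link key of the form Database-Class-Name and a redirect table keyed on database and class; the key format, the class name `Paper`, and both function names are assumptions made for illustration, not details confirmed by the text.

```python
from typing import Dict, Tuple

# Hypothetical redirect table: (database, class) -> preferred source.
# The class name "Paper" is an assumption made for this example.
REDIRECTS: Dict[Tuple[str, str], str] = {("ACeDB", "Paper"): "WBG"}


def parse_object_key(key: str) -> Tuple[str, str, str]:
    """Split 'Database-Class-Name'; the name itself may contain hyphens."""
    parts = key.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed object key: {key!r}")
    return parts[0], parts[1], parts[2]


def resolve_source(key: str) -> Tuple[str, str, str]:
    """Return (source, class, name), swapping in a more canonical source if listed."""
    database, object_class, name = parse_object_key(key)
    source = REDIRECTS.get((database, object_class), database)
    return source, object_class, name


print(resolve_source("ACeDB-Locus-daf-4"))
print(resolve_source("ACeDB-Paper-example"))
```

Splitting on only the first two hyphens keeps object names such as daf-4 intact.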

Figure 4 shows the result of retrieving a Worm Breeder's Gazette article from the ACeDB object server. A redirection has occurred, and the HTML page from the WBG archives at NCSA has been retrieved, as opposed to the normal ACeDB object for the same item, shown in Figure 5. The ACeDB version of the text is not hyperlinked, and is presented as a database record rather than as a document. The WBG documents in the NCSA archive are a better presentation of the data. While this example is somewhat superficial, the same simple semantic federation can be used to follow sequence accession numbers between literature and sequence databases. The net result of redirecting the fetches is an apparent increase in the dataspace seen by the user.

Fig. 4. HTML Version of a WBG Article. The object server script for ACeDB redirected the WBG article request to the HTML version of the WBG at NCSA. The internal links in the hypertext document point at the ACeDB object server, since all the links refer to C.elegans genes.

[Figure 5: two screenshots of the ACeDB record for the same WBG article, one with bibliographic data and one with plain text.]

Fig. 5. ACeDB Version of a WBG Article. These two screens illustrate the standard ACeDB version of WBG articles and other text documents. The first page shows bibliographic data, and the next shows the plain ASCII text without any hyperlinks.

The databases available in the initial release of ENQUire reflect both the WCS roots of ENQUire and a sampling of important molecular biology databases. Five databases are available: ACeDB and the WBG archive (from the WCS roots), as well as SolGenes, Entrez, and PIR.

ACeDB and SolGenes are organism-specific genome databases based upon ACEDB (Durbin and Thierry-Mieg, 1991). ACeDB contains information about the nematode C.elegans, while SolGenes contains information about the family Solanaceae (potatoes, tomatoes, and peppers). The databases are queried using the perl module Aceclient.pm, accessing the aceserver maintained at the National Agriculture Library, and results are formatted into HTML using the perl module AceWWW.pm (Barnett and Bigwood, 1995).

Entrez federates several biological sequence databases and a literature reference database (Boguski, 1994). The sequence portion of Entrez contains sequences of multiple organisms from GenBank, EMBL, and Swiss-Prot, while the references portion consists of the molecular sequence data subset of Medline. Entrez is accessed using nclever, a command line interface to the network Entrez server (Rioux et al., 1994).

PIR (Protein Information Resource) is a collection of macromolecular sequences. PIR is accessed via the NBRF program ATLAS.

The WBG archive is a full-text and figures hypertext version of the Worm Breeder's Gazette, an informal newsletter for C.elegans researchers published by the Caenorhabditis Genetics Center. The articles are stored locally at NCSA, and are indexed and searched via a Glimpse server (Manber and Wu, 1994).

Several of these data sources do not support complex queries. The ACEDB-based databases and PIR support operations upon a single keyset. To allow the user to create complex queries involving boolean operations on multiple sets, each atom (i.e. each target-field-term combination) can be queried separately, and the results are combined logically by the ENQUire server.
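Combining separately fetched atoms reduces to set algebra on the gateway side. The sketch below represents a query as nested tuples and evaluates it with set intersection and union; the tuple representation, the `evaluate` function, and the toy index are all assumptions made for illustration.

```python
from typing import Callable, Dict, Set, Tuple

Atom = Tuple[str, str, str]  # (target class, field, term)


def evaluate(expr: tuple, fetch: Callable[[Atom], Set[str]]) -> Set[str]:
    """Evaluate ('AND'|'OR', left, right) trees whose leaves are atoms."""
    is_operator_node = (
        len(expr) == 3 and expr[0] in ("AND", "OR") and isinstance(expr[1], tuple)
    )
    if is_operator_node:
        operator, left, right = expr
        left_hits = evaluate(left, fetch)
        right_hits = evaluate(right, fetch)
        return left_hits & right_hits if operator == "AND" else left_hits | right_hits
    return set(fetch(expr))  # leaf: one single-keyset query against the database


# Toy index standing in for a database that only answers one atom at a time.
TOY_INDEX: Dict[Atom, Set[str]] = {
    ("Sequence", "Gene", "daf-4"): {"C05E2.1", "EM:CEDAF4LAR"},
    ("Sequence", "Person", "Person A"): {"C05E2.1"},
}


def toy_fetch(atom: Atom) -> Set[str]:
    return TOY_INDEX.get(atom, set())


gene = ("Sequence", "Gene", "daf-4")
person = ("Sequence", "Person", "Person A")
print(sorted(evaluate(("AND", gene, person), toy_fetch)))
print(sorted(evaluate(("OR", gene, person), toy_fetch)))
```

Nested tuples stand in for the parenthesized terms of the query-builder form; each leaf costs one round trip to the database.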

Performance of the ENQUire system is acceptable. The time required to return a result was compared against the general WWW ACeDB server at USDA/NAL and the WWW Entrez browser at NCBI. PIR and WBG have no corresponding WWW servers. The ACeDB WWW server averaged 6 s for response to an author query, while ENQUire using the Aceclient.pm mechanism returned results in under 2 s. ENQUire queries using nclever to query Entrez are slower than using the Entrez WWW page directly, but this is probably due to very different mechanisms for performing the query. In both cases, performing a complex query is a multi-step process, and thus is not directly comparable.

A prototype of the ENQUire system was made available to C.elegans researchers in early May of 1995. The response from the community was enthusiastic. In the first 4 months following the announcement, 387 search queries were posted, representing 128 sites. This represents researchers using the system, and does include curiosity inquiries.

The ENQUire system has been designed as a portable gateway, requiring only a forms-capable HTTP server and a Perl5 interpreter. Each of the communications modules is configurable to point at any source, allowing for installation at mirror sites. The gateway code and installation documentation are available by anonymous ftp from csl.ncsa.uiuc.edu in the /pub/ENQUire/Gateway directory. Documentation for authoring translation libraries is available both on the WWW at URL http://csl.ncsa.uiuc.edu/ENQUire and by anonymous ftp from csl.ncsa.uiuc.edu in the /pub/ENQUire directory.

Discussion

The goal of this research has been to create a virtual federation of biological database resources across the Internet. The ENQUire system has been demonstrated to be a successful query integration mechanism for heterogeneous databases. A single query can gather results from multiple data sources via disparate communication protocols, and then present the results back to the user. The speed of making individual queries is not significantly slower than querying the native database, and since the ENQUire queries are done in parallel there is a large time saving when querying multiple databases.
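The time saving from parallel dispatch can be seen in a minimal sketch. The text does not say how ENQUire issues its parallel queries, so the thread pool below is a swapped-in technique chosen for brevity, and `query_all` and the toy backends are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def query_all(query: str, backends: Dict[str, Callable[[str], List[str]]]) -> Dict[str, List[str]]:
    """Send the same query to every backend concurrently and collect the replies."""
    if not backends:
        return {}
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(fetch, query) for name, fetch in backends.items()}
        # result() blocks until that backend has answered (or re-raises its error).
        return {name: future.result() for name, future in futures.items()}


# Toy backends standing in for remote database servers.
def toy_acedb(query: str) -> List[str]:
    return ["C05E2.1", "EM:CEDAF4LAR"] if query == "daf-4" else []


def toy_entrez(query: str) -> List[str]:
    return ["daf-4 mRNA, complete cds"] if query == "daf-4" else []


results = query_all("daf-4", {"ACeDB": toy_acedb, "Entrez": toy_entrez})
for name, hits in results.items():
    print(name, len(hits))
```

With network-bound backends the total wait approaches that of the slowest database rather than the sum of all of them.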

The ENQUire system is fully scalable and distributed. The only requirement for a new database is that some method of remote access exist. The databases do not need to be mirrored in any way, although it is anticipated that some of the major databases will be mirrored locally. Adding a database is simple, requiring only a few days to write a new translator from scratch, or a few hours to adapt an extant translator for a similar database. This ease of database addition distinguishes the ENQUire system from other database federation systems.

The system is straightforward, translating from a generic query language into database-specific query languages, then translating the results into a generic HTML format. The simplicity of the approach does have some inherent limitations. One limitation (inherited from the databases themselves) is found when querying databases with a browsing language (like ACEDB). While the object-oriented ACEDB browsing language is well suited to an interactive session, to perform a single query requires building a string of browser commands to follow non-reciprocal links between objects, making a full search implementation difficult. To illustrate this using the ACeDB model, one can find the person who performed a genetic experiment by following a direct link from a genetic cross data object to a person object (find twofactor 'x'; follow mapper), but the reverse search to find all the genetic experiments performed by a person doesn't work, since the reverse link from a person to their associated genetic mapping data doesn't exist.
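The asymmetry between forward and reverse lookups can be made concrete with a toy model. This is not how ACEDB stores data; the dictionary, the person labels, and both function names are invented purely to show why a one-way link supports one question cheaply and the other only by scanning every object.

```python
from typing import Dict, List

# Toy link table: each two-factor cross points at the person who mapped it.
# There is deliberately no table pointing from a person back to crosses.
TWO_FACTOR: Dict[str, Dict[str, str]] = {
    "x": {"mapper": "Person A"},
    "y": {"mapper": "Person B"},
    "z": {"mapper": "Person A"},
}


def follow(cross: str, link: str) -> str:
    """Forward lookup: one step along a stored link."""
    return TWO_FACTOR[cross][link]


def crosses_mapped_by(person: str) -> List[str]:
    """Reverse lookup: no stored link exists, so every cross must be inspected."""
    return sorted(
        cross for cross, links in TWO_FACTOR.items() if links.get("mapper") == person
    )


print(follow("x", "mapper"))
print(crosses_mapped_by("Person A"))
```

The forward call corresponds to the single "follow mapper" step in the text, while the reverse call has no single-command equivalent in a browsing language.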

Another limitation in the ENQUire system (and database federation in general) is the diversity of nomenclature between organisms. Since gene names differ between species, searching the Solanaceae database for a specific Caenorhabditis gene name is non-productive, even though both databases contain genes which are functionally equivalent. While this is a biological semantics problem and is out of the current scope of research, it does represent a significant hurdle to query federation systems. However, the query mechanism of the ENQUire system is flexible enough that when research projects like the Digital Library Initiative (Schatz et al., in preparation) achieve a true semantic database federation (i.e. federation of concepts rather than characters) the methodology can be adapted into the core framework.

The ENQUire system was conceived as a data browsing system. This means that in some cases detailed information cannot be accessed directly via the initial query. This is highlighted by the limited number of potential return types. In some cases, as with ACEDB, the return classes were dictated by the mechanism of making queries. In other cases, the limitation comes from the specific type of database represented, e.g. the limitation of the WBG database to returning only documents, while being searchable by any number of criteria. For all the databases, the native query/browsing language is richer in context than the generic one allowed in the ENQUire system. This suggests that the proper use of ENQUire is as a data discovery tool, to look up general queries and then to refer to the specific database(s) for detailed information.

The ENQUire system is a model for distributed query federation environments. Future expansion of the system will include new database translators, the addition of a data cache and the introduction of a session concept, allowing the user to retain and reference information from multiple searches. The data cache will hold information in a standardized data interchange format, allowing subsections of the data to be easily extracted.

Another potential for this work is the addition of translation libraries to handle submission of results to programs such as multiple alignment, parsimony, and structural analysis programs. The fusion of analysis tools to a distributed query system will result in a WWW-based analysis environment, wherein the researcher can retrieve data from many heterogeneous databases, view it, and manipulate it, using either their own local computers or computational resources at NCSA and other large-scale computing centers.

Acknowledgements

This work was supported in part by NSF grant BIR19844 from the Database Activities program in the Division of Biological Instrumentation and Resources. DCJ is supported by an NSF CDA-CISE postdoctoral fellowship. The authors wish to thank Doug Bigwood (USDA-NAL) and David George (NBRF) for their cooperation in accessing databases, and Shankar Subramaniam (NCSA) for his critical reading of the manuscript.

References

Akiyama,Y. and Kanehisa,M. (1995) Introduction to the Database Services on the GenomeNet. Exp. Med., 13 (in Japanese). See also URL http://www.genome.ad.jp/dbget/dbget.links.html

Barnett,J.D. and Bigwood,D. (1995) ACEDB/Perl Library. URL http://probe.nalusda.gov:8000/tools/acelib.html

Bleasby,A.J. and Wooten,J.C. (1990) Construction of validated, non-redundant composite protein sequence databases. Protein Eng., 3, 153-159.

Boguski,M.S. (1994) Bioinformatics. Current Opinion in Genetics and Development, 4, 383-388.

Durbin,R. and Thierry-Mieg,J. (1991) A C.elegans Database. Documentation, code, and data available from anonymous ftp servers at lirmm.lirmm.fr, cele.mrc-lmb.cam.ac.uk, and ncbi.nlm.nih.gov.

Etzold,T. and Argos,P. (1993) SRS - an indexing and retrieval tool for flat file data libraries. Comput. Applic. Biosci., 9, 49-57.

Fleischmann,R.D. et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512.

Levy,J. (1994) Sequencing the yeast genome: An international achievement. Yeast, 10, 1689-1706.

Manber,U. and Wu,S. (1994) Glimpse: A tool to search through entire file systems. 1994 Winter USENIX Conference.

Olson,M.V. (1993) The human genome project. Proc. Natl Acad. Sci. USA, 90, 4338-4344.

Rioux,P.A., Gilbert,W.A. and Littlejohn,T.G. (1994) A portable search engine and browser for the Entrez database. J. Comp. Biol., 1, 293-295.

Ritter,O. (1994) The Integrated Genome Database. In Suhai,S. (ed), Computational Methods in Genome Research. Plenum Press, New York, pp. 57-53.

Shoman,L.M., Grossman,E., Powell,K., Jamison,C. and Schatz,B.R. (1995) The Worm Community System, Release 2.0 (WCSr2). In Shakes,D. (ed), C.elegans: Modern Biological Analysis, Meth. Cell Biol. 48. Academic Press, New York, pp. 607-625.

Stein,L. (1995) CGI.pm - a Perl5 CGI Library. URL http://www-genome.wi.mit.edu/ftp/pub/software/WWW

Wilson,R. et al. (1994) 2.2 Mb of contiguous nucleotide sequence from chromosome III of C.elegans. Nature, 368, 32-38.

Wall,L. and Schwartz,R.L. (1991) Programming perl. O'Reilly & Associates, Inc., Sebastopol, CA, USA.

Received on November 16, 1995, accepted on January 8, 1996
