
Bioinformatics

David A Adler, Zymo Genetics Inc. and Department of Pathology, University of Washington, Seattle, Washington, USA

Darrell Conklin, Zymo Genetics Inc., Seattle, Washington, USA

Bioinformatics is a discipline at the intersection of computer science, information technology, mathematics and biology, and includes the study and practice of archiving, searching, displaying, manipulating and modelling biological data. Bioinformatics is applied, for example, in the construction of genetic and physical maps of genomes; nucleotide and amino acid sequence analysis; gene discovery and the prediction of protein structures.

Introduction

Advances in classical as well as modern biology have often been achieved by individuals presenting novel perspectives on previously available observational information. The elucidation of the genetic code following the publication of Watson and Crick's model for the structure and replication of DNA, along with the subsequent codification of the central dogma of molecular biology (DNA is transcribed into RNA which in turn is translated into protein), exemplifies the concept of biomolecules as information carriers. This view leads naturally to the application of computational approaches to the analysis of DNA and protein sequences. In addition, the development of high-throughput technologies for generating biological and biochemical data has contributed to a data explosion, thereby increasing the difficulty of simply examining all data pertinent to a biological question. The need to retrieve, organize and digest very large databases requires the development of computational tools for data interaction and analysis. Bioinformatics is a discipline at the intersection of computer science, information technology, mathematics and biology and includes the study and practice of archiving, searching, displaying, manipulating and modelling biological data. Bioinformatics research and development not only provides discovery tools for other biologists but is making direct intellectual contributions to biology and medicine.

Bioinformatics is alternatively referred to as biocomputing or computational biology, the choice of term depending on the focus of activity. The practitioner may have a background emphasizing any of the composite fields of study, and it is only recently that colleges and universities have developed interdisciplinary programmes with the goal of training bioinformatics professionals. An essential part of the infrastructure of bioinformatics is a communications medium with fast data transfer rates and high traffic capacity to provide almost simultaneous information access to thousands of people. In the late twentieth century the internet became that medium, and the principal interface is now the web browser. In just a few years the World Wide Web has taken over the majority of internet traffic, and the web's hypertext, hypergraphic presentation and interface has become the primary means of exploring a vast knowledge base of biological information.

The international Human Genome Project, with the goal of determining the three billion base pairs of human genetic information, has resulted in an exponential growth of genetic data and is driving the rapid growth and maturation of the discipline of bioinformatics. Delving from the level of amino acid sequence encoded in a gene to protein structure and its associated function touches upon central questions in biology. Computational approaches are now making important contributions to our understanding of the relationship of the structure and function of biomolecules and their roles in biological processes. This article will introduce some of the concepts, methods and tools of biocomputing and describe a few examples of areas of inquiry in bioinformatics.

Scope of Bioinformatics

Bioinformatics encompasses the study of a broad range of biological data including gene maps, gene and protein sequences and gene expression profiles. A primary goal of this data analysis is to unravel the information content of biomolecules and to understand how 'bioinformation' directs the development and function of living organisms. The analysis of nucleic acid sequence, protein structure/function relationships, genome organization, regulation of gene expression, interaction of proteins and mechanisms of physiological functions can all benefit from a bioinformatics approach. Nucleic acid and protein sequence data from many different species and from population samplings provide a foundation for studies leading to new understandings of evolution and the natural history of humans.

Article Contents

Secondary article

- Introduction
- Scope of Bioinformatics
- Hardware
- Software
- Mapping and Linkage Analysis
- Biosequence Analysis
- Conclusion

ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net

Access to the published scientific literature is another important aspect of bioinformatics, and in the realms of biotechnology and biomedicine in particular this includes invention information contained in patent archives. Library professionals are now more likely to be found searching digital databases than the shelves and have become digital information access experts using a variety of sophisticated search interfaces. One of the largest databases in the world is Medline, a vast archive of biological and biomedical journal references that covers the period from 1965 to the present. Medline's indices provide rapid searching of title, author, keywords and terms, citation and often abstracts of articles. Recently, an important technological promise has begun to be fulfilled: journal articles, complete with full text and high-resolution graphics, can be delivered almost instantly to the desktop.

Bioinformatics researchers, computer scientists and information specialists are also working on the conceptual foundations for the next generation of knowledge navigators. New hardware and software developments will break out beyond the limitations of our current windows on the world of biology and provide tools of discovery in the next century. The technological innovations in bioinformatics are accompanied by ethical concerns, particularly pertaining to data repositories of identifiable genetic information. The ethical implications of advances in bioinformatics necessitate investigation, and efforts on the part of scientists, lawmakers and the public are required to ensure the privacy of individuals.

Hardware

The computer is the basic tool of bioinformatics, utilized to store, display and analyse data, and to design and construct scientific models and simulations. Computer hardware requirements are commonly dictated by the tasks needing to be accomplished, the software available to do the job, the computational intensity of the process and the degree of interactivity desired. Modern personal computers have a higher performance than the supercomputers of two decades ago, so that sophisticated programs and complex, interactive software can be run on the desktop. For jobs that require more computational capacity, such as rapid searches of very large databases, protein modelling, three-dimensional display and simulating the interaction of large molecules, it may be necessary to employ current supercomputer-class machines, which provide high performance by harnessing multiple central processing units (CPUs). Hardware solutions (algorithms in silicon) designed to perform a single computational task, such as extremely fast searching, have also been developed. However, the high cost of custom hardware for specific computational tasks has limited its widespread application.

Software

Software for bioinformatics is as task-driven as the hardware. Data descriptions, the types of searches and analysis, how one needs to interact with the computer (interface), and how the results are presented will all determine the choice of software for the task at hand. Programs for the analysis of genetic and physical mapping data and for drawing pedigrees and evolutionary trees are available from both commercial and academic sources. Sequence analysis suites generally include programs for assembling sequences, pattern or string searching, restriction analysis, motif identification, base or amino acid composition analysis and protein characterization. There are also individual programs for particular tasks such as multiple sequence analysis, for example Clustal W (see Table 1), and for similarity searches of databases, such as BLAST (Table 1). Institutions and schools have often obtained site licences for software packages, which are then made accessible for use on networked desktop computers. Software for a wide variety of computational tasks that has been developed at academic institutions is often freely available for download via the internet. Unfortunately sites come and go on the internet, so it is difficult to maintain lists of resources with associated links. Instead of presenting a comprehensive list that will become obsolete almost instantly, the reader is referred to several stable, well-maintained sites as starting points for finding biology-related software and documentation on the internet (Table 1).

A web browser has become another necessary bioinformatics tool, since the web medium is often the easiest means of accessing data from remote networked databases. Particularly for very rapidly growing map and sequence databases, it is not practical or appropriate to try to maintain a local copy of the data. To ensure the accuracy and timeliness of information from databases that are updated daily it is necessary to be able to access those sites directly. The search interfaces provided for interacting with the major data repositories are powerful and fast, delivering responses within seconds. Network server software often reports results in hypertext format, facilitating the further investigation of details and related information from other databases. Regardless of the particular software one chooses for a task, it is important to know the program well enough to use it efficiently, to maximize its utility and to evaluate the significance of computational results.

Mapping and Linkage Analysis

The three billion base pairs comprising the human genome are distributed among 24 different linear DNA molecules which in turn are packaged into individual chromosomes (autosomes 1-22 and the sex chromosomes, X and Y). It is predicted, very approximately, that there are 100 000 genes and each has a distinct location on one of the human chromosomes. The process of assigning genes and DNA fragments to locations on particular chromosomes is called mapping.

Gene maps are of two primary types, genetic and physical. Genetic maps, determined by family studies in humans and defined crosses of laboratory organisms such as mice, provide the chromosomal assignment of a gene and its position relative to other genetic markers. Synteny is the term for two genes being on the same chromosome, and they are linked if they are sufficiently close on the chromosome that they appear to be inherited together. The determination of genetic map distance between genes is referred to as linkage analysis. Linkage can only be determined for polymorphic genes, those that have two or more distinguishable alleles in a population. Genetic map distances are calculated by assessing the frequency of recombination between two polymorphic loci on a chromosome and are usually expressed in units called morgans (0.01 morgans, or one centimorgan (cM), is equal to 1.0% recombination). For a simple two-point cross, the longest genetic distance that can be measured is 50 cM, or 50% recombination, since genes further apart will appear to act like unlinked genes. However, by combining data from multiple loci, complete chromosome maps can be deduced. A general principle of gene mapping is that the closer two loci are, the less likely it is that a recombinational event, or breakage in the case of several physical-mapping methods, will occur between them. Since genetic mapping is dependent upon probabilistic phenomena, statistical methods are necessary for the calculation of map distances based on actual observations of inheritance. The significance of linkage data is typically reported as a prediction probability and expressed as a logarithmic odds ratio, or LOD (logarithm of the odds) score. As a practical and general rule, a LOD score of 3 or greater between two genetic markers is considered significant evidence for linkage. Examples of software for the analysis of genetic linkage data and the calculation of linkage and mapping distances are available (see, for example, The Laboratory of Statistical Genetics at Rockefeller University, and WIBR Mapmaker; Table 1).

Table 1 Starting points for exploring bioinformatics on the web

Site | Description | URLs
Clustal | Multiple sequence alignment software (about ClustalW) | http://bioinformer.ebi.ac.uk/newsletter/archives/2/clustalw17.html
BLAST servers: NCBI (USA), EMBL (Germany) | Sequence database similarity searching | http://www.ncbi.nlm.nih.gov/blast ; http://dove.embl-heidelberg.de/Blast2
NIH software | Software repository and well-maintained link lists | http://molbio.info.nih.gov/molbio/software.html
CMS Molecular Biology Resource | | http://www.sdsc.edu/ResTools/cmshp.html
EXPASY SIB | | http://www.expasy.ch/
Weizman Institute of Science | | http://bioinformatics.weizmann.ac.il/mb/software.html
Biology Department, Indiana University | | http://www.bio.indiana.edu/generalinfo/bioresearch.html
The Laboratory of Statistical Genetics, Rockefeller University | Genetic mapping background and resources | http://linkage.rockefeller.edu
WIBR Mapmaker | Mapping software distribution | http://waldo.wi.mit.edu/ftp/distribution/software/mapmaker3
Centre d'Etude du Polymorphisme Humain (CEPH) | Human mapping resources | http://www.cephb.fr
Mouse Genome Database, Jackson Laboratory | Mouse mapping and informatics | http://www.informatics.jax.org
EUCIB | European Collaborative Interspecific Backcross mouse genetic mapping | http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage.html
Radiation Mapping: EBI, Stanford | Radiation hybrid mapping tools and resources | http://www.ebi.ac.uk/RHdb ; http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper
Genome Database | Human gene mapping database | http://www.gdb.org
OMIM | Catalogue of human genes and genetic disorders | http://www.ncbi.nlm.nih.gov/omim
PDB | Protein Data Bank protein structure resource | http://www.rcsb.org/pdb
MapManager | Software suite for genetic mapping projects | http://mcbio.med.buffalo.edu/mapmgr.html

It is obviously not practical to control matings in human populations, so human genetic maps can only be elucidated by following the segregation of traits, or genetic markers, in family studies. Such methods have been widely used in the assignment of inherited disease loci to particular chromosomes and chromosomal regions but usually require pedigrees of large families with multiple generations. Investigators utilizing samples from large family resources, such as at the Centre d'Etude du Polymorphisme Humain (CEPH; see Table 1), have made major contributions to the present density of the human genetic map. The advent of recombinant nucleic acid technology, with the ability to clone and visualize particular small fragments of DNA, and the identification of simple sequence repeat polymorphisms in human populations have further expanded the analysis of the large family repositories, thereby contributing to the density of human genetic maps.

In the laboratory mouse, with a long history of genetic analysis, genetic maps have been constructed over the years by following the segregation of alleles in experimental matings between well-characterized inbred strains. Genetic heterogeneity of 0.3-1.0% between the laboratory mouse and the interfertile species, Mus spretus, has been exploited to produce high-resolution mouse genetic maps. MapManager (Table 1) is a software package for tracking results of genetic crosses, calculating map distances and generating chromosome maps. Starting points for exploring mouse gene mapping data are Mouse Genome Informatics, from Jackson Labs, and EUCIB (European Collaborative Interspecific Mouse Backcross); see Table 1.

Physical maps exploit techniques of molecular and cellular biology to localize genes and other markers without the need for family studies or genetic crosses, and do not require polymorphic genes. Methods of somatic cell genetics, DNA hybridization and the polymerase chain reaction (PCR) have provided a streamlined approach to human gene mapping. Cytogenetic techniques combined with nucleic acid hybridization provide the only direct means of localizing genes on chromosomes. Nucleic acid probes, labelled with fluorescent dyes, are allowed to bind to their complementary sequences on spread chromosomes and are detected by fluorescence microscopy. A trained cytogeneticist, who is able to visually identify each individual human chromosome, must assess the precise position of the fluorescent label. This technique, fluorescence in situ hybridization, or FISH, usually applied to condensed metaphase chromosomes, has now been extended to interphase chromatin. Interphase FISH using two differentially labelled probes has been used to measure distances between genes in the range of 50 kilobases to 2 megabases.

Other than FISH, physical mapping methods localize genes relative to previously mapped markers. The incidence of random breakage events between markers and the occurrence of concordant cloning of two genes are both used to estimate physical distance between loci. Human cells, irradiated with X-rays to induce chromosome breakage, can be fused with normal rodent (mouse or hamster) cells, producing hybrid cell lines. Each individual cell line will retain only one or a few fragments of human chromosomal material. The presence of human DNA is usually detected by PCR assay of an STS (sequence tagged site: a fragment of genomic DNA that can be uniquely amplified). The detection of two STSs in the same hybrid cell line is evidence that the two associated DNA fragments reside on the same chromosomal fragment. Statistical analysis of data from assaying many different hybrid cell lines can estimate the distance between genes and, with a sufficient number of cell lines, a map of the human genome can be generated. Once the map framework is created, the panel of radiation hybrid cell lines can be used to map new loci. New genes are mapped by assaying each cell line in the panel for the presence of the new locus (an STS) and then evaluating the concordance of PCR-positive cell lines with previous mapping data. Information on the Genebridge panel (93 cell lines) can be found at the EBI and Stanford University (Table 1). Software for generating maps from radiation hybrid scoring data is available from WIBR (Table 1).

Content mapping is based on recombinant techniques used to clone DNA fragments into various vectors. The variety of vectors can be grouped by the size of foreign DNA insert they can carry. Commonly used cloning vectors include phage, plasmids, cosmids, BACs (bacterial artificial chromosomes) and YACs (yeast artificial chromosomes). If genomic DNA is randomly broken into appropriately sized pieces and then packaged into one of these vectors, individual clones can be analysed to determine map distance relationships. If, for example, PCR assays for two genes are both positive in a single recombinant clone, then the two genes must be no further apart than the size of the insert DNA. Thus the resolution of this type of STS content mapping is dependent on the particular vector chosen. Collections of characterized cloned material are also amenable to creating overlapping contigs along an entire chromosome (see, for example, Foote et al., 1992). Reconciliation and integration of maps derived by different methods, particularly the combining of physical and genetic maps, contribute to increasing the accuracy and resolution of mapping data. The correspondence of different map units can also be approximated from integrated maps; for example, 1 cM is roughly equivalent to 1-2 megabases of DNA and similar in size to a small cytogenetic band. Examples of integrated maps, such as the CEPH-Genethon integrated map, can be searched and viewed (see Table 1). Comparisons of the gene maps of different species have proven valuable in evolutionary studies as well as in the identification of human disease genes. The functional significance of the conservation of genome arrangement, as evidenced by, for example, the homeotic genes maintained evolutionarily as clusters, a conserved linkage from fruit flies to humans, remains unclear.

The development, refinement and application of all these mapping technologies have produced dense maps of entire genomes. Human gene maps of individual chromosomes that could once be reported in graphic form on a single sheet of paper can now only be displayed with computer technology, due to the exponential increase in the number of localized markers. The availability of dense gene maps also greatly facilitates positional cloning of disease loci. Positional cloning refers to a commonly used strategy that starts from knowing only the approximate location of a gene and progressively narrows the critical region until mutations in a single gene are shown to be associated with the phenotype. There are many examples of the cloning of genes for human inherited diseases using this approach, including Huntington disease, Duchenne muscular dystrophy and cystic fibrosis. Dense maps also provide the foundation for the realization of the ultimate map, the complete genome sequence. In order to ensure the value and accessibility of mapping data it is essential to maintain authoritative repositories. The ability to search and display this information is essential and is thus another important aspect of bioinformatics.
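The two-point linkage calculation described earlier in this section (recombination frequency, map distance in centimorgans and the LOD threshold of 3) can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not part of the article: every meiosis is fully informative and phase-known, so recombinants can be counted directly, and the map distance is taken as 100 times the recombination fraction, with no mapping function applied. The function names are hypothetical and do not correspond to any of the packages in Table 1.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score for linkage at recombination fraction theta.

    Compares the likelihood of the observed counts at theta with the
    likelihood under no linkage (theta = 0.5). Assumes phase-known,
    fully informative meioses (an illustrative simplification).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def two_point_estimate(recombinants, meioses):
    """Return (theta_hat, distance_cM, LOD at theta_hat)."""
    # The observed recombinant fraction is the maximum-likelihood estimate.
    # It is capped at 0.5 (unlinked) and floored just above zero so that
    # the logarithm is defined when no recombinants are observed.
    theta_hat = min(max(recombinants / meioses, 1e-6), 0.5)
    distance_cm = 100 * theta_hat  # 1% recombination = 1 cM
    return theta_hat, distance_cm, lod_score(recombinants, meioses, theta_hat)


theta_hat, distance_cm, lod = two_point_estimate(recombinants=2, meioses=20)
print(f"theta = {theta_hat:.2f}, distance = {distance_cm:.0f} cM, LOD = {lod:.2f}")
```

With 2 recombinants among 20 meioses the estimate is 10% recombination (about 10 cM) and the LOD score is about 3.2, just above the conventional threshold for significant evidence of linkage. Real pedigree data are rarely phase-known, which is why dedicated linkage software is used in practice.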

Biosequence Analysis

The development of technologies to determine the linear sequence of amino acids in proteins and of nucleotides in DNA and RNA has created the need to compile and analyse sequence data. Sequence analysis is the process of investigating the information content of raw nucleic acid and protein sequence data.

Nucleic acid sequence analysis

The bulk of genomic DNA does not code for protein, and the protein-coding regions of human genes are not contiguous but are arranged with exons interspersed with introns. Therefore an important question for computational biology is how to detect protein-coding regions within genomic DNA. Other common tasks include translating DNA into protein, assembling partially overlapping fragments, analysing sequences, comparing sequences, and DNA motif discovery and recognition. Current DNA sequencing technologies are not capable of generating complete sequence for long nucleic acid molecules in a single sequencing run, so it is necessary to utilize computational methods to assemble contiguous sequences from individual short sequence determinations. If a large DNA molecule is randomly broken into smaller pieces for the actual sequence determinations, then a contiguous linear sequence can be reconstructed by aligning the overlapping portions from different random fragments.

A common question arising when new genes are cloned and sequenced is whether the sequence is already known or does not occur in current databases. Answering this question requires comparing the newly obtained sequence to every sequence in the database. The algorithm of choice for this task is the extremely rapid BLASTN algorithm (Altschul et al., 1990). A list of all W-mers (contiguous fragments of length W, which is typically set between 11 and 16) in the query sequence is first compiled, and then every sequence in the database is in turn checked against this list. This can be done rapidly and serves to rule out most sequences from consideration. The regions that match a W-mer are then extended in either direction, using less stringent matching, to form HSPs (high-scoring segment pairs). The expectation value of the HSP (the probability that an HSP of a similar score will occur between two random sequences) is computed, and all database sequences having significant HSPs are reported. Overall database access time by BLASTN is minimized by using a compressed form of the nucleotide data and by using a memory-mapped file. It is an algorithm highly amenable to parallelism and can be compiled to run on multiprocessor hardware.
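The seed-and-extend idea behind this kind of search can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the BLASTN implementation: the scoring values (+1 for a match, -1 for a mismatch), the drop-off threshold `x_drop`, the short word length and all function names are chosen here for illustration only, and no expectation value is computed.

```python
from collections import defaultdict


def index_words(query, w):
    """Map each W-mer in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) for every exact W-mer match."""
    index = index_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Ungapped extension in one direction; returns (best_gain, best_length)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        sj += step
    return best_gain, best_length


def extend_seed(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend a seed both ways; returns (score, query_start, subject_start, length)."""
    right_gain, right_len = _extend(query, subject, i + w, j + w, 1,
                                    match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1,
                                  match, mismatch, x_drop)
    score = w * match + left_gain + right_gain
    return score, i - left_len, j - left_len, w + left_len + right_len


def find_hsps(query, subject, w):
    """Collect the distinct segment pairs obtained by extending every seed."""
    return {extend_seed(query, subject, i, j, w)
            for i, j in find_seeds(query, subject, w)}


query = "TGCATGACCGTA"
subject = "GGTGCATGACCGAACC"
print(find_hsps(query, subject, w=4))
```

In this example seven overlapping 4-mer seeds all extend to the same ten-base segment pair, which starts at position 0 of the query and position 2 of the subject. A database search would repeat this against every subject sequence and keep only the segment pairs whose scores are statistically significant.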

Amino acid sequence analysis

Proteins, the linear chains of amino acids produced by gene translation, are found in cells folded into functionally active structures. It is thought that the primary sequence of the protein determines the ultimate conformation of the protein and therefore its biological function. However, the flexibility of long-chain polypeptides can generate an infinite number of shapes, and the computational task of predicting correct structures is beyond the reach of current knowledge. Predicting the shape of a protein from its linear amino acid sequence is one of the holy grails of computational biology. Solving the protein-folding problem holds the promise of spawning major advances in molecular biology, pharmacology and the treatment of disease.

An indispensable resource for the bioinformatics scientist is the Protein Data Bank (PDB; Table 1), which is a repository of solved protein structures, that is, mappings of each atom in a protein onto three-dimensional space. Such structures are solved using X-ray crystallography; the diffraction pattern of molecular crystals is interpreted to create maps of electron density. This technique is very time consuming and requires the availability of protein crystals. For this reason, the number of known protein sequences vastly exceeds the number of sequences with solved structures in the PDB.

    Secondary structure prediction

    In a classic paper, Levitt and Chothia (1976) proposed a classification of protein structures into four structural classes: alpha, beta, alpha/beta and alpha+beta. Although there is a predictive relationship between the amino acid composition of a protein and its class (Chou, 1995), a prediction of protein class is in most circumstances too broad to be of general use. Knowing, for example, that a protein is in the alpha class does not distinguish among a globin, an annexin and an interferon, all alpha class proteins with different topologies (numbers, lengths and connectivities of helices) and therefore very different biochemical functions. The computational technique that can provide the scientist with information on protein topology is called secondary structure prediction. A secondary structure prediction is simply an assignment of a secondary structure state, such as helix, strand or coil, to every amino acid of the query protein sequence.

    Most secondary structure prediction algorithms derive their models using proteins in the PDB, modelling observed relationships between short pieces of contiguous sequence and secondary structure. Such relationships are not specific; that is, near-identical short stretches of sequence can have different secondary structures. Furthermore, the PDB is not large enough to contain sufficient statistics on longer, more specific stretches of sequence that may be encountered in a query sequence. Currently the best secondary structure prediction algorithms (Rost and Sander, 1993) achieve an accuracy of only about 70% (the number of secondary structure states correctly assigned divided by the number of amino acids in the protein). It is widely agreed that this figure is close to an upper bound on current methods, because none of them can adequately model long-range interactions in the protein sequence.
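
    The accuracy measure quoted above is simple enough to state as code. The following sketch assumes that both the predicted and the observed structure are written as strings with one state letter per residue (H for helix, E for strand, C for coil); the function name q3_accuracy and the two example strings are hypothetical.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose secondary structure state is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Made-up example: 16 residues, 13 of which are assigned the right state.
observed = "CCHHHHHHCCEEEECC"
predicted = "CHHHHHHCCCEEECCC"
accuracy = q3_accuracy(predicted, observed)
```

    In this toy case the errors fall at the ends of the helix and the strand, which is where per-residue predictions of this kind typically disagree with the observed assignment.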

    Comparative modelling

    Through the ages the human genome has been the target of major evolutionary processes such as gene duplication, gene fusion, gene rearrangement and gene deletion. The individual gene has been subjected to the more subtle process of base mutations that often change the protein sequence of the gene product. Genes have evolved substantially while still preserving the three-dimensional structure of their protein. This is because mutations that substantially alter a protein fold usually destroy the normal function of the protein, and therefore do not persist through generations. Furthermore, amino acids with similar hydropathic properties can often be substituted for one another in a protein without appreciably changing its structure.

    Therefore, the first step in predicting the fold of a new protein is to determine whether it is evolutionarily related to some sequence in the PDB. The technique for testing two protein sequences for an evolutionary relationship is pairwise alignment using a dynamic programming algorithm (Needleman and Wunsch, 1970; Smith and Waterman, 1981). It has become abundantly clear that when two sequences have a high percentage of identical amino acids at aligned positions, they will tend to have very similar folds. Sander and Schneider (1991) quantified the relationship between percentage identity, alignment length and structural similarity by studying proteins of known structure in the PDB. Roughly stated, when a pair of sequences has at least 25% identity over at least 80 amino acids, there is a high probability that the sequences have the same structure in the aligned region. It must be stressed that the converse of this implication is not true: due to evolutionary divergence, two related sequences may not have a significant alignment. That is, there may be a high probability of finding an alignment of equal or higher score between two random sequences.

    Dynamic programming algorithms, unless implemented in massively parallel hardware, are too slow for interactive application to very large databases. This is because they require, for every database sequence, the computation of every cell in a score matrix, the total number of cells being equal to the product of the query and subject sequence lengths. Several clever algorithms have been devised to avoid the computation of the full score matrix; two popular methods are FASTA and BLASTP. FASTA initially computes a hash table containing all k-tuples (peptides of length k) in the query sequence. A target sequence can be tested very rapidly for the presence of these k-tuples. The second step of the FASTA algorithm performs a limited computation of the full score matrix, only in regions which join selected k-tuples. The BLASTP method initially compiles a finite state machine, employing an extremely rapid technique from computer science for finding common substrings in sequences. Using an amino acid comparison matrix, all W-mer peptides (W is typically set at 2 or 3) that could possibly attain a score greater than a threshold score to any W-mer in the query are placed into the machine. The value of this threshold depends on the comparison matrix and on other parameters of the algorithm. Each region of a database sequence that reaches the final state of the machine is then extended in either direction to form high-scoring segment pairs (HSPs). Sequences with statistically significant HSPs are reported to the scientist. Recent extensions to BLASTP permitting gaps in the alignment now make BLASTP and FASTA nearly indistinguishable in terms of search sensitivity.

    Protein sequences related by ancient evolutionary events of gene duplication are said to form a family of sequences. To accurately delineate a protein family, access to all ancestral sequences would be needed, which is not possible. However, it is a convenient fact that relatedness within protein families is transitive; if sequence A is related to B, and B is related to C, it can be inferred that A is related to C, even though A and C might not have a significant alignment. This fact is applied routinely to infer the structure of new protein sequences.

    Comparative modelling encompasses more than the task of finding a structurally similar protein. After backbone coordinates of ungapped regions of the alignment are transferred onto the target sequence, a full atomic model is developed. This involves assigning coordinates to gapped, loop regions, and assigning coordinates to residue side-chains. The rough model can be improved by energy minimization techniques.
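
    The dynamic programming alignment at the heart of this section can be sketched in a few lines of Python. The version below is a minimal global alignment in the style of Needleman and Wunsch (1970). The scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions, whereas practical tools use an amino acid comparison matrix and more elaborate gap penalties. The helper likely_same_fold is a hypothetical name that applies the rough rule of 25% identity over 80 amino acids quoted above, with identity counted over ungapped aligned positions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top, bottom):
    """Percentage of identical residues among ungapped aligned positions."""
    pairs = [(x, y) for x, y in zip(top, bottom) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def likely_same_fold(top, bottom, min_identity=25.0, min_length=80):
    """Rough threshold from the text: at least 25% identity over 80 residues."""
    aligned = sum(1 for x, y in zip(top, bottom) if x != "-" and y != "-")
    return aligned >= min_length and percent_identity(top, bottom) >= min_identity


# Toy example with made-up peptides.
score, top, bottom = needleman_wunsch("ACDEFG", "ACDFG")
```

    For this toy pair the alignment places a single gap opposite the E residue. Every cell of the score matrix is filled, so the work grows with the product of the two sequence lengths, which is the cost that FASTA and BLASTP are designed to avoid.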

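
    The first FASTA step described in this section, hashing the k-tuples of the query and scanning a target for them, can be sketched as follows. This is a simplification and not FASTA itself: the function names are hypothetical, the peptides are made up, and grouping hits by diagonal (query position minus target position) is shown only to suggest where a restricted score matrix computation would then be focused.

```python
from collections import Counter, defaultdict


def index_ktuples(query, k=2):
    """Hash table mapping each k-tuple of the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(table, target, k=2):
    """Count shared k-tuples per diagonal (query position minus target position)."""
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits


# Made-up peptides: the target contains the start of the query shifted by two residues.
table = index_ktuples("MKVLAAGIV")
hits = diagonal_hits(table, "QQMKVLAAG")
best_diagonal, shared = hits.most_common(1)[0]
```

    A diagonal that collects many hits marks a region where the two sequences share a run of k-tuples, so only a narrow band of the score matrix around it needs to be examined.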
    Sequence motifs

    Even if a protein family is divergent, it may be possible to identify short regions that appear to have conserved sequences and therefore locally conserved structure and perhaps biochemical function. Each region can be described using a motif that states, for each position, the allowed variation in possible amino acids using a distinct score for each. Dynamic programming algorithms are used to align motifs to sequences. Motifs can be viewed as compact expressions for a protein family, an alternative to representing the family as a list of its members. Furthermore, a match to a motif may not be statistically significant but may be biologically significant because higher scores may not occur when applying the motif to other protein families. If a discriminating motif matches a sequence of unknown structure it can be inferred that the sequence has the same protein fold as the family.

    The computational techniques used to create motifs fall into four classes. The standard technique creates the motif using variation observed in columns of a multiple sequence alignment of the family. There are several ways to compute motifs from a multiple alignment (Gribskov et al., 1987; Tatusov et al., 1994). Other techniques are machine learning algorithms that attempt to create motifs without requiring a multiple alignment of the family (Brazma et al., 1998). The hidden Markov model techniques (Krogh et al., 1994) try to fit available data to a sequence of probability distributions using a local optimization algorithm. Finally, some techniques are iterative algorithms, which generalize a motif by repeated searches of a sequence database (Tatusov et al., 1994) using the evolving motif.
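
    As a rough illustration of the standard technique, a motif with a distinct score for each amino acid at each position can be derived from the columns of an ungapped alignment block. The sketch below is one simple way to do this and is not the method of any of the cited papers: the log-odds scoring, the uniform background frequency, the pseudocount and the function names are all assumptions, the aligned segments are made up, and the scan is ungapped rather than a full dynamic programming alignment.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_motif(aligned_segments, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_segments) + pseudocount * len(AMINO_ACIDS)
    motif = []
    for pos in range(len(aligned_segments[0])):
        counts = Counter(segment[pos] for segment in aligned_segments)
        motif.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return motif


def best_match(motif, sequence):
    """Slide the motif along the sequence; return (score, start) of the best window."""
    width = len(motif)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(motif[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Made-up aligned segments standing in for one conserved region of a family.
segments = ["GKST", "GKTT", "GKSS", "AKST"]
motif = build_motif(segments)
score, start = best_match(motif, "MLVAGKSTPQ")
```

    The motif is a compact summary of the four segments: residues seen often in a column score above zero, and residues never seen score below zero.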

    Fold recognition

    Often a new protein sequence contains no recognizable motifs, nor can its structure be inferred by comparative modelling. In such cases, one can resort to fold recognition approaches. The task of fold recognition is easily defined but notoriously difficult to solve: for a given sequence, determine which, if any, structures in the PDB are compatible with the sequence.

    Because the function of a protein is determined by its three-dimensional structure, mutations causing amino acid changes that grossly alter the structure of the protein will usually inactivate the protein function and will be selected against by evolution. It is for this reason that, despite the vast space of protein sequences explored by evolution over the ages, there probably exist only several thousand unique protein topologies (Hubbard et al., 1992). As the PDB continues to expand with new solved protein structures, the chance that a new gene product folds like a known structure will continue to increase.

    Fold recognition by threading is a new, powerful technique. Threading methods are based upon the assumptions that protein structures are in a state of minimum free energy, and that this energy can be roughly computed for any given structure. The energy computation takes into account the compatibility of different amino acids at each position in the structure. This compatibility usually reflects the preference of hydrophobic amino acids for the core environment of the protein, and the potential energy created when two amino acids are spatially close to one another.

    Given a function that can evaluate the compatibility of a sequence with a structural template whose native sequence has been removed, threading algorithms attempt to minimize this function by considering various possible sequence-to-structure alignments. The threading task is enormously complex since exponentially many (as a function of sequence and structure sizes) alignments are possible, and the presence of arbitrarily many pairwise interactions in a protein structure precludes the use of dynamic programming alignment algorithms to produce optimal solutions. There are two interesting heuristic algorithms for obtaining at least a feasible solution in the face of this complexity. One is the approach of Jones et al. (1992), which uses a variant of the standard dynamic programming algorithm. Another is the statistical sampling approach of Madej et al. (1995), which iteratively modifies a working alignment until a local minimum is reached. Both approaches have had some success in predicting the fold of proteins of unknown structure, although low selectivity (proteins of different structure appearing to be compatible) continues to be an issue.
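
    The two ingredients of a threading energy named above, a burial preference and a pairwise contact term, can be shown with a deliberately tiny sketch. It is not the method of Jones et al. or Madej et al.: the energy values, the set of residues treated as hydrophobic, the template and the function names are all made-up assumptions, and only ungapped placements of the sequence on the template are tried, which sidesteps the exponential search described in the text.

```python
HYDROPHOBIC = set("AVILMFWC")  # assumed, simplified hydrophobic set


def threading_energy(residues, buried, contacts):
    """Toy energy of placing residues on template positions (lower is better)."""
    energy = 0.0
    for aa, is_buried in zip(residues, buried):
        hydrophobic = aa in HYDROPHOBIC
        if is_buried:
            energy += -1.0 if hydrophobic else 1.0   # core prefers hydrophobic residues
        elif hydrophobic:
            energy += 0.5                            # exposed hydrophobic residue is penalized
    for i, j in contacts:
        # Reward pairs of hydrophobic residues that are spatially close in the template.
        if residues[i] in HYDROPHOBIC and residues[j] in HYDROPHOBIC:
            energy -= 0.5
    return energy


def best_ungapped_threading(sequence, buried, contacts):
    """Try every ungapped placement; return (energy, start) of the lowest energy one."""
    width = len(buried)
    candidates = [
        (threading_energy(sequence[start:start + width], buried, contacts), start)
        for start in range(len(sequence) - width + 1)
    ]
    return min(candidates)


# Hypothetical six-position template: burial flags and two contacting position pairs.
buried = [False, True, True, False, True, False]
contacts = [(1, 4), (2, 4)]
energy, start = best_ungapped_threading("KDSLVEIKDG", buried, contacts)
```

    Once gaps are allowed, the contact term couples positions that are far apart in the sequence, and it is this coupling that prevents a simple dynamic programming solution.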


    Conclusion

    As complete genome sequences become available and many more protein structures are solved, new challenges for bioinformatics are appearing. Investigators are just beginning to address questions about living organisms as dynamic systems, and these explorations will once again expand the scope of bioinformatics. Advances in the various arenas of bioinformatics hold the promise of revolutionizing biological understanding and thereby contributing to progress in preventing and treating disease.

    References

    Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990) BLAST Basic Local Alignment Search Tool. Journal of Molecular Biology 215: 403–410.

    Brazma A, Jonassen I, Eidhammer I and Gilbert D (1998) Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology 5(2): 279–305.

    Chou KC (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins 21(4): 319–344.

    Foote S, Vollrath D, Hilton A and Page DC (1992) The human Y chromosome: overlapping DNA clones spanning the euchromatic region. Science 258: 60–66.

    Gribskov M, McLachlan AD and Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences of the USA 84(13): 4355–4358.

    Hubbard TJ, Ailey B, Brenner SE et al. (1992) SCOP: a structural classification of proteins database. Nucleic Acids Research 27: 254–256.

    Jones DT, Taylor WR and Thornton JM (1992) A new approach to protein fold recognition. Nature 358: 86–89.

    Krogh A, Brown M, Mian IS, Sjolander K and Haussler D (1994) Hidden Markov models in computational biology. Applications to protein modeling. Journal of Molecular Biology 235: 1501–1531.

    Levitt M and Chothia C (1976) Structural patterns in globular proteins. Nature 261: 552–558.

    Madej T, Gibrat JF and Bryant SH (1995) Threading a database of protein cores. Proteins 23(3): 356–369.

    Needleman SB and Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48(3): 443–453.

    Rost B and Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology 232: 584–599.

    Sander C and Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9(1): 56–68.

    Smith TF and Waterman MS (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.

    Tatusov RL, Altschul SF and Koonin EV (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Sciences of the USA 91: 12091–12095.

    Further Reading

    Baxevanis A and Ouellette BFF (eds) (1998) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Chichester: John Wiley and Sons.

    Lesk AM (ed.) (1988) Computational Molecular Biology, Sources and Methods for Sequence Analysis. Oxford: Oxford University Press.

    Gribskov M and Devereux J (eds) (1991) Sequence Analysis Primer. New York: Stockton Press.

    Schuler GD, Boguski MS, Stewart EA et al. (1996) A gene map of the human genome. Science 274: 540–546. [http://www.ncbi.nlm.nih.gov/genemap/]

    Smith CM (1997) The CMS Molecular Biology Resource: Bio-Web resources organized by analytical function. Trends in Genetics 13: 416. [(1998) MolyBio. Science 281: 139]

    Vogel F and Motulsky AG (1997) Human Genetics and Approaches. Berlin: Springer.

    von Heijne G (1987) Sequence Analysis in Molecular Biology, Treasure Trove or Trivial Pursuit. San Diego: Academic Press.
