databases in bioinformatics_abridged


Upload: kumarvino117

Post on 07-Apr-2018


8/3/2019 Databases in Bioinformatics_abridged

http://slidepdf.com/reader/full/databases-in-bioinformaticsabridged 1/11

DATABASES IN BIOINFORMATICS

A database allows data to be stored, searched and retrieved properly. It is useful to:

• Handle and share large volumes of data

• Support large-scale analysis efforts

• Make data access easy and up to date

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They are a storehouse of information on DNA, proteins, genes, chromosomes, genomes, proteomes, organisms, taxonomy, journals, etc.

Types of Databases                      Information they contain
Bibliographic Databases                 Literature
Taxonomic Databases                     Classification
Nucleic Acid Databases                  DNA information
Genomic Databases                       Gene-level information
Protein Databases                       Protein information
Secondary Sequence Protein Databases    Classification of protein domains
Enzymes / Metabolic Pathways            Metabolic pathways

Primary databases

• They contain sequence data of nucleic acid or protein

• Examples of primary databases include:

Nucleic Acid Databases    Protein Databases
GenBank                   SWISS-PROT
EMBL                      TrEMBL
DDBJ                      PIR

The International Sequence Database Collaboration

• The three nucleic acid databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region, e.g. EMBL collects from Europe, GenBank from the USA and DDBJ from Japan.

• These databases automatically update each other every 24 hours with the new sequences collected from each region. The result is that they contain exactly the same information, except for any sequences that have been added in the last 24 hours.
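The daily exchange described above can be pictured with a small Python sketch. It is a toy model only: the accession labels are made up, and the real exchange mechanism between the three archives is not described in this text, so a plain set union stands in for it.

```python
def daily_exchange(databases):
    """Return the state after one exchange cycle: every archive holds
    the union of all records held by any archive."""
    everything = set().union(*databases.values())
    return {name: set(everything) for name in databases}


# Hypothetical record labels, for illustration only.
archives = {
    "GenBank": {"SEQ_A", "SEQ_B"},
    "EMBL": {"SEQ_A", "SEQ_C"},
    "DDBJ": {"SEQ_A"},
}

synced = daily_exchange(archives)

# A submission made after the exchange is visible in one archive only,
# until the next cycle runs.
synced["EMBL"].add("SEQ_D")
lagging = synced["EMBL"] - synced["DDBJ"]   # records DDBJ has not yet received
resynced = daily_exchange(synced)
```

After `daily_exchange`, all three sets are equal; `lagging` holds only the record added since the last cycle, which mirrors the "except for the last 24 hours" caveat.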


NCBI

Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in Bethesda, Maryland, USA.

PRIMARY NUCLEOTIDE SEQUENCE DATABASES

GenBank

GenBank is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation, built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, USA. NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS), and other high-throughput data from sequencing centers. GenBank, the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL) in Europe, and the DNA Data Bank of Japan (DDBJ) comprise the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing collaboration in which data is exchanged daily to ensure a uniform and comprehensive collection of sequence information.

GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. Thus, if the user specifies the name of a protein and asks for a ‘nucleotide’ search, Entrez will look for the corresponding gene sequence.

The sample GenBank entry described in the rest of this section is one result of a search of the nucleotide database for the keyword ‘chymotrypsin’. Its fields are as follows:


ACCESSION: A field containing the unique identification number of the entry. The sequence and other information may be retrieved from the database simply by searching for a given accession number.

LOCUS: A GenBank title that names the sequence entry. Apart from the accession number, it specifies the number of bases in the entry, the nucleic acid type, a division code (here PRI, which indicates that the sequence is from a primate), and the date on which the entry was made.

DEFINITION: Gives the name of the sequence in the entry. The unique accession number comes next, followed by a VERSION line that gives a version number in case the entry has gone through more than one version.

KEYWORDS: A list of specially defined keywords that are used to index the entries.

SOURCE: Describes the organism from which the sequence was extracted. The complete scientific classification is given.

REFERENCE: Contains the title of the article, the names of the authors, and the name, volume, page numbers and publication year of the journal in which the article appeared.

FEATURES TABLE: One of the most important pieces of information accompanying the sequence. It is an annotation of the sequence and describes whatever the submitters know about it. The features (for nucleic acid sequences) include coding regions, exons, introns, promoters, alternate splice patterns, mutations, variations, and a translation into a protein sequence, if the sequence codes for one. After the features table, a single line gives the base count statistics for the sequence.

SEQUENCE: The sequence is typed in lowercase and, for ease of reading, each line is divided into six columns of ten bases each. A single number on the left of each line gives the position of the first base on that line.
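The sequence layout just described (lowercase, six blocks of ten bases per line, a position number on the left) can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the width of 9 characters for the position number is an assumption about the flat-file layout rather than something stated in this text.

```python
def format_origin(sequence, per_line=60, block=10):
    """Lay out a nucleotide sequence the way the SEQUENCE section does:
    lowercase, blocks of ten bases, position of the first base on the left."""
    seq = sequence.lower()
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # Position numbers are 1-based; the width of 9 is an assumption.
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    return lines


# First 70 bases of the sample chymotrypsin C entry, written in blocks
# so they are easy to check against the record.
bases = "".join(
    "atgttgggca tcactgtcct cgctgcgctc ttggcctgtg cctccacctg tggggtgccc "
    "agcttcccgc".split()
)

for line in format_origin(bases.upper()):
    print(line)
```

The first printed line carries position 1 and six blocks; the second starts at position 61 and carries the remaining single block.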

LOCUS       NM_007272                894 bp    mRNA    linear   PRI 17-DEC-2004
DEFINITION  Homo sapiens chymotrypsin C (caldecrin) (CTRC), mRNA.
ACCESSION   NM_007272
VERSION     NM_007272.1  GI:11321627
KEYWORDS
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 894)
  AUTHORS   Tomomura,A., Akiyama,M., Itoh,H., Yoshino,I., Tomomura,M.,
            Nishii,Y., Noikura,T. and Saheki,T.
  TITLE     Molecular cloning and expression of human caldecrin
  JOURNAL   FEBS Lett. 386 (1), 26-28 (1996)
  MEDLINE   96221265
   PUBMED   8635596
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from S82198.1.
FEATURES             Location/Qualifiers
     source          1..894
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="1"
                     /map="1p36.21"
     gene            1..894
                     /gene="CTRC"
                     /note="synonym: CLCR"
                     /db_xref="GeneID:11330"
                     /db_xref="LocusID:11330"
                     /db_xref="MIM:601405"
     CDS             1..807
                     /gene="CTRC"
                     /note="caldecrin (serum calcium decreasing factor,
                     elastase IV); OTTHUMP00000044526;
                     go_function: trypsin activity [goid 0004295] [evidence
                     IEA];
                     go_function: chymotrypsin activity [goid 0004263]
                     [evidence IEA];
                     go_function: peptidase activity [goid 0008233] [evidence
                     TAS] [pmid 8635596];
                     go_process: proteolysis and peptidolysis [goid 0006508]
                     [evidence TAS] [pmid 8635596]"
                     /codon_start=1
                     /product="chymotrypsin C (caldecrin)"
                     /protein_id="NP_009203.1"
                     /db_xref="GI:11321628"
                     /db_xref="GeneID:11330"
                     /db_xref="LocusID:11330"
                     /db_xref="MIM:601405"
                     /translation="MLGITVLAALLACASTCGVPSFPPNLSARVVGGEDARPHSWPWQ
                     ISLQYLKNDTWRHTCGGTLIASNFVLTAAHCISNTRTYRVAVGKNNLEVEDEEGSLFV
                     GVDTIHVHKRWNALLLRNDIALIKLAEHVELSDTIQVACLPEKDSLLPKDYPCYVTGW
                     GRLWTNGPIADKLQQGLQPVVDHATCSRIDWWGFRVKKTMVCAGGDGVISACNGDSGG
                     PLNCQLENGSWEVFGIVSFGSRRGCNTRKKPVVYTRVSAYIDWINEKMQL"
ORIGIN
        1 atgttgggca tcactgtcct cgctgcgctc ttggcctgtg cctccacctg tggggtgccc
       61 agcttcccgc ccaacctatc cgcccgagtg gtgggaggag aggatgcccg gccccacagc
      121 tggccctggc agatctccct ccagtacctc aagaacgaca cgtggaggca tacgtgtggc
      181 gggactttga ttgctagcaa cttcgtcctc actgccgccc actgcatcag caacacccgg
      241 acctaccgtg tggccgtggg aaagaacaac ctggaggtgg aagacgaaga aggatccctg
      301 tttgtgggtg tggacaccat ccacgtccac aagagatgga atgccctcct gttgcgcaat
      361 gatattgccc tcatcaagct tgcagagcat gtggagctga gtgacaccat ccaggtggcc
      421 tgcctgccag agaaggactc cctgctcccc aaggactacc cctgctatgt caccggctgg
      481 ggccgcctct ggaccaacgg ccccattgct gataagctgc agcagggcct gcagcccgtg
      541 gtggatcacg ccacgtgctc caggattgac tggtggggct tcagggtgaa gaaaaccatg
      601 gtgtgcgctg ggggcgatgg cgtcatctca gcctgcaatg gggactccgg tggcccactg
      661 aactgccagt tggagaacgg ttcctgggag gtgtttggca tcgtcagctt tggctcccgg
      721 cggggctgca acacccgcaa gaagccggta gtctacaccc gggtgtccgc ctacatcgac
      781 tggatcaacg agaaaatgca gctgtgattt gttgctggga gcggcggcag cgagtccctg
      841 caacagcaat aaacttcctt ctcctcgggc cacctgaaaa aaaaaaaaaa aaaa
//

Fig: The output of a search of the nucleotide database in NCBI using the keyword ‘chymotrypsin’
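To make the field descriptions concrete, here is a minimal Python sketch that pulls the LOCUS details, accession, version and sequence out of a flat-file entry. The function names are hypothetical, the embedded text is a shortened excerpt of the sample entry (only the first 70 bases are included, so the parsed sequence is shorter than the 894 bp stated on the LOCUS line), and fields are split on whitespace rather than on fixed columns, which is a simplification.

```python
import re

# Shortened excerpt of the sample entry; most fields and most of the
# sequence are omitted to keep the example small.
RECORD = """\
LOCUS       NM_007272                894 bp    mRNA    linear   PRI 17-DEC-2004
DEFINITION  Homo sapiens chymotrypsin C (caldecrin) (CTRC), mRNA.
ACCESSION   NM_007272
VERSION     NM_007272.1  GI:11321627
ORIGIN
        1 atgttgggca tcactgtcct cgctgcgctc ttggcctgtg cctccacctg tggggtgccc
       61 agcttcccgc
//
"""


def parse_locus(line):
    """Split the LOCUS line into its whitespace-separated parts."""
    fields = line.split()
    return {
        "name": fields[1],
        "length": int(fields[2]),   # number of bases
        "molecule": fields[4],      # nucleic acid type
        "topology": fields[5],
        "division": fields[6],      # e.g. PRI for primate
        "date": fields[7],
    }


def parse_record(text):
    """Collect a few header fields and the sequence from one entry."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            break
        if in_origin:
            # Drop the position number and the spaces between blocks.
            record["sequence"] += re.sub(r"[\d\s]", "", line)
        elif line.startswith("LOCUS"):
            record["locus"] = parse_locus(line)
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("VERSION"):
            record["version"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return record


entry = parse_record(RECORD)
print(entry["accession"], entry["version"], entry["locus"]["division"])
```

For real work an established parser is the safer choice (Biopython's `Bio.SeqIO` reads this format); the sketch only shows how the labelled lines map onto the fields described in the text.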

EMBL

The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl), maintained at the European Bioinformatics Institute (EBI) near Cambridge, UK, is a comprehensive collection of nucleotide sequences and annotation from available public sources. Also known as EMBL-Bank, it constitutes Europe's primary nucleotide sequence resource. The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).

PRIMARY PROTEIN SEQUENCE DATABASES

The ExPASy (Expert Protein Analysis System) World Wide Web server (http://www.expasy.org) is provided as a service to the life science community by a multidisciplinary team at the Swiss Institute of Bioinformatics (SIB). ExPASy started to operate in 1993, as the first WWW server in the field of life sciences. In addition to the main site in Switzerland, seven mirror sites in different continents currently serve the user community. It provides access to a variety of databases and analytical tools dedicated to proteins and proteomics. ExPASy databases include SWISS-PROT and TrEMBL, SWISS-2DPAGE, PROSITE, ENZYME and the SWISS-MODEL repository. Analysis tools are available for specific tasks relevant to proteomics, similarity searches, pattern and profile searches, post-translational modification prediction, topology prediction, primary, secondary and tertiary structure analysis and sequence alignment. These databases and tools are tightly interlinked: a special emphasis is placed on integration of database entries with related resources developed at the SIB and elsewhere, and the proteomics tools have been designed to read the annotations in SWISS-PROT in order to enhance their predictions.

SWISS-PROT
TrEMBL
PIR

UniProtKB

The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added.

The UniProt Knowledgebase consists of two sections: a section containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and a section with computationally analyzed records that await full manual annotation. For the sake of continuity and name recognition, the two sections are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively.
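One practical place where the two sections show up is the FASTA header of a UniProtKB sequence, which starts with `sp` for Swiss-Prot entries and `tr` for TrEMBL entries. That convention is not described in this text, so treat the Python sketch below as an illustration under that assumption; the function name is hypothetical and the second header is a made-up placeholder entry.

```python
SECTIONS = {
    "sp": "UniProtKB/Swiss-Prot (reviewed, manually annotated)",
    "tr": "UniProtKB/TrEMBL (unreviewed, automatically annotated)",
}


def uniprot_section(fasta_header):
    """Return the UniProtKB section indicated by a FASTA header line.

    Assumes headers of the form '>db|ACCESSION|ENTRY_NAME description'.
    """
    prefix = fasta_header.lstrip(">").split("|", 1)[0]
    return SECTIONS.get(prefix, "unknown")


# Illustrative headers; the 'tr' entry is a made-up placeholder.
headers = [
    ">sp|P00766|CTRA_BOVIN Chymotrypsinogen A OS=Bos taurus",
    ">tr|X00000|X00000_EXAMPLE Uncharacterized protein (placeholder)",
]

for header in headers:
    print(header.split()[0], "->", uniprot_section(header))
```

Filtering a downloaded set on this prefix is a quick way to restrict an analysis to reviewed records.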

PIR

The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic, proteomic and systems biology research and scientific studies. The PIR-PSD (Protein Sequence Database) is a collaborative endeavour between the PIR, the MIPS (Munich Information Centre for Protein Sequences, Germany) and the JIPID (Japan International Protein Information Database, Japan). This database grew out of the work of Margaret Dayhoff and her colleagues and collaborators at the NBRF. Beginning in the 1960s, before computerised databases existed, Dayhoff collected published protein sequences and, on the basis of alignments and sequence comparisons, classified and annotated the collection manually. She made it available to the scientific community in the form of the Atlas of Protein Sequence and Structure, a set of printed volumes issued periodically.

The Atlas of Protein Sequence and Structure grew into the PIR-PSD, which continues the tradition of high-quality annotation, except that the classification is now carried out by well-validated automatic procedures. The database is now fully computerised and uses the Oracle object-relational DBMS. The PIR-PSD is a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. It is available at http://pir.georgetown.edu/pirwww. A unique characteristic of the PIR-PSD is its classification of protein sequences based on the superfamily concept. Sequences in PIR-PSD are also classified based on homology domains and sequence motifs.

Secondary databases

They are sometimes known as pattern databases and contain results from the analysis of the sequences in the primary databases. Derived, or secondary, databases of amino acid sequences collect together the patterns found in protein sequences rather than the complete sequences.

Examples of secondary databases include:

PROSITE
Pfam
BLOCKS
PRINTS
ProDom

PROSITE is one such pattern database, which is accessible at http://www.expasy.ch/prosite. The protein motifs or patterns are encoded as ‘regular expressions’. The information corresponding to each entry in PROSITE is of two forms: the pattern and the related descriptive text.
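Because PROSITE patterns are regular expressions in their own notation, they translate fairly directly into the syntax of a general-purpose regex engine. The Python sketch below handles the common elements (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name is hypothetical and rarer syntax is deliberately left out, so this is an approximation rather than a complete converter.

```python
import re

# One pattern element: a residue specification plus an optional repeat count.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        if element.startswith("<"):      # anchored at the N-terminus
            regex += "^"
            element = element[1:]
        at_end = element.endswith(">")   # anchored at the C-terminus
        if at_end:
            element = element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            regex += "."                               # any residue
        elif residue.startswith("{"):
            regex += "[^" + residue[1:-1] + "]"        # forbidden residues
        else:
            regex += residue                           # 'A' or '[ABC]' as-is
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        if at_end:
            regex += "$"
    return regex


# The N-glycosylation site pattern, applied to a fragment of the
# translation shown in the sample GenBank entry.
glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(glyco, "SFPPNLSARVVGG")
print(glyco, hit.group() if hit else None)
```

A regex hit only says that the sequence is compatible with the pattern; the descriptive text attached to each PROSITE entry is what tells the user how much weight a match deserves.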

The PRINTS database (http://www.bioinf.man.ac.uk/dbbrowser/PRINTS) stores the protein sequence patterns as ‘fingerprints’. A fingerprint is a set of motifs or patterns rather than a single one. The advantage of a fingerprint over a single motif is that a sequence which is closely related to a given family of proteins would share all the motifs that define the fingerprint for that family, while more distantly related sequences or members of a subfamily would share one or more of the motifs, but not all.
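The all-versus-some distinction can be sketched in a few lines of Python. Everything here is illustrative: the motif names, the motif patterns and the test sequences are hypothetical, and plain regular expressions stand in for whatever scoring a real fingerprint search applies to its motifs.

```python
import re


def fingerprint_hits(sequence, motifs):
    """Return the names of the motifs that occur in the sequence."""
    return [name for name, pattern in motifs.items()
            if re.search(pattern, sequence)]


def classify(sequence, motifs):
    """Grade a sequence by how much of the fingerprint it carries."""
    hits = fingerprint_hits(sequence, motifs)
    if len(hits) == len(motifs):
        return "full match"       # consistent with a close family member
    if hits:
        return "partial match"    # consistent with a distant relative or subfamily
    return "no match"


# Hypothetical three-motif fingerprint.
fingerprint = {
    "motif_1": "TAAHC",
    "motif_2": "GDSGGP",
    "motif_3": "WG.G",
}

close = "AAATAAHCAAAGDSGGPAAAWGRGAA"   # made-up sequence with all three motifs
distant = "AAATAAHCAAAA"                # made-up sequence with one motif

print(classify(close, fingerprint), "/", classify(distant, fingerprint))
```

The sketch ignores motif order and spacing, which a real fingerprint would also take into account.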

The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. Automatically generated entries of Pfam-B are used for identifying functionally conserved regions when no Pfam-A entries are found.

Genome Databases

TIGR

The Institute for Genomic Research (TIGR) was a non-profit genomics research institute founded in 1992 by Craig Venter in Rockville, Maryland, United States. It is now a part of the J. Craig Venter Institute. TIGR's Genome Projects are a collection of curated databases containing DNA and protein sequence, gene expression, cellular role, protein family, and taxonomic data for microbes, plants and humans. TIGR sequenced the first genome of a free-living organism, the bacterium Haemophilus influenzae, in 1995. Later in 1995 it sequenced a second bacterium, Mycoplasma genitalium, and less than a year later TIGR's Carol Bult led the project to sequence the first genome of an archaeal species, Methanococcus jannaschii. TIGR followed these accomplishments with the genomes of the pathogenic bacteria Borrelia burgdorferi (which causes Lyme disease) in 1997 and Treponema pallidum (which causes syphilis) in 1998. In 1999 TIGR published the sequence of the radioresistant polyextremophile Deinococcus radiodurans.

SANGER

The Wellcome Trust Sanger Institute (WTSI), formerly the Sanger Centre, is a genomics research institute primarily funded by the Wellcome Trust. It uses large-scale sequencing, informatics and analysis of genetic variation to further the understanding of gene function in health and disease and to generate data and resources of lasting value to biomedical research. The institute is named after the double Nobel laureate Frederick Sanger. It is located on the Wellcome Trust Genome Campus in the United Kingdom, by the village of Hinxton, outside Cambridge. It shares this location with the European Bioinformatics Institute (EBI).

Metabolic Pathway Databases

KEGG

KEGG is an integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information. Genomic and chemical information represent the molecular building blocks of life in the genomic and chemical spaces, respectively, and systems information represents functional aspects of biological systems, such as the cell and the organism, that are built from the building blocks. KEGG has been widely used as a reference knowledge base for the biological interpretation of large-scale datasets generated by sequencing and other high-throughput experimental technologies. There are many types of databases in KEGG. KEGG PATHWAY is a database of metabolism and other cellular processes, as well as human diseases, manually created from published materials. KEGG GENES is a database of gene catalogs of complete genomes with manual annotation.


EMP

The Enzymes and Metabolic Pathways database (EMP) is a database on the biochemistry of some 1800 different organisms. The data available cover amino acid metabolism, aromatic hydrocarbons, carbohydrate metabolism, coenzymes and vitamins, electron transport, enzyme metabolism, lipid metabolism and more. An extraction of over 1800 pictorial representations of metabolic pathways from this collection is freely available on the World Wide Web. This collection will play an important role in the interpretation of genetic sequence data, as well as offering a meaningful framework for the integration of many other forms of biological data.

EcoCyc

EcoCyc is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. The long-term goal of the project is to describe the molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists, and for biologists who work with related microorganisms. EcoCyc contains the complete genome sequence of E. coli, and describes the nucleotide position and function of every E. coli gene. It also describes the genetic network of E. coli, including its operons, promoters, transcription factors, and transcription-factor binding sites.

BioCyc

The BioCyc collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of different organisms.

Specialized Databases

IMGT, the International ImMunoGeneTics information system® (http://www.imgt.org), is a high-quality integrated knowledge resource specialized in the immunoglobulins (IG), T cell receptors (TR), major histocompatibility complex (MHC), immunoglobulin superfamily (IgSF), major histocompatibility complex superfamily (MhcSF) and related proteins of the immune system (RPI) of human and other vertebrate species. It was created in 1989 by Marie-Paule Lefranc. IMGT, a European project since 1992, works in close collaboration with EBI.

REBASE is a comprehensive database of information about restriction enzymes, DNA methyltransferases and related proteins involved in restriction-modification. The contents of REBASE are available by browsing from the web (http://rebase.neb.com/rebase/rebase.html), through selected compilations by ftp (ftp.neb.com), and as monthly updates that can be requested via email.

BRENDA is the main collection of enzyme functional data available to the scientific community. It is available free of charge via the internet (www.brenda-enzymes.org) and as an in-house database for commercial users (on request from the distributor, Biobase). BRENDA is maintained and developed at the Institute of Biochemistry and Bioinformatics at the Technical University of Braunschweig, Germany. Data on enzyme function are extracted directly from the primary literature by scientists holding a degree in biology or chemistry. Formal and consistency checks are done by computer programs, and each data set on a classified enzyme is checked manually by at least one biologist and one chemist. The enzymes are classified according to the Enzyme Commission list of enzymes. Some 4000 "different" enzymes are covered. The data collection is being developed into a metabolic network information system with links to enzyme expression and regulation information.

LIGAND data is available through the following databases.

KEGG LIGAND is a composite database consisting of the COMPOUND, GLYCAN, REACTION, RPAIR, and ENZYME databases, whose entries are identified by C, G, R, RP, and EC numbers, respectively.
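Since each component database has its own identifier prefix, an identifier can be routed to its database by its shape alone. The Python sketch below assumes that C, G, R and RP numbers are a prefix followed by five digits and that EC numbers have four dot-separated parts; those formats and the function name are assumptions for illustration, not something this text specifies.

```python
import re

# Identifier shapes are assumptions: prefix plus five digits, or a
# four-part EC number with an optional "EC" label.
LIGAND_ID_PATTERNS = {
    "COMPOUND": re.compile(r"C\d{5}"),
    "GLYCAN": re.compile(r"G\d{5}"),
    "REACTION": re.compile(r"R\d{5}"),
    "RPAIR": re.compile(r"RP\d{5}"),
    "ENZYME": re.compile(r"(?:EC ?)?\d+\.\d+\.\d+\.\d+"),
}


def ligand_database(identifier):
    """Return the KEGG LIGAND component an identifier belongs to, or None."""
    for name, pattern in LIGAND_ID_PATTERNS.items():
        # fullmatch keeps 'RP00001' from being mistaken for an R number.
        if pattern.fullmatch(identifier):
            return name
    return None


for example in ["C00031", "G00001", "R00001", "RP00001", "EC 1.1.1.1"]:
    print(example, "->", ligand_database(example))
```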

PDB-Ligand (http://www.idrtech.com/PDB-Ligand/) is a three-dimensional structure database of small molecular ligands that are bound to larger biomolecules deposited in the Protein Data Bank (PDB). PDBsum and Relibase are examples of other ligand databases.

STRUCTURAL DATABASES

Structure databases, like sequence databases, come in two varieties, primary and secondary. Their data come mainly from three sources: structures determined by X-ray crystallography, which form the large majority; structures determined by NMR experiments; and structures obtained by molecular modelling using bioinformatics tools.

Primary Structure Databases

PDB - Protein Data Bank

CSD - Cambridge Structural Database

Derived or Secondary Databases of Biomolecular Structures

SCOP - Structural Classification Of Proteins

CATH - Class, Architecture, Topology and Homologous superfamily

PALI - Phylogeny and ALIgnment of homologous protein structures

NDB - Nucleic acid Data Base

FSSP - Fold classification based on Structure-Structure alignment of Proteins

PDB stands for Protein Data Bank. In spite of the name, the PDB archives the three-dimensional structures not only of proteins but of all biologically important molecules, such as nucleic acid fragments, RNA molecules, large peptides such as the antibiotic gramicidin, and complexes of proteins and nucleic acids. The data in the PDB are organised as flat files. Examples of PDB files are Pdb100d and Pdb200d. The PDB is a key resource in areas of structural biology, such as structural genomics. The PDB was begun at the Brookhaven National Laboratory in the USA, and was known for a long time as the Brookhaven Protein Data Bank. In recent times its curation and maintenance have been taken over by a consortium of three organisations called the Research Collaboratory for Structural Bioinformatics, comprising Rutgers University, the San Diego Supercomputer Center and the National Institute of Standards and Technology. It is available over the Internet at http://www.rcsb.org.
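Flat files of this kind are typically read by column position rather than by delimiter. The Python sketch below reads the coordinates from a single ATOM line using the column ranges of the PDB flat-file format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The column layout is not given in this text, the function names are hypothetical, and the atom line is a made-up example generated by the helper so that its columns are guaranteed to line up.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a made-up ATOM line with fields in fixed column positions."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} "
            f"{chain}{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Read selected fields of an ATOM line by column position
    (Python slices are 0-based, the format's columns are 1-based)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Coordinates are invented for illustration, not taken from a real entry.
line = make_atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom(line)
print(line)
print(atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace would fail on real files, because adjacent fields can run together when values fill their columns; slicing by position avoids that.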

The Cambridge Structural Database (CSD) was originally a project of the University of Cambridge, which set up the Cambridge Crystallographic Data Centre (CCDC) to collect together the published three-dimensional structures of small organic molecules. For each entry in the CSD there are three distinct types of information stored. These are categorised as bibliographic information, chemical connectivity information and the three-dimensional coordinates.

Bibliographic databases - Literature databases

They contain various journals, abstracts and scientific articles.

PubMed
MEDLINE