Identification of Genes from Databases

Upload: hiren-m-patel

Post on 30-May-2018


  • 8/14/2019 Identification of Gene From Databases

    1/61

    Identification of Genes from Databases

    Presented to
    Dr. C.G. Joshi, Professor
    Department of Animal Biotechnology, College of Veterinary Science & A.H., Anand

    Presented by
    Patel Hiren M
    M.V.Sc. (Animal Biotechnology)

    Introduction

    Evolution is somewhat conservative.
    Evolution seems to have often involved the duplication and divergence of genes.
    A certain sequence may indicate a certain function.
    A database is a structured set of data held in a computer.
    The structures of new genes are constantly being added.

    Databases Require Some Basic Knowledge

    What is a gene?
    What is gene structure?
    - Prokaryotes
    - Eukaryotes
    ORFs (open reading frames)
    cDNA library
    Human Genome Project

    Prokaryote Genome

    In prokaryotes, introns are rare, and genes often consist of a single uninterrupted stretch of DNA, called a cistron, that codes for a product.
    Functionally related genes are often clustered and can be transcribed together on the same mRNA.
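Because a prokaryotic gene is typically one uninterrupted coding stretch, a simple scan for open reading frames already gives candidate genes. The sketch below is an illustration only, not a method from the slides: it assumes ATG as the only start codon, looks at the forward strand only, and the function name and `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop stretches.

    Forward strand only; `end` is the index just past the stop codon.
    `min_codons` counts the codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A real gene finder would also scan the reverse complement and use statistical evidence, since long ORFs occur by chance.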

    Eukaryotic Genes

    Much more complex than in prokaryotes.
    Large genomes (0.1 to 3 billion bases).
    A typical mammalian cell has about 1,500 times as much DNA as an E. coli cell.
    Low coding density.


    Gene Structure Eukaryotes

    Data Mining

    Development of new tools for data mining:
    Sequence alignment
    Genome sequencing
    Genome comparison
    Microarray data analysis
    Proteomics data analysis
    Small-molecule array analysis

    What is a database?

    A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions.
    The software used to manage and query a database is known as a database management system (DBMS).
    The properties of database systems
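To make the idea of a program consulting a database through a DBMS concrete, here is a minimal sketch using SQLite from Python's standard library. The table layout and the gene symbols (`geneA`, `geneB`, `geneC`) are invented placeholders, not records from any real database.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a tiny, made-up gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, chromosome TEXT)"
    )
    # Placeholder rows for illustration only.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("geneA", "Bos taurus", "1"),
            ("geneB", "Bos taurus", "1"),
            ("geneC", "Bos taurus", "2"),
        ],
    )
    return conn

def genes_on_chromosome(conn, organism, chromosome):
    """Ask the DBMS a question: which genes lie on this chromosome?"""
    rows = conn.execute(
        "SELECT symbol FROM gene WHERE organism = ? AND chromosome = ? ORDER BY symbol",
        (organism, chromosome),
    ).fetchall()
    return [row[0] for row in rows]

conn = build_demo_db()
print(genes_on_chromosome(conn, "Bos taurus", "1"))
```

The program only states what it wants. How the rows are stored and searched is left to the DBMS.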

    Annotation Forms

    [Flowchart: literature curation for RegulonDB. Box labels: PubMed; keyword search; found articles; articles classification; select articles that are not cited in RegulonDB; read and select abstracts with transcriptional information; selected abstracts; "Is this a general article?"; read complete general articles; read complete articles about conditions; article database; enter data into the database through annotation forms; RegulonDB.]

    Annotation forms: Registration Format

    Annotation forms

    What makes a good database?

    Quality
    - Manual (slow)
    - No overlap between entries
    - Reliable
    - Some data might be missing

    Coverage
    - Automatic (fast)
    - Overlapping entries
    - Errors, biases
    - Up-to-date

    for Gene Identification

    Find candidate genes for the trait (time and cost!)
    - What genes are there?
    - How are the genes expressed?
    - What do they do?
    - How could they play a role in the disease?
    - Gene synonyms
    - Gene location


    GENE SCORING

    DATA SOURCES

    PubMed
    Conserved Domain Database
    GeneAtlas
    dbSNP

    Links to the above-mentioned databases:
    Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
    PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
    CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
    Homologene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
    GeneAtlas: http://wombat.gnf.org/
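The Entrez links above all share one pattern: a base address plus a `db=` parameter. The same idea can be scripted. The sketch below builds a search URL for NCBI's E-utilities `esearch` endpoint, which is an assumption on my part (the slides only show the older `query.fcgi` web links). It makes no network request, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch" (not shown on the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build (but do not fetch) a search URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)

print(esearch_url("gene", "insulin"))
print(esearch_url("pubmed", "growth hormone", retmax=5))
```

Switching `db` between values such as `gene` and `pubmed` mirrors choosing between the links listed above.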

    Gene Databases

    Once a genome is in place, it is desirable to study the regions that make a particular organism what it is.
    Much of this information is located in the genic regions of the organism's genome.
    Several databases of genes and related structures exist.
    One such database is the RefSeq database, curated at NCBI.


    Classification of Databases

    Primary sequence databases

    Secondary sequence databases

    Genomic database resource

    Secondary sequence databases

    UniGene: historically used for selecting sequences for microarrays.
    The TIGR Gene Indices: the gene indices at The Institute for Genomic Research (TIGR) are arranged according to species. The TIGR GI covers 19 animal species, 18 plant species and 7 fungal species. TIGR also includes full information about splice variants in the database.
    RefSeq (NCBI Reference Sequence Project): RefSeq aims to collect sequences of genomes, complete chromosomes, genomic regions, mRNAs and other types of RNA.
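RefSeq encodes the kind of molecule in the accession prefix, so the categories listed above can be told apart programmatically. The sketch below covers only a handful of common prefixes. The function name is hypothetical and the accession in the usage line is a made-up placeholder, not a real record.

```python
# A partial map of RefSeq accession prefixes to molecule types.
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule (chromosome, organelle, plasmid)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA model",
    "XR_": "predicted non-coding RNA model",
    "XP_": "predicted protein model",
}

def refseq_molecule_type(accession):
    """Classify an accession by its first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")

# Placeholder accession, for illustration only.
print(refseq_molecule_type("NM_000000.1"))
```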

    What are genome databases?

    Genome databases contain, well, genomic information collected from many sources.
    Genome assembly
    Gene predictions
    Known genes, mRNA, ESTs, proteins
    Genetic maps, markers and polymorphisms
    Gene expression and phenotypes
    Annotations
    Interspecies homologues

    Genomic Database Resources

    Ensembl - http://www.ensembl.org
    19 species
    UCSC Genome Browser - http://genome.ucsc.edu/
    28 species (insects!)
    NCBI MapViewer - http://www.ncbi.nlm.nih.gov/mapview/
    38 species (plants, fungi!)

    Queries to Ensembl

    Browse genome
    Quick search

    Quick search results

    Gene view link

    Gene View

    Various Gene Databases are:

    SNP Resources
    - dbSNP: database of single nucleotide polymorphisms, http://www.ncbi.nlm.nih.gov/SNP/
    - SNP Consortium database: http://snp.cshl.org/
    - rSNP Guide: http://util.bionet/nsc.ru/database

    SNPs at NCBI

    Select SNP database
    Free text query goes here

    Accession number for the SNP, and species.
    Location of the SNP in the sequence
    Links to alternative views

    Various Gene Databases are:

    EST Resources
    - ESTs are expressed sequence tags: partial sequences of cDNA copies of the mRNA found within a particular cell.
    - Information from ESTs can be used to infer the splicing patterns of genes, the occurrence of genes, etc.
    dbEST: http://www.ncbi.nlm.nih.gov/dbEST/

    Various Gene Databases are:

    Protein Databases
    - The central dogma states that DNA is transcribed into RNA, which in turn is translated into protein.
    - Since genes code for proteins, it is important to store known information about proteins in databases.
    - There are many different protein databases, many of them dealing with specific protein families.
    Databases for curated proteins include:
    InterPro: Protein families and domains, http://www.ebi.ac.uk/interpro
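The DNA to RNA to protein flow described on this slide can be sketched in a few lines. The example below assumes the standard genetic code and a coding-strand DNA sequence read in frame from its first base. The function names are illustrative.

```python
# Standard genetic code, laid out in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(dna):
    """Translate coding-strand DNA codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(transcribe("ATGGCCTAA"))
print(translate("ATGGCCTAA"))
```

Protein databases store the output of this step, usually together with the evidence for it.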

    Various Gene Databases are:

    Protein Sequence Motifs (Domains)
    - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
    - eMOTIF: http://motif.stanford.edu/emotif/
    - Pfam: http://www.sanger.ac.uk/Software/Pfam/
    Structure Databases

    Various Gene Databases are:

    Gene Expression Databases
    - Once the location and sequence of genes are known, the next step is to determine their function.
    - Various biological experiments can be performed on gene data, including the newer microarray technology, which we will cover in class.
    - Databases containing the results of these experiments are available:
    KEGG: http://www.genome.ad.jp/kegg/
    Klotho: http://www.ibc.wustl.edu/klotho

    Comparison of Sequence against Sequence Database

    The most commonly used programs for comparing an unknown sequence against the sequences in a database are BLAST and FASTA.
    BLAST and FASTA are fast heuristic approximations of the Smith-Waterman local alignment algorithm.
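For reference, the exact local alignment that BLAST and FASTA approximate can be written as a short dynamic program. This is a minimal Smith-Waterman scorer with a linear gap penalty. The scoring values (match +2, mismatch -1, gap -1) are arbitrary choices for the example, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere: this is
            # what makes the alignment local rather than global.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The table has one cell per pair of positions, so the cost grows with the product of the two lengths. That cost is why database searches rely on heuristics.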

    The FASTA algorithm

    Developed by Lipman and Pearson, 1985.
    First program to search sequence databases for gapped local alignments.
    The best-scoring local region is given as output.
    It is an approximate heuristic algorithm used to compute sub-optimal pairwise similarity.
    http://www-nbrf.georgetown.edu/pirwww/search/fasta.html
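The heuristic in FASTA starts from short exact word matches (k-tuples) and looks for diagonals of the comparison matrix where many of them line up. The sketch below shows only that first stage, under simplifying assumptions: no rescoring, no joining of regions, and a hypothetical function name.

```python
from collections import Counter, defaultdict

def ktuple_diagonals(query, subject, k=2):
    """Count shared k-tuples per diagonal (query offset minus subject offset)."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Each shared word votes for the diagonal it lies on.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals

print(ktuple_diagonals("ACGT", "TTACGT").most_common(1))
```

A diagonal with many votes marks a candidate region that is worth aligning more carefully.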

    BLAST

    BLASTn, Megablast
    Nucleotide query: BLASTx
    Protein query: tBLASTn
    Nucleotide query: tBLASTx
    Conserved domains: RPS-BLAST, CDART
    Pairwise BLAST: BLAST 2 Sequences
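Which BLAST program applies depends on whether the query and the database are nucleotide or protein. The lookup below is a sketch of that rule. The function and parameter names are hypothetical, and `blastp` (protein against protein) is included even though it does not appear in the list above.

```python
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query translated in six frames
    ("protein", "nucleotide"): "tblastn",  # database translated in six frames
}

def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program from the query and database sequence types."""
    if translate_both:
        # tblastx translates both sides, so both must be nucleotide.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]

print(choose_blast_program("nucleotide", "protein"))
```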

    What is BLAST?

    BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
    "Local" means it searches and aligns sequence segments, rather than aligning the entire sequence.
    It is able to detect relationships among sequences which share only isolated regions of similarity.
    Currently, it is the most popular and most accepted sequence analysis tool.

    Why BLAST?

    Identify unknown sequences: the best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is well characterized, then you may have access to a wealth of biological information.
    Help gene/protein function and structure prediction: genes with similar sequences tend to share similar functions or structures.
    Identify protein families: group related (paralogous or orthologous) genes and their proteins into a family.
    Prepare sequences for multiple alignments.

    Go to BLAST

    Go to nucleotide BLAST

    Blast 1

    Blast 2

    Blast limit by taxon

    Blast results

    Interpret BLAST results - Distribution

    This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.
    Query sequence
    BLAST hits. Click to access the pairwise alignment.

    Interpret BLAST results - Description

    ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank.
    Gene/sequence definition
    Bit score: higher is better. Click to access the pairwise alignment.
    Expect value (E value): lower is better. It is the number of hits with a score at least this high that would be expected by chance in a database of this size.
    Links
    The description (also called definition) lines are listed under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.
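The link between bit score and E value can be made explicit. The sketch below uses the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. This formula comes from the usual BLAST statistics rather than from the slides, and real BLAST uses corrected "effective" lengths, so reported values will differ somewhat. The hit names are placeholders.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)

def significant_hits(hits, query_length, database_length, threshold=10.0):
    """Keep the names of hits whose E value is below the threshold.

    `hits` is a list of (name, bit_score) pairs. "Significant" here means
    only "below the cutoff", exactly as in the BLAST report heading.
    """
    return [
        name
        for name, bits in hits
        if expect_value(bits, query_length, database_length) < threshold
    ]

print(expect_value(10, 32, 32))
print(significant_hits([("hit_a", 50.0), ("hit_b", 5.0)], 100, 1000))
```

Each extra bit of score halves the E value, and doubling the database size doubles it.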

    Interpret BLAST results - pairwise alignments

    Query line: the segment from the query sequence.
    Subject line: the segment from the hit (subject) sequence.
    Middle line: the consensus bases (the positions where query and subject match).
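The three-line layout described above is easy to reproduce for a nucleotide alignment. In this sketch the middle line carries a `|` wherever the two aligned rows hold the same base. The function name is illustrative and the two rows are toy data.

```python
def match_line(query_row, subject_row):
    """Build the middle line for two aligned rows of equal length."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if q == s and q != "-" else " "
        for q, s in zip(query_row, subject_row)
    )

query_row = "ACGT-A"
subject_row = "ACCTTA"
print("Query  " + query_row)
print("       " + match_line(query_row, subject_row))
print("Sbjct  " + subject_row)
```

Counting the `|` characters gives the number of identities reported for the alignment.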

    SOFTWARE FOR IDENTIFICATION OF GENES

    Some computational tools that are most commonly used for gene prediction:
    GeneMark
    Glimmer M
    GRAIL
    GenScan
    Genebuilder

    GeneMark

    This software was developed by Mark Borodovsky and James McIninch.
    It is used for finding prokaryotic genes. It employs a non-homogeneous Markov model to classify DNA regions into protein-coding and non-coding sequences.
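"Non-homogeneous" means that the model's probabilities depend on the position within the codon. The toy sketch below captures only that idea, and it is not GeneMark's actual model: it uses base frequencies per codon position with no dependence on preceding bases, a uniform background, add-one smoothing, and a single made-up training sequence.

```python
import math
from collections import Counter

def train_phase_model(coding_seqs, alphabet="ACGT"):
    """Estimate base probabilities separately for codon positions 0, 1, 2."""
    counts = [Counter() for _ in range(3)]
    for seq in coding_seqs:
        for pos, base in enumerate(seq.upper()):
            counts[pos % 3][base] += 1
    model = []
    for phase_counts in counts:
        # Add-one smoothing so unseen bases keep a non-zero probability.
        total = sum(phase_counts[b] for b in alphabet) + len(alphabet)
        model.append({b: (phase_counts[b] + 1) / total for b in alphabet})
    return model

def coding_log_odds(seq, model, background=0.25):
    """Log2 odds of seq under the phase model versus a uniform background."""
    return sum(
        math.log2(model[pos % 3][base] / background)
        for pos, base in enumerate(seq.upper())
    )

model = train_phase_model(["GCAGCAGCAGCA"])  # toy training data
print(coding_log_odds("GCAGCA", model))  # in frame: positive score
print(coding_log_odds("CAGCAG", model))  # out of frame: negative score
```

A positive score favours the coding hypothesis in that reading frame. Sliding a window along the genome and scoring every frame gives a crude coding-potential profile.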

    Glimmer

    This software was developed by Steven Salzberg et al. at Johns Hopkins University and TIGR.
    Glimmer uses interpolated Markov models to identify coding regions and distinguish them from non-coding DNA. Glimmer is used as the primary gene finder tool at TIGR.

    GRAIL

    This software was developed by Ed Uberbacher et al. at Oak Ridge National Laboratory. This tool identifies exons, polyA sites, promoters, CpG islands, repetitive elements and frame shift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements.

    GenScan

    GenScan was developed by Chris Burge and Samuel Karlin at Stanford University. This program uses a probabilistic model of gene structure that is based on actual biological information about the transcriptional, translational and splicing signals. Its high speed and accuracy make GenScan the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.

    Genebuilder

    Genebuilder performs ab initio gene prediction using numerous parameters, such as GC content, dicodon frequencies, splicing site data, CpG islands, repetitive elements and others. It also performs BLAST searches of predicted genes against protein and EST databases.
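Two of the parameters named above, GC content and CpG islands, are simple to compute. The sketch below is not Genebuilder's code. It uses the common observed/expected CpG ratio, (number of CG dinucleotides * length) / (number of C * number of G), and the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for a sequence window."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

window = "CGCGCGCG"
print(gc_content(window), cpg_obs_exp(window))
```

In practice both values are computed over sliding windows, and stretches where they stay high are reported as candidate CpG islands. The exact thresholds vary between tools.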

    Sequence analysis: overview

    [Flowchart. Recoverable labels, grouped by stage:
    Sequence entry: manual sequence entry; sequence database browsing; sequencing project management.
    Nucleotide sequence analysis: nucleotide sequence file; search databases for similar sequences; sequence comparison; multiple sequence analysis; design further experiments (restriction mapping, PCR planning); search for known motifs; search for protein coding regions; coding: translate into protein; non-coding: RNA structure prediction.
    Protein sequence analysis: protein sequence file; search databases for similar sequences; sequence comparison; search for known motifs; predict secondary structure; predict tertiary structure.
    Protein family analysis: create a multiple sequence alignment; edit the alignment; format the alignment for publication; molecular phylogeny.]
