identification of gene from databases
TRANSCRIPT
-
8/14/2019 Identification of Gene From Databases
1/61
Identification of Genes from Databases
Presented to
Dr. C.G. Joshi, Professor
Department of Animal Biotechnology, College of Veterinary Science & A.H., Anand
Presented by
Patel Hiren M
M.V.Sc. (Animal Biotechnology)
-
Introduction
Evolution is somewhat conservative.
Evolution seems to have often involved the duplication and divergence of genes.
A certain sequence may indicate a certain function.
A database is a structured set of data held in a computer.
The structures of new genes are constantly being added.
-
Databases
Require Some Basic Knowledge
What is a gene?
What is gene structure?
- Prokaryotes
- Eukaryotes
ORFs (Open Reading Frames)
cDNA library
Human Genome Project
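To make the ORF idea concrete, here is a minimal sketch that scans the three forward reading frames for stretches running from ATG to the next in-frame stop codon. The function name, the `min_codons` cutoff and the example sequence are illustrative assumptions; the reverse strand is ignored for brevity.

```python
# Minimal ORF scan on the forward strand only (an illustrative simplification).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop stretches; end is exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGGCTTAAGG"))  # [(2, 11, 2)]
```

The single hit covers ATG GCT TAA in the third frame. Real gene finders add the reverse strand, alternative start codons and length statistics on top of this.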
-
Prokaryote Genome
In prokaryotes, introns are less common and genes often contain a single uninterrupted stretch of DNA, called a cistron, that codes for a product.
Functionally related genes are often clustered and can be transcribed together on the same mRNA.
-
Eukaryotic Genes
Much more complex than in prokaryotes.
Large genomes (0.1 to 3 billion bases)
A typical mammalian cell has 1,500 times as much DNA as a cell of E. coli.
Low coding density (
-
Gene Structure Eukaryotes
-
Data Mining
Development of new tools for data mining
Sequence alignment
Genome sequencing
Genome comparison
Microarray data analysis
Proteomics data analysis
Small molecular array analysis
-
What is a database?
A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions.
The software used to manage and query a database is known as a database management system (DBMS).
The properties of database systems
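As a small illustration of a program consulting a database to answer a question, the sketch below uses Python's built-in sqlite3 module as the DBMS. The table layout and the gene records (GENE_A and so on) are hypothetical placeholders, not entries from any real resource.

```python
import sqlite3

# In-memory database; the schema and rows are made-up placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, chromosome TEXT)"
)
rows = [
    ("GENE_A", "Bos taurus", "1"),
    ("GENE_B", "Bos taurus", "5"),
    ("GENE_C", "Gallus gallus", "1"),
]
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", rows)

def genes_for(organism):
    """Ask the DBMS which gene symbols are stored for one organism."""
    cur = conn.execute(
        "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol", (organism,)
    )
    return [row[0] for row in cur.fetchall()]

print(genes_for("Bos taurus"))  # ['GENE_A', 'GENE_B']
```

The query is parameterised (`?`) so that the DBMS, not string formatting, handles the user-supplied value.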
-
Annotation Forms
[Flowchart of the RegulonDB annotation workflow. Box labels: PubMed; keywords search; found articles; articles classification; "Is this a general article?"; read complete general articles; read complete articles about conditions; select articles that are not cited on RegulonDB; read and select abstracts with transcriptional information; selected abstracts; article database; enter data into database through annotation forms; RegulonDB.]
-
Annotation forms: Registration Format
-
Annotation forms
-
What makes a good database?
Quality
- Manual (slow)
- No overlap between entries
- Reliable
- Some data might be missing
Coverage
- Automatic (fast)
- Overlapping entries
- Errors, biases
- Up-to-date
-
for Gene Identification
Find candidate genes for the trait (time and cost!)
- What genes are there?
- How are the genes expressed?
- What do they do?
- How could they play a role in the disease?
- Gene synonyms
- Gene location
-
GENE SCORING
-
DATA SOURCES
PubMed
Conserved Domain Database
GeneAtlas
dbSNP
Links to above-mentioned databases:
Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
Homologene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
GeneAtlas: http://wombat.gnf.org/
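The NCBI databases listed above can also be queried programmatically. The sketch below only builds a search URL; it assumes the NCBI E-utilities `esearch.fcgi` endpoint (not mentioned on the slide) rather than the `query.fcgi` pages linked above, and the search term is a made-up example. No network request is made.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint, used here instead of the
# browser-oriented query.fcgi links given on the slide.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (e.g. 'gene', 'pubmed')."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# Hypothetical query: a candidate gene restricted to one organism.
print(esearch_url("gene", "myostatin AND Bos taurus[Organism]"))
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers. NCBI asks users to keep request rates low, so batch queries should be throttled.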
-
Gene Databases
Once a genome is in place, it is desirable to study the regions that make a particular organism what it is.
One such resource is located in the genetic regions of the organism.
Several databases of genes and related structures exist.
One such database is the RefSeq database curated at NCBI.
-
Classification of Databases
Primary sequence databases
Secondary sequence databases
Genomic database resource
-
Secondary sequence databases
UniGene: historically used for selecting sequences for microarrays.
The TIGR Gene Indices: the gene indices at The Institute for Genomic Research (TIGR) are arranged according to species. The TIGR GI covers 19 animal species, 18 plant species and 7 fungal species. TIGR also includes full information about splice variants in the database.
RefSeq (NCBI Reference Sequence Project): RefSeq aims to collect sequences of genomes, complete chromosomes, genomic regions, mRNAs and other types of RNA.
-
What are genome databases?
Genome databases contain, well, genomic information collected from many sources.
Genome assembly
Gene predictions
Known genes, mRNA, ESTs, proteins
Genetic maps, markers and polymorphisms
Gene expression and phenotypes
Annotations
Interspecies homologues
-
Genomic Database Resource
Ensembl
- http://www.ensembl.org
19 species
UCSC Genome Browser
- http://genome.ucsc.edu/
28 species (Insects!)
NCBI MapViewer
- http://www.ncbi.nlm.nih.gov/mapview/
38 species (Plants, Fungi!)
-
Queries to Ensembl
-
Browse genome
Quick search
-
Quick search results
-
Gene view link
-
Gene View
-
Various Gene Databases are:
SNP Resources
- dbSNP: database of single nucleotide polymorphisms http://www.ncbi.nlm.nih.gov/SNP/
- SNP Consortium database: http://snp.cshl.org/
- rSNP Guide: http://util.bionet/nsc.ru/database
-
SNPs at NCBI
-
Select SNP database
Free text query goes here
-
Accession number for the SNP, and species.
Location of the SNP in the sequence
Links to alternative views
-
Various Gene Databases are:
EST Resources
- ESTs are expressed sequence tags, which are partial sequences of cDNA copies of the mRNA found within a particular cell.
- Information from ESTs can be used to tell the splicing patterns of genes, the occurrence of genes, etc.
dbEST http://www.ncbi.nlm.nih.gov/dbEST/
-
Various Gene Databases are:
Protein Databases
- The central dogma states that DNA is transcribed into RNA, which in turn is translated into protein.
- Since genes code for proteins, it is important to store known information about proteins inside of databases.
- There are many different protein databases, many of them dealing with specific protein families.
Databases for curated proteins include:
InterPro: Protein families and domains http://www.ebi.ac.uk/interpro
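The DNA to RNA to protein flow described above can be sketched in a few lines. This is a minimal illustration using the standard genetic code; the function names and the example sequence are assumptions made for the example, and the input is taken to be the coding (sense) strand starting in frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same sequence with U in place of T."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translate in frame from position 0 and stop at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # raises KeyError on non-ACGT symbols
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGGCTTAA")
print(mrna, translate(mrna))  # AUGGCUUAA MA
```

Protein databases store the output side of this mapping, which is why a protein hit can be traced back to candidate coding regions in a genome.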
-
Various Gene Databases are:
Protein Sequence Motifs (Domains)
- CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eMOTIF: http://motif.stanford.edu/emotif/
- Pfam: http://www.sanger.ac.uk/Software/Pfam/
Structure Databases
-
Various Gene Databases are:
Gene Expression Databases
- Once the location and sequence of genes are known, the next step is to determine their function.
- Various biological experiments can be performed on gene data, including the newer microarray technology which we will cover in class.
- Databases containing the results of these experiments are available.
KEGG http://www.genome.ad.jp/kegg/
Klotho http://www.ibc.wustl.edu/klotho
-
Comparison of Sequence against Sequence Database
The most commonly used programmes for comparing an unknown sequence against the sequences in a database are BLAST and FASTA.
BLAST and FASTA are faster, heuristic approximations of the Smith-Waterman local alignment algorithm.
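For reference, the exact Smith-Waterman local alignment that both programs approximate can be sketched as follows. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary choices for illustration, not values taken from the slides or from either program.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```

The full matrix costs time proportional to the product of the two lengths, which is why database-scale searches rely on the heuristic shortcuts in FASTA and BLAST.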
-
The FASTA algorithm
Developed by Lipman and Pearson (1985)
First program to search sequence databases for gapped local alignments
The best-scoring local region is given as output
It is an approximate heuristic algorithm used to compute sub-optimal pairwise similarity.
http://www-nbrf.georgetown.edu/pirwww/search/fasta.html
-
BLAST
BLASTn
Megablast
Nucleotide query: BLASTx
Protein query: tBLASTn
Nucleotide query: tBLASTx
Conserved domains: RPS-BLAST, CDART
Pairwise BLAST: 2BLAST
-
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
"Local" means it searches and aligns sequence segments, rather than aligning the entire sequence.
It is able to detect relationships among sequences which share only isolated regions of similarity.
Currently, it is the most popular and most accepted sequence analysis tool.
-
Why BLAST?
Identify unknown sequences - The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.
Help gene/protein function and structure prediction - genes with similar sequences tend to share similar functions or structure.
Identify protein family - group related (paralog or ortholog) genes and their proteins into a family.
Prepare sequences for multiple alignments
-
Go to BLAST
-
Go to nucleotide BLAST
-
Blast 1
-
Blast 2
-
Blast limit by taxon
-
Blast results
-
Interpret BLAST results - Distribution
This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.
Query sequence
BLAST hits. Click to access the pairwise alignment.
-
Interpret BLAST results - Description
ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank.
Gene/sequence definition
Bit score: higher is better. Click to access the pairwise alignment.
Expect value: lower is better. It gives the number of hits with at least this score that would be expected by chance.
Links
The description (also called definition) lines are listed below under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.
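The link between bit score and Expect value can be made concrete with the usual Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is a simplification: it uses the raw lengths passed in, whereas BLAST itself applies effective (edge-corrected) lengths, and the function name and numbers are illustrative assumptions.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E value: chance hits expected with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Illustrative numbers only: a 1,024-residue query against a 2**20-residue database.
for bits in (20, 30, 40):
    print(bits, expected_hits(bits, 2 ** 10, 2 ** 20))
```

Each extra bit of score halves the E value, and doubling the database size doubles it, which is why the same alignment can look more or less convincing depending on what was searched.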
-
Interpret BLAST results - Pairwise alignments
Query line: the segment from the query sequence.
Sbjct line: the segment from the hit (subject) sequence.
Middle line: the consensus bases.
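The three-line layout can be reproduced with a short sketch. It assumes the nucleotide-style display in which identical positions are marked with `|` in the middle line; the function names and the example alignment are made up for illustration.

```python
def middle_line(query, subject):
    """Mark identical, non-gap columns of two aligned strings with '|'."""
    if len(query) != len(subject):
        raise ValueError("aligned segments must have the same length")
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query, subject)
    )

def percent_identity(query, subject):
    """Identical columns as a percentage of the alignment length."""
    mid = middle_line(query, subject)
    return 100.0 * mid.count("|") / len(mid) if mid else 0.0

query, subject = "ACGT-A", "ACCTTA"  # hypothetical aligned segments
print("Query  " + query)
print("       " + middle_line(query, subject))
print("Sbjct  " + subject)
print(round(percent_identity(query, subject), 1))
```

Protein alignments use a richer middle line (the matching residue, or `+` for a conservative substitution), which this sketch does not attempt.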
-
SOFTWARE FOR IDENTIFICATION OF GENES
Some computational tools that are most commonly used for gene prediction:
GeneMark
Glimmer M
GRAIL
GenScan
Genebuilder
-
GeneMark
This software was developed by Mark Borodovsky and James McIninch.
It is used for finding prokaryotic genes. The software employs a non-homogeneous Markov model to classify DNA regions into protein-coding and non-coding sequences.
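The idea of a non-homogeneous Markov model can be illustrated with a toy: a first-order chain whose transition probabilities depend on codon position (period 3) for coding DNA, compared against an ordinary first-order chain for non-coding DNA. This is a heavily reduced sketch and not GeneMark's actual implementation; the training strings, the add-one pseudocounts and all names are assumptions made for the example.

```python
import math

ALPHABET = "ACGT"

def train(seqs, period):
    """First-order transition log-probabilities, one table per position mod period."""
    counts = [{p: {n: 1.0 for n in ALPHABET} for p in ALPHABET}
              for _ in range(period)]  # add-one pseudocounts
    for s in seqs:
        for i in range(1, len(s)):
            counts[i % period][s[i - 1]][s[i]] += 1
    return [{p: {n: math.log(c / sum(row.values())) for n, c in row.items()}
             for p, row in table.items()} for table in counts]

def log_likelihood(seq, model):
    """Log-probability of seq, assuming it starts at phase 0 of the model."""
    period = len(model)
    return sum(model[i % period][seq[i - 1]][seq[i]] for i in range(1, len(seq)))

def coding_score(seq, coding_model, noncoding_model):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_likelihood(seq, coding_model) - log_likelihood(seq, noncoding_model)

# Tiny made-up training sets; real models are trained on many known genes.
coding_model = train(["ATGGCCGCTGCCGCTGCC"], period=3)     # 3-periodic (non-homogeneous)
noncoding_model = train(["ATATTTAATATATTAAAT"], period=1)  # homogeneous

print(coding_score("GCCGCTGCC", coding_model, noncoding_model) > 0)
print(coding_score("TATATTAAT", coding_model, noncoding_model) < 0)
```

Sliding such a score along a genome in each reading frame is the basic way a statistical gene finder flags candidate coding regions.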
-
Glimmer
This software was developed by Steven Salzberg et al. at Johns Hopkins University and TIGR.
Glimmer uses interpolated Markov models to identify coding regions and distinguish them from non-coding DNA. Glimmer is used as the primary gene finder tool at TIGR.
-
GRAIL
This software was developed by Ed Uberbacher et al. at Oak Ridge National Laboratory. This tool identifies exons, polyA sites, promoters, CpG islands, repetitive elements and frameshift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements.
-
GenScan
GenScan was developed by Chris Burge and Samuel Karlin at Stanford University. This programme uses a probabilistic model of gene structure that is based on actual biological information about the transcriptional, translational and splicing signals. Its high speed and accuracy make GenScan the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.
-
Genebuilder
Genebuilder performs ab initio gene prediction using numerous parameters, such as GC content, dicodon frequencies, splice site data, CpG islands, repetitive elements and others. It also performs BLAST searches of predicted genes against protein and EST databases.
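Two of the parameters named above, GC content and CpG enrichment, are simple to compute. The sketch below uses the common observed/expected CpG ratio, (number of CG dinucleotides × length) / (number of C × number of G); it is not Genebuilder's own code, and the function names are assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio; 0.0 when the sequence has no C or no G."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

window = "ACGCGT"  # made-up example window
print(gc_content(window), cpg_obs_exp(window))
```

In practice both values are computed over sliding windows of a few hundred bases, and windows that are both GC-rich and CpG-enriched are reported as candidate CpG islands.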
-
Sequence analysis: overview
[Flowchart; box labels as recovered from the slide]
Sequence entry: manual sequence entry; sequence database browsing; sequencing project management
Nucleotide sequence analysis (nucleotide sequence file): search databases for similar sequences; sequence comparison; multiple sequence analysis; design further experiments (restriction mapping, PCR planning); search for known motifs; search for protein coding regions (coding: translate into protein; non-coding: RNA structure prediction)
Protein sequence analysis (protein sequence file): search databases for similar sequences; sequence comparison; search for known motifs; predict secondary structure; predict tertiary structure
Protein family analysis: create a multiple sequence alignment; edit the alignment; format the alignment for publication; molecular phylogeny