gen ant
TRANSCRIPT
-
8/3/2019 Gen Ant
1/22
5/4/12
Genome Annotation
Din
es
-
www.genomesonline.org
-
protein-coding genes, nonprotein-coding genes
easier to find than other functional elements
why? genes are transcribed, which means that we can identify them by looking at RNA
traditionally this has been done by cDNA or EST sequencing, more recently by microarray, SAGE, MPSS, etc.
-
protein-coding genes, nonprotein-coding genes
we can also find genes ab initio using computational methods
this is most suited to protein-coding genes
why?
protein-coding genes have recognizable features
open reading frames (ORFs)
codon bias
known transcriptional and translational start and stop motifs (promoters, 3' poly-A sites)
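The ORF feature listed above can be sketched in a few lines. This is a minimal illustration only, assuming a forward-strand DNA string, ATG as the only start codon, and the three standard stop codons; the function name `find_orfs` and the `min_codons` threshold are hypothetical, and real gene finders combine ORFs with codon bias and signal motifs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence (made up for illustration): one ORF, ATG AAA TTT TAG, in frame 2
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

A complete scan would also run the same function on the reverse complement to cover the other strand.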
-
ab initio gene discovery
Protein-coding genes have recognizable features
We can design software to scan the genome and identify these features
Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes
It's a lot harder for the higher eukaryotes, where there are a lot of long introns, genes can be found within introns of other genes, etc.
We tend to do OK finding protein-coding regions, but miss a lot of non-coding 5' exons and the like
-
ab initio gene discovery: validating predictions and refining gene models
Standard types of evidence for validation of predictions include:
match to previously annotated cDNA
match to EST from same organism
similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
(translation works better; why?)
protein structure prediction match to a Pfam domain
associated with recognized promoter sequences, e.g., TATA box, CpG island
known phenotype from mutation of the locus
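One way to see why comparing conceptual translations can work better than comparing nucleotides: synonymous codon changes alter the DNA but not the protein. The sketch below uses the standard genetic code and two made-up 12-base sequences; `translate` and `identity` are hypothetical helper names, and real searches use alignment tools rather than position-by-position identity.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Conceptually translate a DNA string in frame 0 ('*' marks a stop)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two hypothetical orthologs that differ only by synonymous changes
dna_a = "ATGGCTCTTAGC"
dna_b = "ATGGCATTGTCA"

print(identity(dna_a, dna_b))                        # 0.5
print(translate(dna_a), translate(dna_b))            # MALS MALS
print(identity(translate(dna_a), translate(dna_b)))  # 1.0
```

Here the nucleotide sequences are only 50% identical, while the translated proteins match exactly.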
-
Finding nonprotein-coding genes
e.g., tRNA, rRNA, snoRNA, miRNA, various otherncRNAs
Harder to find than protein-coding genes
Why?
often not poly-A tailed, so they don't end up in cDNA libraries
no ORF
constraint on sequence divergence is at the nucleotide level, not the protein level, so homology is harder to detect
So, how do we find these?
-
Finding nonprotein-coding genes
secondary structure
homology, especially alignment of related species
experimentally
isolation through non-poly-A-dependent cloning methods
microarrays
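As a rough illustration of the secondary-structure approach, the sketch below counts the maximum number of nested base pairs an RNA can form, using a Nussinov-style dynamic program. It is a simplification chosen here for illustration, not a method named in the slides: it assumes Watson-Crick and G-U wobble pairs, a minimum hairpin loop of 3 bases, and ignores thermodynamics, which real ncRNA finders do not.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of non-crossing base pairs (Nussinov-style recursion).

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence rna[i..j], inclusive
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: base j is left unpaired
            for k in range(i, j - min_loop):  # case: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

# Toy hairpin: GGG stem, AAAU loop, CCC stem
print(max_base_pairs("GGGAAAUCCC"))  # 3
print(max_base_pairs("AAAAAAAA"))    # 0
```

A sequence that can fold into a stable stem-loop scores higher than one that cannot, which is the kind of signal a structure-based scan looks for.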
-
ab initio gene discovery: approaches
Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to learn how to find a pattern.
Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).
-
ab initio gene discovery: HMMs
An example state diagram for an HMM for gene discovery is this simplified version of one used by GENSCAN:
[State diagram. States: begin gene region, 5' UTR, start translation, single exon, initial exon, donor splice site, intron, acceptor splice site, exon, final exon, stop translation, 3' UTR, end gene region]
Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides A, T, G, C (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.
adapted from Gibson and Muse, A Primer of Genome Science
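To make transition and emission probabilities concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is far simpler than the diagram above: the two states and every probability are invented for illustration (a real model would learn them from a training set of known genes), and the "gene" state is simply assumed to be GC-rich.

```python
import math

# All values below are made-up illustration values, not trained parameters.
STATES = ("intergenic", "gene")
START = {"intergenic": 0.5, "gene": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "gene": 0.1},
    "gene": {"intergenic": 0.1, "gene": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "gene": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

def viterbi(seq):
    """Return the most probable state path for seq as a list of state names."""
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATAT" + "GCGCGC" + "ATATAT")
print("".join("g" if s == "gene" else "-" for s in path))  # ------gggggg------
```

The decoder labels the GC-rich stretch as "gene" because six favorable emissions outweigh the cost of two state switches; with these numbers a GC run shorter than four bases would stay "intergenic".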
-
Genome Annotation: much work remains
Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as TFBSs, CRMs, and other non-gene features have so far been identified.
-
The value of genome sequences lies in their annotation
Annotation: characterizing genomic features using computational and experimental methods
Genes: four levels of annotation
Gene prediction: Where are genes?
What do they look like?
Domains: What do the proteins do?
Role: What pathway(s) are they involved in?
-
How many genes?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique
-
Current consensus (in flux)
20,000-25,000 protein-coding genes
19,599 known genes, another 2,188 DNA segments predicted to be protein-coding genes.
-
How do we get from here
-
What are genes? - 1
Complete DNA segments responsible for making functional products
Products
Proteins
Functional RNA molecules
RNAi (interfering RNA)
rRNA (ribosomal RNA)
snRNA (small nuclear)
snoRNA (small nucleolar)
tRNA (transfer RNA)
-
What are genes? - 2
Definition vs. dynamic concept
Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Genes-in-genes
Genes-ad-genes
-
Prokaryotic gene model: ORF-genes
Small genomes, high gene density
Haemophilus influenzae genome: 85% genic
Operons
One transcript, many genes
No introns.
One gene, one protein
Open reading frames
One ORF per gene
ORFs begin with a start codon and end with a stop codon (by definition)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
-
Eukaryotic gene model: spliced genes
Posttranscriptional modification
5' cap, poly-A tail, splicing
Open reading frames
Mature mRNA contains ORF
All internal exons contain open read-through
Pre-start and post-stop sequences are UTRs
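The spliced-gene model can be sketched as: join the exons, then read the ORF from the mature mRNA, with whatever lies outside it being UTR. The sequences and exon coordinates below are made up, the helper names `splice` and `split_mrna` are hypothetical, and the sketch assumes the first ATG is the true start codon, which is not always the case.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into a mature mRNA."""
    return "".join(genomic[start:end] for start, end in exons)

def split_mrna(mrna):
    """Split an mRNA into (5' UTR, ORF, 3' UTR) using the first ATG.

    Returns None if there is no ATG or no in-frame stop codon after it.
    """
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[:start], mrna[start:i + 3], mrna[i + 3:]
    return None

# Toy gene: exon 1, an intron (GT...AG), exon 2
exon1 = "GGATGGCT"       # 5' UTR "GG" followed by ATG GCT
intron = "GTAAGTTTTCAG"  # removed by splicing
exon2 = "CTTTAGCC"       # CTT TAG followed by 3' UTR "CC"
genomic = exon1 + intron + exon2

mrna = splice(genomic, [(0, 8), (20, 28)])
print(mrna)              # GGATGGCTCTTTAGCC
print(split_mrna(mrna))  # ('GG', 'ATGGCTCTTTAG', 'CC')
```

The ORF only becomes contiguous after the intron is removed, which is why ORF scanning alone works for prokaryotes but not for spliced eukaryotic genes.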
-
Expansions and Clarifications
ORFs
Start, triplets, stop
Prokaryotes: gene = ORF
Eukaryotes: spliced genes or ORF genes
Exons
Remain after introns have been removed
Flanking parts contain non-coding sequence (5'- and 3'-UTRs)