survey of misannotations and pseudogenes in the arabidopsis genome
Post on 24-Jan-2016
Survey of Misannotations and Pseudogenes in the Arabidopsis Genome
Tanmay Prakash
Objectives
Why
• Misannotation can hinder research
• Pseudogenes can be used to study natural selection
Objectives
• Find possible misannotations
• Find possible pseudogenes
Many misannotations arise when a gene prediction program labels a stretch of sequence as an intron because it contains a stop codon.
Misannotations
[Figure: gene model with the segments UTR, CDS, Intron, CDS, UTR]
Pseudogenes are DNA sequences that no longer function but resemble the functional genes they once were. There are two types:
• Processed
• Non-processed
Common Properties of Pseudogenes
• Stop codons
• Frameshift mutations
• Lack of selective pressure
agtacatgcataggactcgatcgactc
agtacatgataggactcgatcgactc
STCIGLDRL
ST..DSID
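The two DNA sequences above differ by one deleted base, and the two peptide strings are their translations, with "." marking a stop codon. The sketch below reproduces that translation with the standard genetic code. The names `translate`, `intact`, and `mutated` are illustrative and do not come from the slides.

```python
from itertools import product

# Standard genetic code in the usual compact layout (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, stop_symbol="."):
    """Translate DNA in reading frame 0; a trailing partial codon is ignored."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        protein.append(stop_symbol if aa == "*" else aa)
    return "".join(protein)


intact = "agtacatgcataggactcgatcgactc"   # first sequence on the slide
mutated = "agtacatgataggactcgatcgactc"   # same sequence with one base deleted

print(translate(intact))   # STCIGLDRL
print(translate(mutated))  # ST..DSID
```

The single-base deletion shifts the reading frame, which introduces two stop codons and changes every downstream residue.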
Pseudogenes
[Flowchart; boxes listed in reading order]
• Query: protein domains; subject: Arabidopsis introns
• BLAST search
• HMMER search
• Query: protein domains; subject: Arabidopsis CDS
• Genes matching in introns
• Genes matching in CDS
• Genes matching in both
• Possibly misannotated genes
• Check for stop codons and frameshifts
• Check Ka/Ks
• Possible pseudogenes
Pipeline
[Flowchart, misannotation branch; boxes listed in reading order]
• Query: protein domains; subject: Arabidopsis introns
• BLAST search
• HMMER search
• Query: protein domains; subject: Arabidopsis CDS
• Genes matching in introns
• Genes matching in exons
• Genes matching in both
• Possibly misannotated genes
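The "genes matching in both" step can be sketched as a set intersection. This is a minimal illustration, not the code used in the project. It assumes each search result has already been reduced to (gene model ID, domain) pairs, and that alternative models of one gene share the text before the first "." in the ID, as in Arabidopsis model names. The function names and the sample hits are hypothetical.

```python
from collections import defaultdict


def gene_id(model_id):
    """Collapse a gene model ID such as 'GENE_A.2' to its gene, 'GENE_A'."""
    return model_id.split(".")[0]


def group_by_gene(hits):
    """Map each gene to the set of domains hit in any of its models."""
    table = defaultdict(set)
    for model_id, domain in hits:
        table[gene_id(model_id)].add(domain)
    return table


def genes_matching_in_both(intron_hits, exon_hits):
    """Return {gene: domains} for domains hit in both introns and exons."""
    introns = group_by_gene(intron_hits)
    exons = group_by_gene(exon_hits)
    shared = {}
    for gene in introns.keys() & exons.keys():
        common = introns[gene] & exons[gene]
        if common:
            shared[gene] = common
    return shared


# Made-up hits, for illustration only.
intron_hits = [("GENE_A.1", "domX"), ("GENE_A.2", "domX"), ("GENE_B.1", "domY")]
exon_hits = [("GENE_A.1", "domX"), ("GENE_B.1", "domZ")]

print(genes_matching_in_both(intron_hits, exon_hits))  # {'GENE_A': {'domX'}}
```

Grouping by gene before intersecting means two models of the same gene are counted once.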
Results
346 genes (alternative gene models not counted separately) had matches to the same domain in both introns and exons.
299 genes (alternative gene models not counted separately) had matches to the same domain in an intron and its flanking exons. These are most likely misannotations.
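The flanking-exon criterion can be sketched as follows. The slides do not say whether one or both flanking exons must match, so this sketch assumes a match in either flanking exon is enough. The data layout (one set of matched domains per exon and per intron, in 5' to 3' order) and the function name are hypothetical.

```python
def flanked_intron_hits(exon_domains, intron_domains):
    """Find intron hits whose domain also matches a flanking exon.

    exon_domains[i]   -- set of domains matched in exon i
    intron_domains[i] -- set of domains matched in the intron that lies
                         between exon i and exon i + 1
    Returns a list of (intron_index, domain) pairs.
    """
    if len(exon_domains) != len(intron_domains) + 1:
        raise ValueError("a gene with n introns must have n + 1 exons")
    flagged = []
    for i, domains in enumerate(intron_domains):
        # Assumption: a match in either neighbouring exon counts.
        flanking = exon_domains[i] | exon_domains[i + 1]
        for domain in sorted(domains & flanking):
            flagged.append((i, domain))
    return flagged


# Made-up example: three exons, two introns.
exons = [{"domX"}, set(), {"domY"}]
introns = [{"domX"}, {"domX"}]
print(flanked_intron_hits(exons, introns))  # [(0, 'domX')]
```

Only the first intron is flagged, because the second intron's domain matches neither of its neighbouring exons.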
Domain       Possible Misannotations   # Domains
PF01657.7    16                        76
PF02902.8    15                        32
PF06721.1    13                        3
PF07734.2    15                        113
The four domains with the most possible misannotations
Domain Family Size vs Misannotations
[Plot: x-axis, Number of Domains in Family (0 to 3000); y-axis, Number of Misannotations (0 to 16)]
Misannotation Frequency
[Plot: x-axis, Number of Genes Matching Domain (0 to 10000); y-axis, Percentage Misannotation (0 to 0.6)]
Domain Gene Frequency
[Plot: x-axis, Number of Genes Matching Domain (0 to 10000); y-axis, Number of Misannotations (0 to 20)]
Future Research
• Identify pseudogenes by looking for stop codons and frameshift mutations in the introns and checking the Ka/Ks value
• Use a more recent database of domains
• Follow the same process for the rice genome
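Ka/Ks is the ratio of nonsynonymous substitutions per nonsynonymous site (Ka) to synonymous substitutions per synonymous site (Ks). A ratio near 1 suggests relaxed selective pressure, as expected for a pseudogene, while a ratio well below 1 suggests purifying selection. The sketch below assumes the site and difference counts have already been obtained from a codon alignment. It applies the Jukes-Cantor correction, which is one common choice and not necessarily the method intended in the slides. The function and parameter names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple substitutions."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    if p == 0:
        return 0.0
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return Ka/Ks from precomputed difference and site counts."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("Ks is zero, so Ka/Ks is undefined")
    return ka / ks


# Made-up counts: equal substitution rates at both kinds of site.
print(ka_ks(10, 100, 5, 50))  # 1.0
```

In practice the counts would come from a dedicated tool or a full codon-based method; this only shows how the final ratio is formed.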
Acknowledgement
Dr. Shin-Han Shiu
Dr. Kosuke Hanada
Dr. Melissa Lehti-Shiu
Dr. Gail Richmond
HSHSP