Survey of Misannotations and Pseudogenes in the Arabidopsis Genome

Tanmay Prakash

Objectives

Why
• Misannotation can hinder research
• Pseudogenes can be used to study natural selection

Objectives
• Find possible misannotations
• Find possible pseudogenes

Many misannotations arise when a gene prediction program labels a region as an intron because it contains a stop codon.

Misannotations

[Figure: gene model with the segments UTR, CDS, Intron, CDS, UTR]

Pseudogenes are DNA sequences that no longer function but resemble the functional genes they once were. There are two types:
• Processed
• Non-processed

Common Properties of Pseudogenes
• Stop codons
• Frameshift mutations
• Lack of selective pressure

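Example: a single-base deletion shifts the reading frame and introduces stop codons (shown as dots)

Original DNA: agtacatgcataggactcgatcgactc
Translation: STCIGLDRL

After deletion: agtacatgataggactcgatcgactc
Translation: ST..DSID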
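The two example sequences above can be checked with a short translation script. This is only a minimal sketch: it assumes the standard genetic code and reading frame 0, and the names `CODON_TABLE` and `translate` are illustrative rather than part of the original work. Stop codons are written as `*` here instead of the dots used on the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0; '*' marks a stop codon.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "agtacatgcataggactcgatcgactc"
mutated = "agtacatgataggactcgatcgactc"  # one 'c' deleted

print(translate(original))  # STCIGLDRL
print(translate(mutated))   # ST**DSID
```

Deleting one base changes every codon downstream of the deletion, which is why two stop codons appear immediately after the first two residues.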

Pseudogenes

[Figure: pipeline flowchart. Nodes, in order: query protein domains; subject Arabidopsis introns; BLAST search; HMMER search; query protein domains; subject Arabidopsis; genes matching in introns; genes matching in CDS; genes matching in both; possibly misannotated; check for stop codons and frameshifts; check Ka/Ks; possible pseudogenes]

Pipeline

[Figure: misannotation branch of the pipeline flowchart. Nodes, in order: query protein domains; subject Arabidopsis introns; BLAST search; HMMER search; query protein domains; subject Arabidopsis; genes matching in introns; genes matching in exons; genes matching in both; possibly misannotated]
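The "genes matching in both" step of the pipeline amounts to a set intersection. The sketch below assumes the search hits have already been parsed into dictionaries mapping a gene to the set of domains that matched it. The function name, the input layout, and the placeholder gene names are hypothetical and are not taken from the original pipeline.

```python
def possibly_misannotated(intron_hits, cds_hits):
    """Return genes where the same domain matches both an intron and the CDS.

    intron_hits, cds_hits: dict mapping gene ID -> set of domain IDs.
    Returns a dict mapping gene ID -> set of shared domain IDs.
    """
    result = {}
    # Only genes that have hits in both searches can qualify.
    for gene in intron_hits.keys() & cds_hits.keys():
        shared = intron_hits[gene] & cds_hits[gene]
        if shared:
            result[gene] = shared
    return result


# Placeholder inputs for illustration only.
intron_hits = {"geneA": {"PF01657"}, "geneB": {"PF02902"}}
cds_hits = {"geneA": {"PF01657", "PF07734"}, "geneB": {"PF06721"}}
print(possibly_misannotated(intron_hits, cds_hits))
```

Requiring the same domain on both sides, rather than any hit at all, matches the wording of the results, which count genes with matches to the same domain in introns and exons.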

Results

There were 346 genes (alternative gene models not counted separately) that had matches to the same domain in both introns and exons.

There were 299 genes (alternative gene models not counted separately) that had matches to the same domain in an intron and its flanking exons. These are the most likely misannotations.
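The stricter criterion, a domain matching an intron and its flanking exons, can be sketched as follows. This assumes that "flanking exons" means both neighbouring exons, and that the hits for one gene are given as `(domain, kind, index)` tuples in which intron k lies between exon k and exon k+1. Both the interpretation and the data layout are assumptions made for illustration.

```python
def flanked_intron_hits(hits):
    """Find (domain, intron index) pairs supported by both flanking exons.

    hits: iterable of (domain, kind, index) for a single gene, where kind is
    "exon" or "intron" and intron k lies between exon k and exon k + 1.
    """
    hits = list(hits)
    exon_hits = {(domain, idx) for domain, kind, idx in hits if kind == "exon"}
    found = set()
    for domain, kind, idx in hits:
        if kind != "intron":
            continue
        # Same domain must match the exon before and the exon after the intron.
        if (domain, idx) in exon_hits and (domain, idx + 1) in exon_hits:
            found.add((domain, idx))
    return found


# Placeholder hits for one hypothetical gene.
example = [
    ("PF01657", "exon", 1),
    ("PF01657", "intron", 1),
    ("PF01657", "exon", 2),
    ("PF02902", "intron", 2),
    ("PF02902", "exon", 2),
]
print(flanked_intron_hits(example))  # {('PF01657', 1)}
```

A gene would then count toward the stricter total if this function returns a non-empty set for it.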

Domain       Possible Misannotations   #Domains
PF01657.7    16                        76
PF02902.8    15                        32
PF06721.1    13                        3
PF07734.2    15                        113

The four domains with the most possible misannotations

[Figure: Domain Family Size vs Misannotations. x-axis: Number of Domains in Family (0 to 3000)]

[Figure: Misannotation Frequency. x-axis: Number of Genes Matching Domain (0 to 10000)]

[Figure: Domain Gene Frequency. x-axis: Number of Genes Matching Domain (0 to 10000)]

Future Research

• Identify pseudogenes by looking for stop codons and frameshift mutations in the introns and by checking the Ka/Ks value
• Use a more recent database of domains
• Follow the same process for the rice genome
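As a rough illustration of the first item, the sketch below counts stop codons in each of the three forward reading frames of a sequence. It is not the planned method: the function name is hypothetical, only the forward strand is scanned, and the Ka/Ks check is not covered.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def stop_codons_by_frame(dna):
    """Count stop codons in forward reading frames 0, 1 and 2."""
    dna = dna.lower()
    counts = {}
    for frame in range(3):
        # Step through complete codons only, starting at the frame offset.
        counts[frame] = sum(
            dna[i:i + 3] in STOP_CODONS
            for i in range(frame, len(dna) - 2, 3)
        )
    return counts


# The frameshifted example sequence from the pseudogene slide.
print(stop_codons_by_frame("agtacatgataggactcgatcgactc"))  # {0: 2, 1: 0, 2: 0}
```

A region with stop codons in the frame of the matched domain, but a clean alternative frame nearby, would be one hint of a frameshift rather than a true intron.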

Acknowledgement

• Dr. Shin-Han Shiu
• Dr. Kosuke Hanada
• Dr. Melissa Lehti-Shiu
• Dr. Gail Richmond
• HSHSP
