preditor predictive rna editor for plant mitochondrial genes jeff mower
TRANSCRIPT
What is RNA Editing?– A process that alters the RNA sequence
– Nt insertion, deletion, or conversion
– Does not include RNA maturation processes
RNA Editing in Plants– Occurs in mitochondria and chloroplasts
– C to U and U to C conversions
– Mechanism is not known
RNA Editing in Plants– In seed plants (conifers, flowering plants, etc.)
• Widespread in mitochondrion• Rare in chloroplast• Predominantly C to U
– In non-seed plants (mosses, ferns, etc.)• Frequent in mitochondrion and chloroplast• Both C to U and U to C are common
AUG UGAAGACGGUC CAAAAUCGU UCU UGCGGCGUA
M R N S V G C Q *
UGGC
AUGAUG CAAAAUCGU UCU UGCGGCGUA
M V M R N S V G C Q *
GUC
Creation of new start codon
AG UGA UGGC
AUG UGAAGACGGUC CAAAAUCGU UCU UGCGGCGUA
M R N S V G C Q *
UGGC
AUGAUG CAAAAUUGU UUU UGCGGCGUA
M V M C N F V G C Q *
GUC
Alteration ofprotein sequence
Creation of new start codon
AG UGA UGGC
AUG UGAAGACGGUC CAAAAUCGU UCU UGCGGCGUA
M R N S V G C Q *
UGGC
AUGAUG CAAAAUUGU UUU UGCGGUGUA
M V M C N F V G C Q *
GUC
Alteration ofprotein sequence
Creation of new start codon
AG UGA UGGC
No effect onprotein sequence
AUG UGAAGACGGUC CAAAAUCGU UCU UGCGGCGUA
M R N S V G C Q *
UGGC
AUG UGAUGGCAUG UAAAAUUGU UUU UGCGGUGUA
M V M C N F V G C *
GUC
Alteration ofprotein sequence
Creation of new stop codon
Creation of new start codon
AG
No effect onprotein sequence
Identifying Edit Sites1. Determine experimentally
• Need to isolate and reverse transcribe RNA• Need multiple reads (editing is not always complete)
Identifying Edit Sites1. Determine experimentally
• Need to isolate and reverse transcribe RNA• Need multiple reads (editing is not always complete)
2. Predict based on sequence context• Upstream and downstream regions are important• Unambiguous motifs have not been identified
Identifying Edit Sites1. Determine experimentally
• Need to isolate and reverse transcribe RNA• Need multiple reads (editing is not always complete)
2. Predict based on sequence context• Upstream and downstream regions are important• Unambiguous motifs have not been identified
3. Predict based on protein conservation• Proteins are more conserved after editing • Editing tends to “correct” amino acid sequences
PREditor Methodology – Aligned sequence database (ASD) Construction
• 363 DNA sequences RNA editing information is known Organisms do not perform RNA editing
• Proteins were translated using the editing information
• Homologous proteins were aligned 42 different alignments All known mt proteins are covered
PREditor Methodology – Input Sequence Manipulation
• Accept a protein-coding DNA sequence as input• Translate input sequence• Align translation to homologous proteins in ASD
PREditor Methodology – The Underlying Principle
• ASD sequences translated from edited RNA• Input sequence translated from unedited DNA• “Where can RNA editing in the input sequence increase
conservation to the ASD sequences?”
ASD1 A T L G
ASD2 E M L G
ASD3 A M L G
Input E (GAA) T (ACG) P (CCU) G (GGC)
Unedited E (GAA) 0.33
Edited N/A N/A
ASD1 A T L G
ASD2 E M L G
ASD3 A M L G
Input E (GAA) T (ACG) P (CCU) G (GGC)
Unedited E (GAA) 0.33 T (ACG) 0.33
Edited N/A N/A M (AUG) 0.67
ASD1 A T L G
ASD2 E M L G
ASD3 A M L G
Input E (GAA) T (ACG) P (CCU) G (GGC)
Unedited E (GAA) 0.33 T (ACG) 0.33 P (CCU) 0.0
Edited N/A N/A M (AUG) 0.67 S (UCU) 0.0
L (CUU) 1.0
F (UUU) 0.0
ASD1 A T L G
ASD2 E M L G
ASD3 A M L G
Input E (GAA) T (ACG) P (CCU) G (GGC)
Unedited E (GAA) 0.33 T (ACG) 0.33 P (CCU) 0.0 G (GGC) 1.0
Edited N/A N/A M (AUG) 0.67 S (UCU) 0.0 G (GGU) 1.0
L (CUU) 1.0
F (UUU) 0.0
Performance Analysis – Remove one protein sequence from the database
– Use the unedited DNA sequence as input
– Calculate statistics • Accuracy = (TP + TN) / (TP + FP + TN + FN)• Sensitivity = TP / (TP + FN)• Specificity = TN / (TN + FP)
– Repeat for each sequence
Performance Analysis– Total # C’s = 58,982
– True edited sites = 3,548 (6.0%)• TP = 2,922• FN = 626
– True non-edited sites = 55,434• TN = 54,829• FP = 605
Performance Analysis – Sensitivity = 82.4%
• Proportion of true edited sites that were predicted correctly• Increases to 94.6% if you ignore missed silent edited sites
– Specificity = 98.9%• Proportion of true non-edited sites that were predicted correctly
– Accuracy = 97.9%• Proportion of all sites that were predicted correctly• Increases to 98.7% if you ignore missed silent edited sites
Limitations– Cannot predict editing at silent sites
• 458 of 626 FN are at silent sites
– Not a major problem in practice• Silent editing sites do not affect the protein sequence • Many silent sites are only occasionally edited• Only 13% of editing sites are silent (expect ~38%)
Limitations– Aligned sequence database is skewed
Origin of
RNA editing
Angiosperms 266 (74%)
Gymnosperms 2 (1%)
Ferns 0
Horsetails 0
Hornworts 0
Mosses 0
Liverworts 32 (9%)
Charophytes 59 (16%)
Chlorophytes ―
Ongoing Work– Reducing the skewed database effect
• Weighted sequences and phylogenetics
ASD1 Z
ASD2 Z
ASD3 Z
ASD4 Z
ASD4 X
Input X!
Ongoing Work – Making the online resource more appealing
– Making the online resource more user-friendly
Future Directions – Increase diversity in the ASD
– Analyze sequence context using the ASD
– Apply methodology to editing in chloroplasts
– Apply methodology to U to C editing