ECFG for Gene Identification using DART
Yuri Bendana, Sharon Chao, Karsten Temme
Eukaryotic Gene Structure
Figure 4-14 from Lodish et al., Molecular Cell Biology, 2004.
Adapted from Figure 1.4 in Graur and Li, Fundamentals of Molecular Evolution, 2000.
Evogene
Pedersen and Hein, 2003.
• EHMM = HMM + Evolutionary Tree
• Gene structure model
• Region-specific evolutionary models
• EM and ML estimation of parameters using Baum-Welch and Powell from annotated human/mouse alignments
• Gaps in the MSA are treated as missing data
• MAP estimate of gene structure using Viterbi
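To make the Viterbi step concrete, here is a minimal sketch of MAP decoding over alignment columns. It is not Evogene's implementation: the two states, the transition probabilities, and the per-column emission probabilities are all hypothetical toy values, and in an EHMM each emission term would come from the evolutionary model for that state.

```python
import math

def viterbi(states, log_init, log_trans, log_emit):
    """Return the most probable state path.

    log_emit is a list with one dict per alignment column,
    mapping each state to the log-likelihood of that column.
    """
    score = [{s: log_init[s] + log_emit[0][s] for s in states}]
    back = []
    for col in log_emit[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this column.
            prev = max(states, key=lambda p: score[-1][p] + log_trans[p][s])
            row[s] = score[-1][prev] + log_trans[prev][s] + col[s]
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: two states with "sticky" transitions.
STATES = ["intergenic", "exon"]
LOG_INIT = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    p: {s: math.log(0.8 if p == s else 0.2) for s in STATES} for p in STATES
}

def toy_column(favoured):
    # Assumed per-column likelihoods; a real EHMM computes these on the tree.
    return {s: math.log(0.9 if s == favoured else 0.1) for s in STATES}

COLUMNS = [toy_column(s) for s in
           ["intergenic"] * 2 + ["exon"] * 4 + ["intergenic"] * 2]
PATH = viterbi(STATES, LOG_INIT, LOG_TRANS, COLUMNS)
print(PATH)
```

Working in log space avoids underflow on long alignments, which is why the sketch adds log-probabilities instead of multiplying probabilities.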
Evogene EHMM
• Phase 1 and phase 2 introns model frameshifts: the inner codon is interrupted by the intron
• Alignment column(s) are generated for each state visited
• HKY and Goldman-Yang evolutionary models are used for nucleotides and codons, respectively
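As an illustration of how a state can score an alignment column under an evolutionary model, the sketch below computes the likelihood of one human/mouse column on a two-leaf tree, with a gap treated as missing data. It is only a sketch under stated assumptions: it uses the Jukes-Cantor model as a simpler stand-in for HKY, and the branch lengths and function names are hypothetical.

```python
import math

NUCS = "ACGT"

def jc69_prob(a, b, t):
    """Jukes-Cantor probability of a -> b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def leaf_partial(symbol, t):
    """P(leaf symbol | root state x) for each x.

    A gap is missing data: it is compatible with every root state,
    so it contributes a factor of 1.
    """
    if symbol == "-":
        return {x: 1.0 for x in NUCS}
    return {x: jc69_prob(x, symbol, t) for x in NUCS}

def column_likelihood(human, mouse, t_human=0.1, t_mouse=0.1):
    """Sum over root states of prior * P(human | x) * P(mouse | x)."""
    h = leaf_partial(human, t_human)
    m = leaf_partial(mouse, t_mouse)
    return sum(0.25 * h[x] * m[x] for x in NUCS)

print(column_likelihood("A", "A"))
print(column_likelihood("A", "C"))
print(column_likelihood("A", "-"))
```

With a gap in one sequence, the column likelihood reduces to the marginal probability of the observed nucleotide, which is the practical meaning of "missing data" here.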
Evogene Results
116 human/mouse orthologs used for training and testing
Prediction improves when inputting MSA versus single sequence
DART
DNA, Amino, and RNA Tests [Holmes]
• ECFG = SCFG + Evolutionary Tree
• xgram, xfold, xprot programs
  • xgram - generic grammar
  • xfold - built-in nt grammar
  • xprot - built-in aa grammar
Xgram Workflow
Grammar, MSA + Tree → Xgram → Annotated MSA
Xgram Implementation
• Grammar format
  • Terminal alphabet
  • Markov chains
  • Production rules for nonterminals
    • Null states
    • Bifurcation states
    • Emit states
• EM for estimating parameters for the evolutionary grammar
• MAP for alignment annotations
Xfold Codon Grammar
[State diagram: Start, Null (NUC), Forward (0 1 2), Reverse (2 1 0)]
Null -> NUC Null’
Null’ -> Fwd | Rev | End
Fwd -> POS1 POS2 POS3 Fwd’    Codon: “0” “1” “2”
Rev -> ~POS3 ~POS2 ~POS1 Rev’    Codon: “2” “1” “0”
# Stockholm 1.0
Seq1 ATGGAA….
Seq2 ATGACG….
#=GC Codon 012012….210210
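The Codon annotation line can be read back mechanically. The sketch below shows one way to recover codons from such an annotation; it is not part of DART. It assumes that "012" marks a forward-strand codon, "210" a reverse-strand codon, and that any other character marks a non-coding column (the slide does not say how Null columns are labelled). The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def codon_runs(annotation):
    """Return (start, end, strand) for each complete codon in the annotation."""
    runs = []
    i = 0
    while i + 3 <= len(annotation):
        triple = annotation[i:i + 3]
        if triple == "012":
            runs.append((i, i + 3, "+"))
            i += 3
        elif triple == "210":
            runs.append((i, i + 3, "-"))
            i += 3
        else:
            # Assumed: anything else is a non-coding column.
            i += 1
    return runs

def extract_codons(sequence, annotation):
    """List the codons of one aligned sequence, in alignment order.

    Reverse-strand codons are reverse-complemented so that every codon
    is reported in its reading direction.
    """
    codons = []
    for start, end, strand in codon_runs(annotation):
        triple = sequence[start:end]
        if strand == "-":
            triple = triple.translate(COMPLEMENT)[::-1]
        codons.append(triple)
    return codons

print(extract_codons("ATGGAA", "012012"))
print(extract_codons("TTCCAT", "210210"))
```

A column-wise annotation like this keeps the gene structure attached to the alignment itself, so the same parser works for every sequence in the MSA.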
Codon model extensions
• Adapt to match Evogene model
  • Start/stop translation codons
  • Splicing acceptor/donor sites
  • Frameshift introns
• Extensions to Evogene model
  • 5’ and 3’ UTR
  • Promoter region
Testing methods
• Verify DART performance vs Evogene
  • Mouse/human alignments as training data
• mreB/actin genes for model growth
  • Introns
  • Intergenic