ECFG for Gene Identification using DART Yuri Bendana, Sharon Chao, Karsten Temme

Post on 20-Jan-2018

Page 1

ECFG for Gene Identification using DART

Yuri Bendana, Sharon Chao, Karsten Temme

Page 2

Eukaryotic Gene Structure

Figure 4-14 from Lodish et al., Molecular Cell Biology, 2004.

Adapted from Figure 1.4 in Graur and Li, Fundamentals of Molecular Evolution, 2000.

Page 3

Evogene

Pedersen and Hein, 2003.

EHMM = HMM + Evolutionary Tree

• Gene structure model
• Region-specific evolutionary models
• EM and ML estimation of parameters using Baum-Welch and Powell from annotated human/mouse alignments.
• Gaps in the MSA are treated as missing data
• MAP estimate of gene structure using Viterbi.
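
The Viterbi step named on this slide can be illustrated with a minimal sketch. It is not Evogene's model: the two states ("intergenic", "coding"), the single-sequence emissions, and every probability below are hypothetical toy values chosen only to show how a MAP state path is recovered by dynamic programming.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (MAP state path, its log-probability) for a non-empty sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}


# Hypothetical toy parameters (not taken from Evogene or DART)
STATES = ("intergenic", "coding")
LOG_START = logs({"intergenic": 0.9, "coding": 0.1})
LOG_TRANS = {
    "intergenic": logs({"intergenic": 0.9, "coding": 0.1}),
    "coding": logs({"intergenic": 0.1, "coding": 0.9}),
}
LOG_EMIT = {
    "intergenic": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
    "coding": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
}

path, score = viterbi("ATATGCGCGC", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path)
```

In an EHMM the emission term would instead be the likelihood of a whole alignment column under the state's evolutionary model; the recursion itself is unchanged.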

Page 4

Evogene EHMM

• Phase 1 and 2 introns model frameshift: inner codon interrupted by the intron
• Alignment column(s) are generated for each state visited
• HKY/Goldman-Yang evolutionary models used for nt/codons.
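
As a hedged illustration of the nucleotide model named above, the sketch below builds an HKY rate matrix from a transition/transversion ratio and equilibrium frequencies. The function name, the value of kappa, and the frequencies are assumptions for illustration, not parameters from Evogene; the rescaling to one expected substitution per unit time is a common convention, not something the slide states.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)


def hky_rate_matrix(kappa, pi):
    """Return Q as nested dicts: q[a][b] is the rate of substituting a by b."""
    q = {a: {} for a in NUCS}
    for a in NUCS:
        for b in NUCS:
            if a != b:
                # Transitions are scaled by kappa; all rates are weighted by
                # the equilibrium frequency of the target nucleotide
                q[a][b] = (kappa if is_transition(a, b) else 1.0) * pi[b]
        # Diagonal makes each row sum to zero
        q[a][a] = -sum(q[a].values())
    # Rescale so the expected rate at equilibrium is 1 substitution per unit time
    mean_rate = -sum(pi[a] * q[a][a] for a in NUCS)
    return {a: {b: q[a][b] / mean_rate for b in NUCS} for a in NUCS}


# Hypothetical parameter values
KAPPA = 2.0
PI = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = hky_rate_matrix(KAPPA, PI)
```

Substitution probabilities along a branch of length t would then come from the matrix exponential of Q·t, which is omitted here to keep the sketch free of external libraries.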

Page 5

Evogene Results

116 human/mouse orthologs used for training and testing

Prediction improves when inputting MSA versus single sequence

Page 6

DART

DNA, Amino, and RNA Tests

[Holmes]

ECFG = SCFG + Evolutionary Tree

xgram, xfold, xprot programs

• xgram - generic grammar
• xfold - built-in nt grammar
• xprot - built-in aa grammar

Page 7

Xgram Workflow

Grammar + (MSA + Tree) -> Xgram -> Annotated MSA

Page 8

Xgram Implementation

Grammar Format

• Terminal alphabet
• Markov chains
• Production rules for nonterminals
  • Null states
  • Bifurcation states
  • Emit states

EM for estimating parameters for the evolutionary grammar

MAP for alignment annotations

Page 9

Xfold Codon Grammar

Null -> NUC Null’
Null’ -> Fwd | Rev | End
Fwd -> POS1 POS2 POS3 Fwd’    Codon: “0” “1” “2”
Rev -> ~POS3 ~POS2 ~POS1 Rev’    Codon: “2” “1” “0”

# STOCKHOLM 1.0
Seq1 ATGGAA….
Seq2 ATGACG….
#=GC Codon 012012….210210

[State diagram: Start -> Null (emits NUC); Forward (codon positions 0 1 2); Reverse (codon positions 2 1 0)]
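
To make the `#=GC Codon` annotation concrete, here is a minimal sketch that reads such a track and pulls out the annotated codons, reverse-complementing those labeled 2-1-0. It is not part of DART: the parser handles only the few line types shown on the slide, the toy alignment and its sequence names are hypothetical, and using "." for non-codon columns is an assumption.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical toy alignment; "." marks non-codon columns (an assumption)
TOY = """# STOCKHOLM 1.0
toyA ATGGAAGCTTACAT
toyB ATGACGGCTTGCAT
#=GC Codon 012012..210210
//
"""


def parse_stockholm(text):
    """Return ({name: sequence}, codon_track) from a minimal Stockholm block."""
    seqs, codon_track = {}, ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GC Codon"):
            codon_track += line.split()[2]
        elif not line.startswith("#"):
            name, residues = line.split()
            seqs[name] = seqs.get(name, "") + residues
    return seqs, codon_track


def codons(seq, codon_track):
    """List (column, strand, codon) for each annotated triplet."""
    found = []
    for m in re.finditer(r"012|210", codon_track):
        triplet = seq[m.start():m.end()]
        strand = "+"
        if m.group() == "210":
            # Reverse-strand codon: read the reverse complement
            triplet = triplet.translate(COMPLEMENT)[::-1]
            strand = "-"
        found.append((m.start(), strand, triplet))
    return found


seqs, track = parse_stockholm(TOY)
for name, seq in seqs.items():
    print(name, codons(seq, track))
```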

Page 10

Codon model extensions

Adapt to match Evogene model

• Start/stop translation codons
• Splicing acceptor/donor sites
• Frameshift introns

Extensions to Evogene model

• 5’ and 3’ UTR
• Promoter region
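
The signals listed under "Adapt to match Evogene model" can be sketched as simple checks on an annotated gene. This is an illustration, not the proposed grammar: the function name, the toy sequence, and the exon coordinates are hypothetical, and the ATG start, TAA/TAG/TGA stops, and GT...AG intron boundaries are the canonical signals, assumed here because the slide does not spell them out. Intron phase is the number of coding bases before the intron, modulo 3, so phases 1 and 2 split a codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene(seq, exons):
    """Check canonical signals for a forward-strand gene.

    exons: sorted list of (start, end) half-open coordinates into seq.
    """
    cds = "".join(seq[start:end] for start, end in exons)
    report = {
        "start_codon": cds[:3] == "ATG",
        "stop_codon": cds[-3:] in STOP_CODONS,
        "in_frame": len(cds) % 3 == 0,
        "introns": [],
    }
    coding_so_far = 0
    for (start1, end1), (start2, _end2) in zip(exons, exons[1:]):
        coding_so_far += end1 - start1
        intron = seq[end1:start2]
        report["introns"].append({
            "donor_GT": intron[:2] == "GT",
            "acceptor_AG": intron[-2:] == "AG",
            # 0: between codons; 1 or 2: intron interrupts a codon
            "phase": coding_so_far % 3,
        })
    return report


# Hypothetical toy gene: exon "ATGGC", intron "GTAAGTTTCAG", exon "TTGA"
SEQ = "ATGGC" + "GTAAGTTTCAG" + "TTGA"
EXONS = [(0, 5), (16, 20)]
report = check_gene(SEQ, EXONS)
print(report)
```

Here the spliced coding sequence is ATG GCT TGA, and the intron falls after the second base of the codon GCT, giving phase 2.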

Page 11

Testing methods

Verify DART performance vs Evogene

• Mouse/human alignments as training data

mreB/actin genes for model growth

• Introns
• Intergenic