UMR 1095 - ASP
Structural & Comparative Genomics in Bread Wheat
TriAnnotPipeline
A LifeGrid Project based on AUVERGRID
F. Giacomoni, M. Reichstadt, P. Leroy
Génétique, Diversité & Ecophysiologie des Céréales - Clermont-Ferrand, France
3rd EGEE User Forum
February 12th, 2008
Wheat as a challenge for Genomics
• Important Economic Crop
• Large Genome size
[Figure: genome size comparison]
Bread wheat: 17,000 Mb
Barley: 4,800 Mb
Human: ~3,000 Mb
Maize: 2,800 Mb
Rice: 380 Mb
A. thaliana: 140 Mb
Repeat sequences (values shown on the figure): 85%, 70-80%, 50-80%, 50%, 10%
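To put the genome sizes on this slide in proportion, the ratios can be worked out directly. The sketch below is only illustrative arithmetic: the pairing of each size with a species is an assumption read back from the flattened figure, and the names `GENOME_SIZES_MB` and `size_ratio` are hypothetical.

```python
# Genome sizes in megabases (Mb) as read from the slide.
# Assumption: the species/size pairing is reconstructed from the flattened figure.
GENOME_SIZES_MB = {
    "Bread wheat": 17_000,
    "Barley": 4_800,
    "Human": 3_000,
    "Maize": 2_800,
    "Rice": 380,
    "A. thaliana": 140,
}


def size_ratio(species: str, reference: str) -> float:
    """Return how many times larger the `species` genome is than `reference`."""
    return GENOME_SIZES_MB[species] / GENOME_SIZES_MB[reference]


if __name__ == "__main__":
    for reference in ("Human", "Rice", "A. thaliana"):
        ratio = size_ratio("Bread wheat", reference)
        print(f"Bread wheat vs {reference}: {ratio:.1f}x")
```

With these figures, the bread wheat genome is between five and six times the size of the human genome, which is the scale problem the annotation pipeline has to cope with.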
I.N.R.A. Work on the Wheat Genome
• Sequencing
• Annotating
  • Discover genes
  • Find transposable elements
  • Study other biological components
AAAATCGATATAGAGTATGTAGACAAATTTTAAACCCGGGGGAGAGAGAGA DNA sequence
Results after Annotation of the DNA Sequence
[Figure: annotation tracks from EuGene, GeneMarkHMM and GeneID]
General Pipeline Structure of TriAnnot
[Diagram: general structure of TriAnnot]
• TriAnnot Pipeline (GRID)
• DataBase (chado) & Viewers (GBrowse): http://urgi.versailles.inra.fr/projects/TriAnnot/
• Input: DNA sequences, DataBanks
• Training data sets from manual curation: Genes (TriSet, GeneFarm), TEs (TREPcons, REPET)
• Programs: RepeatMasker, est2genome, Gmap, BLAST, HMMPfam
• WEB / Pipeline: Production and Development
• User access (login/password): GBrowse, UpLoad, DownLoad as gff (ARTEMIS) or gameXml (APOLLO)
• Manual curation with APOLLO: On Line (GnpDB) and Local (GnpGenome), exchanged as GFF
TriAnnotPipelineGRID Architecture
GRID & Cluster
[Diagram: annotation workflow in three panels]
Input: BAC sequence (FASTA format)
Panel 1: Transposable Elements & repeats (Annotation and Masking)
  Block1a, Block1b: RepeatMasker (TREPnr, TREPtotal, RepBase), BLASTx / TREPprot, TRF, SSR
  Output: BAC with masked TEs
Panel 2: Gene annotation
  Gene Structure
    Block2: ab initio prediction (GeneMarkHMM, GeneID, EuGene, GENSCAN, GeneZilla)
    Block3a, Block3b: BLASTx (SwissProt / TrEMBL); BLAST/Gmap with transcripts (FL-cDNA, EST, mRNA)
    Block3c: Gene Model: EVM + PASA (US), RAP-like (Japan), EUGENE (France)
  Gene Function (IWGSC annotation guideline)
    Block4: Best Hit proteins (At, Os)
    Categories: Known Protein, Putative Protein, Domain Containing Protein, Expressed Gene, Conserved Hypothetical Gene, Hypothetical Gene
  Output: BAC with masked TEs & Genes
Panel 3: Other biological target searches
  Block5a, Block5b, Block5c, Block5d: BLASTn (UGset / IRGSP/TIGR pseudo); nt, sts, htgs, gss; tRNA; miRNA; mtDNA, cpDNA; …
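Panel 1 hands a BAC sequence with masked TEs to the gene annotation panel. The sketch below illustrates what masking means for the sequence itself. It is not how RepeatMasker is implemented; the function name `mask_intervals` and the example coordinates are hypothetical, and intervals are assumed to be 1-based and inclusive, as in GFF.

```python
def mask_intervals(sequence: str,
                   intervals: list[tuple[int, int]],
                   mask_char: str = "N") -> str:
    """Replace every 1-based, inclusive (start, end) interval with mask_char.

    Overlapping intervals are allowed; the sequence length is unchanged,
    so coordinates of later predictions stay valid.
    """
    masked = list(sequence)
    for start, end in intervals:
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"invalid interval: {start}-{end}")
        for i in range(start - 1, end):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical repeat hit at positions 5-10 of a short example sequence.
    bac = "AAAATCGATATAGAGTATGTAGAC"
    print(mask_intervals(bac, [(5, 10)]))
```

Keeping the sequence length constant is the design point: gene predictors run on the masked BAC report positions that still match the original FASTA.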
TriAnnotPipelineGRID Detailed Architecture
WEB INTERFACE PART with:
  * Upload of the BAC sequence in FASTA format
  * Setting of the annotation parameters for the 5 blocks
  * Production of a step.xml
PIPELINE PART (input: wheat sequence):
STEP_0: * 3 RepeatMasker vs 3 DataBanks
STEP_1: * 8 BLASTn vs 8 DataBanks
        * 1 BLASTx vs 1 DataBank
        * 1 Tandem Repeat Finder
STEP_2: * 1 EugeneIMM Rice
        * 1 GeneId
        * 4 GeneMarkHMM with 4 matrices
STEP_3: * 1 tBLASTx vs 1 DataBank
        * 1 BLASTn vs 1 DataBank
        * 1 BLASTx vs 1 DataBank
STEP_4: * 2 tBLASTn vs 2 DataBanks
RESULTS FILES (GFF Format)
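The web interface produces a step.xml that drives the steps listed above. The slide does not show the layout of that file, so the sketch below uses an invented one: the element names (`analysis`, `step`, `task`) and attribute names are assumptions, not the TriAnnot schema. Only the step names and run counts come from the slide.

```python
import xml.etree.ElementTree as ET

# (program, number of runs) per step, as listed on the slide.
STEPS = {
    "STEP_0": [("RepeatMasker", 3)],
    "STEP_1": [("BLASTn", 8), ("BLASTx", 1), ("TandemRepeatFinder", 1)],
    "STEP_2": [("EugeneIMM", 1), ("GeneId", 1), ("GeneMarkHMM", 4)],
    "STEP_3": [("tBLASTx", 1), ("BLASTn", 1), ("BLASTx", 1)],
    "STEP_4": [("tBLASTn", 2)],
}


def build_step_xml(sequence_name: str) -> ET.Element:
    """Build a step file tree (hypothetical layout) for one input sequence."""
    root = ET.Element("analysis", {"sequence": sequence_name})
    for step_name, tasks in STEPS.items():
        step = ET.SubElement(root, "step", {"name": step_name})
        for program, runs in tasks:
            ET.SubElement(step, "task", {"program": program, "runs": str(runs)})
    return root


def total_runs(root: ET.Element) -> int:
    """Sum the number of program runs over all steps."""
    return sum(int(task.get("runs")) for task in root.iter("task"))


if __name__ == "__main__":
    root = build_step_xml("wheat_bac_example")
    print(ET.tostring(root, encoding="unicode"))
    print("total program runs:", total_runs(root))
```

Counting the runs this way gives a quick sense of the workload per BAC, which is the motivation for moving the database searches onto the grid.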
WEB INTERFACE PART with:
  * Upload of the BAC sequence in FASTA format
  * Setting of the annotation parameters for the 5 blocks
  * Production of a step.xml
PIPELINE PART (input: wheat sequence):
PIPELINE_GRID PART I (STEP_1A):
        * 5 RepeatMasker (RM)
PIPELINE LOCAL PART:
STEP_1B: * 1 TRF
STEP_2: * 1 EugeneIMM Rice
        * 1 GeneId
        * 4 GeneMarkHMM
STEP_3C: * 3 Gene Modelling
PIPELINE_GRID PART II (STEP_1B, 3A, 3B, 4A, 4B, 5A and 5D):
        * 5 RM, 3 BLASTx, 8 GMap, 6 BLASTp, 1 PFAM, 1 tBLASTn, 14 BLASTn
RESULTS FILES (GFF Format)
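Every part of the pipeline, grid or local, ends in result files in GFF format. A GFF record is one tab-separated line with nine columns (sequence id, source, feature type, start, end, score, strand, phase, attributes). The sketch below reads such a line; the class and function names are hypothetical, and the example record is invented for illustration, not taken from TriAnnot output.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int                # 1-based, inclusive
    end: int                  # 1-based, inclusive
    score: Optional[float]    # None when the column is "."
    strand: str
    phase: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return GffFeature(seqid, source, ftype, int(start), int(end),
                      None if score == "." else float(score),
                      strand, phase, attributes)


if __name__ == "__main__":
    # Invented example record.
    record = "bac_example\tRepeatMasker\trepeat_region\t100\t250\t.\t+\t.\tID=te1"
    print(parse_gff_line(record))
```

Because grid and local steps share this format, their outputs can be merged and loaded into the database without tool-specific parsers.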
TriAnnotPipelineGRID Architecture
[Diagram: server part and grid part]
Server part: Server, User Interface (UI), Bioinformatic package, DB update service
Grid part: Computing Element (CE) running the bioinformatic algorithms; SE with the bioinformatic databases and algorithms; jobs described in JDL from the UI
Tasks along the Server, UI and CE:
  * Get the parameters
  * Create the XML step file
  * Get the input (sequence) file
  * Create the grid environment (JDL, shell scripts)
  * Mask the repeated sequences
  * RepeatMasker/Blast/GMap/HMMer
  * Retrieve the output
  * Fill the database
Numbered flow on the diagram:
  1 - Parameters + input file
  2 - Creation of the XML file
  3 - Copy of the input files
  4 - Creation of the environment
  5 - Job submission
  6 - Job running (BLAST/HMMer/RepeatMasker/GMap)
  7 - Job output
  8 - Output transfer
  9 - DB filling
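Creating the grid environment means writing a JDL file and a shell script for each job before submission from the UI. The sketch below renders a minimal JDL description using common gLite JDL attributes (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`). The file names and the helper names `jdl_list` and `render_jdl` are hypothetical; the slides do not show the JDL that TriAnnot actually generates.

```python
def jdl_list(items: list[str]) -> str:
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def render_jdl(executable: str, arguments: str,
               input_files: list[str], output_files: list[str]) -> str:
    """Render a minimal JDL job description as text."""
    # The wrapper script travels in the input sandbox with the data files.
    sandbox_in = jdl_list([executable, *input_files])
    # Standard streams are returned together with the result files.
    sandbox_out = jdl_list(["std.out", "std.err", *output_files])
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {sandbox_in};",
        f"OutputSandbox = {sandbox_out};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical wrapper script and file names for one BLAST job.
    print(render_jdl("run_blast.sh", "bac.fasta", ["bac.fasta"], ["bac.gff"]))
```

One such file per job maps onto steps 4 to 8 of the numbered flow: the input sandbox carries the sequence to the CE, and the output sandbox brings the GFF result back for database filling.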
2007-2008
TriAnnotPipelineGRID Partners
F. Giacomoni, C. Charpentier, N. Guilhot, F. Choulet, P. Leroy, C. Feuillet
T. Tanaka, H. Ikawa, H. Numa, T. Itoh
M. Alaux, T. Flutre, I. Blanc-Lenfle, S. Reboux, H. Quesneville
B. Haas, F. Legeai
B. Kronmiller
M. Reichstadt, A. Claude, M. Liauzu, A. Mahul