Data Integration Team of the research unit DAP
South Green Bioinformatics activities at CIRAD
Manuel Ruiz, CIP, Lima, 23 January
The Joint Research Unit DAP (Développement et Amélioration des Plantes = Plant Development and Genetic Improvement):
The 2 main research themes are genetics and plant improvement, and development and adaptation.
Studied species:
rice, wheat, sorghum, sugarcane, banana, coconut, oil palm, yam, coffee, rubber tree, cocoa, cotton, apple and olive
IS in development
Haplophyle
MS-DMind
Web portal information systems (IS)
http://southgreen.cirad.fr/ (very soon)
Information systems
• http://tropgenedb.cirad.fr/ : a database that manages genetic and genomic information about tropical crops (banana, cocoa, coconut, coffee, cotton, oil palm, rice, rubber, sugarcane, sorghum)
• http://orygenesdb.cirad.fr/ : an interactive information system for rice reverse genetics (rice)
• http://urgi.versailles.inra.fr/OryzaTagLine/ : a database for the phenotypic characterization of the Génoplante rice insertion line library (rice)
• http://cocoagendb.cirad.fr/ : a Web portal for crossing cocoa phenotypic, genetic and genomic data (cocoa)
a database that manages genetic and genomic information about tropical crops
Version 1.0
• genetic map
• QTL data
• markers: RFLP, RAPD, SSR, etc.
• genotype data
• phenotype data
• germplasm data
http://tropgenedb.cirad.fr/
Chantal Hamelin
Xavier Argout, Gaétan Droc, Pierre Larmande
Analysis
• Analysis pipeline for cDNA: http://esttik.cirad.fr (secure access). Xavier Argout
• Application for SSR marker development: http://sat.cirad.fr/sat. Xavier Argout, Jean-François Rami, Claire Billot
• A methodology for genome-wide searches for orthologs in plants: http://greenphyl.cirad.fr. Christophe Périn, Matthieu Conte, Mathieu Rouard (Bioversity)
Plant genome sequencing
Species                      Genome size (Mb)   Chromosome number   Status
Arabidopsis thaliana         119.2              5                   Complete
Oryza sativa                 389                12                  Complete
Populus trichocarpa          480                19                  Complete
Vitis vinifera               475                19                  Complete
Chlamydomonas reinhardtii    100                17                  Complete
Sorghum bicolor              760                10                  Complete
Medicago truncatula          500                8                   Complete
Physcomitrella patens        511                27                  Complete
Solanum lycopersicum         950                12                  In progress
Triticum aestivum            16500              7                   In progress
Zea mays                     2365               10                  In progress
C Périn
Species and clone: feature; ploidy, heterozygosity; organization; project manager; data; sequencing organism; status

Monocotyledon

Musaceae
• Musa acuminata, Calcutta 4: wild banana; heterozygous diploid AA; 40 BAC
• Musa acuminata, Cavendish Grande Naine: cultivated bananas are sterile, parthenocarpic, vegetatively propagated plants; heterozygous triploid AAA; 6 BAC
• Musa balbisiana, Pisang Klutuk Wulung: wild banana; heterozygous diploid BB; 18 BAC
• Musa BAC sequencing: Global Musa Genomics consortium; Rouard & Baurens; BAC; NIAS; in progress
• Musa acuminata, Pahang doubled haploid: homozygous AA, 2*600 Mb (2*11X); Roux & D'Hont; complete genome; JGI, Genoscope; submitted

Poaceae
• Saccharum hybrid, R570: sugarcane; heterozygous dodecaploid aneuploid (spontaneum and officinarum parents); 12 BAC; International Consortium for Sugarcane Biotechnology (ICSB); D'Hont; BAC; Genoscope; in progress
• Oryza sativa, Japonica: rice is a model organism for monocotyledons; diploid, 2*430 Mb (2*12X); International Rice Genome Sequencing Project (IRGSP); Droc; complete genome; done
• Sorghum bicolor: sorghum; diploid, 2*800 Mb (2*10X); Rami; complete genome; JGI; in progress

Arecaceae
• Elaeis guineensis Jacq: African oil palm; diploid, 2*1000 Mb (2*16X); Billotte; complete genome; to submit

Dicotyledon

Brassicaceae
• Arabidopsis thaliana, Col-0: thale cress is a model organism for dicotyledons; diploid, 2*115 Mb (2*5X); International Collaboration; complete genome; done

Malvaceae
• Theobroma cacao, Criollo: cacao tree; homozygous diploid, 2*380 Mb (2*10X); international consortium of cocoa genomics; Lanaud; complete genome; to submit
Analysis of plant genomes
Manual curation is not sufficient
Function comment fields for all proteins in Swiss-Prot over time (Baumgartner et al., Bioinformatics, 2007)
project 2008-2010
• Develop a platform of structural and functional annotation supported by comparative genomics
• Dedicated to plant and bio-aggressor genomes
• Allowing both automatic predictions and manual curation of genes and transposable elements
• User-friendly, generic, modular, portable, sustainable, upgradable and compatible
BIVI Spo
GDEC
Instances
Stéphanie Sidibe-Bocs
Valentin Guignon, Gaétan Droc
Information system model
Apollo Editors
http://orygenesdb.cirad.fr/cgi-bin/gbrowse/gnpannot
"I suggest that functional predictions can be greatly improved by focusing on how the genes became similar in sequence (i.e., evolution) rather than on the sequence similarity itself."
Jonathan A. Eisen, 1998
Find homologs using phylogenomic analysis
GreenPhylDB uses a phylogenomic method to identify homologous genes
• Orthologous genes are homologous genes that are descended from the last common ancestor through speciation and most probably encode proteins with a similar function in different species
• Paralogous genes are homologous genes that evolved through duplications and may encode proteins with more divergent functions
Homologous genes: orthologs and paralogs
[Diagram: gene tree relating an Arabidopsis gene, rice gene A and rice gene B. Genes separated by a speciation event are orthologs; genes separated by a gene duplication event are paralogs.]
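The distinction between orthologs and paralogs can be sketched in a few lines of Python. This is a minimal illustration, not the GreenPhyl implementation: the tree topology below (a speciation followed by a duplication in the rice lineage) is an assumption made for the example, and the function names are hypothetical.

```python
from itertools import combinations

# Assumed toy gene tree. Internal node: (event, left_child, right_child);
# leaf: gene name (str). The topology is illustrative only.
TREE = ("speciation",
        "Arabidopsis gene",
        ("duplication", "Rice gene A", "Rice gene B"))


def leaves(node):
    """Return the set of gene names found below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, gene_a, gene_b):
    """Return the event label at the last common ancestor of two distinct genes."""
    _, left, right = node
    for child in (left, right):
        # Descend while one subtree still contains both genes.
        if not isinstance(child, str) and {gene_a, gene_b} <= leaves(child):
            return lca_event(child, gene_a, gene_b)
    return node[0]


def classify(tree, gene_a, gene_b):
    """Orthologs diverged at a speciation node, paralogs at a duplication node."""
    event = lca_event(tree, gene_a, gene_b)
    return "orthologs" if event == "speciation" else "paralogs"


relationships = {
    (a, b): classify(TREE, a, b)
    for a, b in combinations(sorted(leaves(TREE)), 2)
}
```

Under this assumed topology, both rice genes are orthologs of the Arabidopsis gene and paralogs of each other.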
GreenPhylDB V1.0 http://greenphyl.cirad.fr/
• Oryza sativa and Arabidopsis thaliana model plants
• Full genome available
• Gene annotation quality
  – TAIR gene database release 8: gene ID like 'At1g12345'
  – TIGR gene database release 5: gene ID like 'Os01g12345'
• Most of the functional evidence
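As a small illustration of the two gene ID styles quoted above, the sketch below assigns a species and chromosome number to an identifier. The regular expressions are assumptions derived only from the two example IDs on the slide; actual database releases may use additional prefixes or suffixes, and `parse_gene_id` is a hypothetical helper name.

```python
import re

# Patterns inferred from the slide's examples ('At1g12345', 'Os01g12345').
# They are assumptions and may not cover every ID in a real release.
GENE_ID_PATTERNS = {
    "Arabidopsis thaliana": re.compile(r"^At(\d)g(\d{5})$"),
    "Oryza sativa": re.compile(r"^Os(\d{2})g(\d{5})$"),
}


def parse_gene_id(gene_id):
    """Return (species, chromosome, locus number) or None if no pattern matches."""
    for species, pattern in GENE_ID_PATTERNS.items():
        match = pattern.match(gene_id)
        if match:
            return species, int(match.group(1)), match.group(2)
    return None


arabidopsis_example = parse_gene_id("At1g12345")
rice_example = parse_gene_id("Os01g12345")
```

Such a helper is one simple way to index genes by species before clustering.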
Phylogenomic pipeline: GreenPhyl

FILTERING
• Gene id indexing: GI*
• Splice selection: SS*
• Low-complexity masking: CAST

MULTIPLE ALIGNMENT
• Alignment: MAFFT
• Alignment refinement: Rascal
• Filtering procedure: LEON*
• Alignment masking: AL2CO

TREE CONSTRUCTION
• Bootstrapping alignment (x100): SeqBoot
• Genetic distance (x100): ProtDist
• Tree construction (x100): PHYML
• Rooting tree (x100): SDI
• Set bootstrap values on PHYML tree: SB*

ORTHOLOG INFERENCE
• Ortholog inference: DoRIO
• Gene id indexing: GI*

Output: ortholog predictions (.txt & NHX files)
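The bootstrapping step of the pipeline (100 resampled alignments) can be illustrated with a minimal Python sketch. It shows the general technique of resampling alignment columns with replacement; it is not SeqBoot itself, and the toy sequences, gene names and function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of bootstrap replicates (x100 by default, as on the slide)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy protein alignment with made-up sequences, for illustration only.
toy_alignment = {
    "At1g12345": "MKT-LLV",
    "Os01g12345": "MRTALLV",
    "Os02g54321": "MKSALIV",
}
replicates = bootstrap_replicates(toy_alignment)
```

Each replicate keeps the same sequences and alignment length, so a distance matrix and a tree can be computed on every one of them and the support of each branch counted across the 100 trees.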
30,500 Arabidopsis genes (TAIR)
50,200 rice genes (TIGR)
4,400 phylogenetically analyzed gene families
GreenPhyl phylogenomic pipeline
24,000 ortholog relationships between rice and Arabidopsis (probable same function)
6,420 manually validated gene families
Automatic clustering procedure
i-GOST (Iterative GreenPhyl Ortholog Search Tool)
2 objectives
1. Add information to rice or Arabidopsis with a gene from a new, particularly well-studied species
   Query: gene with KNOWN biological information
2. Get information on a newly sequenced gene using information available from rice or Arabidopsis
   Query: gene with UNKNOWN function
   Add biological information
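The second objective, getting information on a gene of unknown function from its rice or Arabidopsis orthologs, can be sketched as a simple lookup. This is only an illustration of annotation transfer through ortholog relationships and says nothing about how i-GOST is actually implemented; the gene names, the annotation text and the function name are all hypothetical.

```python
def transfer_annotation(query_gene, orthologs, annotations):
    """Collect annotations of the model-plant orthologs of a query gene."""
    transferred = []
    for model_gene in orthologs.get(query_gene, []):
        function = annotations.get(model_gene)
        if function is not None:
            transferred.append((model_gene, function))
    return transferred


# Hypothetical data: one query gene with two predicted orthologs,
# only one of which carries a functional annotation.
ORTHOLOGS = {"NewSp_0001": ["At1g12345", "Os01g12345"]}
ANNOTATIONS = {"At1g12345": "example function label"}

candidate_functions = transfer_annotation("NewSp_0001", ORTHOLOGS, ANNOTATIONS)
```

Because orthologs only "most probably" share a function, the result is best read as a list of candidate annotations to be checked, not as a final assignment.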
GreenPhylDB V2.0 … in progress
Objectives
• 10 new fully sequenced genomes are now available (Populus alba, Glycine max, Sorghum bicolor, Medicago truncatula, Vitis vinifera, Selaginella moellendorffii, Physcomitrella patens, Ostreococcus tauri, Chlamydomonas reinhardtii, Cyanidioschyzon merolae)
• Why integrate these species?
1. Complete sequencing and gene prediction
2. Will provide the complete list of plant gene families!
3. Use functional information available on these species
4. Reinforce the phylogenomic signal and thus the ortholog predictions
5. Have a good taxonomic sampling
GreenPhylDB V2.0: a huge database…
• GreenPhyl Database V1.0: 2 species, 81,000 genes, 21,400 clusters, 6,400 gene families
• 10 new species, ~300,000 sequences
• GreenPhyl Database V2.0: ~390,000 sequences, ~25,000 clusters
Functional Annotation
Knowledge modeling of the structure-function relationships
The insulin receptor pathway
Knowledge modeling of the sequence-structure relationships
project 2009 - 2011
GCP: Generation Challenge Programme
A global consortium of crop research institutes, established in 2003 with an approximately 10-year mandate to integrate comparative genomics and the molecular characterisation of genetic resources into plant breeding for stress tolerance, in particular in drought-prone environments.
http://www.generationcp.org
Gautier Sarah
HaploPhyle: methodology development for the reconstruction of genealogies based on haplotypes, related to geographic patterns (graphical haplotype network in the light of external data)
GenDiversity is a query and analysis application combining genotyping data from diverse data sources, developed in support of diversity studies.
Data Integration
CIMMYT, CIRAD, IRRI, CIP, ICRISAT, etc.: raw data
Scientists
non GCP DB
GCP DB
Software analysis
Platform
Data Integration
Platform architecture
ID, SRG, SEG, DAR, GS, DGB, DIA-PC, GDP
Partnership
• Agropolis: CINES; LIRMM (O. Gascuel, I. Mougenot); CIRAD biometrics teams (X. Perrier, L. Baudouin)
• France: INRA, Genoscope, CNG…
• International: Swissprot, GMOD consortium, Biotec (Thailand), GCP program, Bioversity, CIP, IRRI, CIMMYT, EMBRAPA, …
Agropolis Plant Bioinformatics (genetics and genomics)
• ATGC: LIRMM Bioinformatics platform
• New algorithms
• UMRs DAP, DIAPC, BGPI, LSTM, SPO, RPB, BIVI, LGDP
• Biological resources, genomics, genetic resources, evolution, sequence analysis, analysis of gene expression, genetic diversity
• High Power Computing: CINES, Montpellier