VectorBaseVectorBase
Gene expression data in VectorBase
Fotis Kafatos, George Christophides, Bob MacCallum & Seth Redmond
Imperial College London
(thanks also to EBI, Sanger and ND)
Outline
1. Project goals
2. What’s currently available
3. Current challenges and future plans
Project goals
• For vector biologists:
  – Easy access to gene expression data
  – Consistent data processing
• For array specialists:
  – ArrayExpress submission
  – Advanced analysis tools
  – Array annotation
[Diagram: bulk loader, expression data, storage & analysis]
• BASE: BioArray Software Environment
• http://base.thep.lu.se/
• Open source, active development and user community
• LIMS, data storage, export and analysis
• Web-based, user/group access control
• BASE 2.x adoption will bring Affymetrix support
Data submission
• Community submission guidelines available
• First batch of experiments loaded by us
• Bulk data loader
• Sample/experiment annotation requires intervention from curators
[Diagram, extended: bulk loader, expression data, storage & analysis, ArrayExpress (‘public’ storage)]
• Data held in BASE is largely MIAME compliant
• Script for semi-automated export in Tab2MAGE format
• One experiment submitted so far
[Diagram, extended: bulk loader, expression data, storage & analysis, ArrayExpress (‘public’ storage), data summaries]
• BASE web interface offers powerful and extendable analysis environment
• Can be used for multi-site collaborations on pre-publication data
• Steep learning curve/not 100% intuitive
• Not easily linked to
• We provide simpler views so the casual user can quickly draw biological inferences
Standardised data
All displayed data is processed in the same way:
1. Poor quality spots removed
   • Currently using submitted spot flags
2. Normalisation
   • “lowess” for two-colour experiments
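The two processing steps above can be sketched in a few lines. This is a minimal illustration, not the VectorBase pipeline itself: the spot field names (`red`, `green`, `flagged`), the smoothing span of 0.4, and the omission of robustness iterations are all assumptions made for the example. It drops flagged spots, then fits a local linear trend of the log ratio M against the mean log intensity A and subtracts it.

```python
import math


def lowess_fit(x, y, frac=0.4):
    """Local linear regression with tricube weights (no robustness iterations)."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # neighbours used in each local fit
    fitted = []
    for xi in x:
        # The distance to the k-th nearest point sets the window width.
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[k - 1] or 1e-12
        w = [(1 - min(abs(xj - xi) / h, 1.0) ** 3) ** 3 for xj in x]
        sw = sum(w)
        swx = sum(wi * xj for wi, xj in zip(w, x))
        swy = sum(wi * yj for wi, yj in zip(w, y))
        swxx = sum(wi * xj * xj for wi, xj in zip(w, x))
        swxy = sum(wi * xj * yj for wi, xj, yj in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: use the weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * xi)
    return fitted


def normalise(spots, frac=0.4):
    """Drop flagged spots, then remove the intensity-dependent trend in M.

    `spots` is a list of dicts with hypothetical keys 'red', 'green'
    (intensities) and 'flagged' (submitted quality flag).
    """
    good = [s for s in spots
            if not s["flagged"] and s["red"] > 0 and s["green"] > 0]
    m = [math.log2(s["red"] / s["green"]) for s in good]          # log ratio
    a = [0.5 * math.log2(s["red"] * s["green"]) for s in good]    # mean log intensity
    trend = lowess_fit(a, m, frac)
    return [dict(s, M=mi - ti, A=ai) for s, mi, ti, ai in zip(good, m, trend, a)]
```

Filtering before fitting matters here: a handful of bad spots can otherwise pull the fitted trend and distort the correction applied to every other spot.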
[Diagram, extended: bulk loader, expression data, storage & analysis, ArrayExpress (‘public’ storage), data summaries, probe mapping]
• 3 probe types
• 6 array designs
• Mapping handled via Ensembl pipeline:
  – Oligo: exonerate
  – PCR: e-PCR
  – cDNA: exonerate2genes
[Diagram: genomic data, automatic annotation, genome browser]
[Diagram, extended: bulk loader, expression data, storage & analysis, ArrayExpress (‘public’ storage), data summaries, probe mapping, GFF3]
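The GFF3 label in the diagram suggests that probe mappings are handed on as GFF3 features. As a rough illustration of what one such record could look like, the sketch below formats a single probe alignment as a nine-column GFF3 line. The function name, the `exonerate` source value, the `match` feature type, and the example identifiers are assumptions for illustration; the slides do not specify them.

```python
from urllib.parse import quote


def probe_to_gff3(seqid, start, end, strand, probe_id,
                  source="exonerate", score="."):
    """Format one probe alignment as a tab-separated, 9-column GFF3 line.

    Coordinates are 1-based and inclusive, as GFF3 requires.
    """
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    # Reserved characters in attribute values are percent-encoded.
    safe_id = quote(probe_id, safe="")
    attributes = "ID={0};Name={0}".format(safe_id)
    columns = [seqid, source, "match", str(start), str(end),
               str(score), strand, ".", attributes]
    return "\t".join(columns)


if __name__ == "__main__":
    # Hypothetical probe on a hypothetical sequence region.
    print("##gff-version 3")
    print(probe_to_gff3("2L", 1000, 1059, "+", "probe_0001"))
```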
contigview
featureview
[Diagram, complete: bulk loader, expression data, storage & analysis, ArrayExpress (‘public’ storage), genomic data, automatic annotation, genome browser, data summaries, probe mapping, data mining; users: vector biologists, array biologists, genome biologists]
BioMart
• Beta version currently available
  – http://base.vectorbase.org:9999/biomart/martview
• Improvements still needed:
  – Experiment annotations
  – Alignments (i.e. handle split alignments)
• Federation with current marts
• Integration with new data?
Current challenges and future plans
• How do you want to query?
• CVs & ontologies
• APIs
• Community submission
• Manual annotation
Querying strategy
• What do you want to query on?
  – Fetch all genes upregulated under condition X
  – Fetch all experiments with gene X and condition Y
  – Fetch all probes with expression similar to probe X
• All essentially boil down to:
  – Define probe (genes etc)
  – Define significant expression
    • ANOVA?
    • Up/down-regulation WRT what?
  – Define experimental conditions
    • Sample annotation
    • Experimental design
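To make the three ingredients concrete (probe or gene, a definition of significance, and an experimental condition), here is a minimal in-memory sketch of the first query, "fetch all genes upregulated under condition X". The record layout, the thresholds (log2 ratio of at least 1, p-value of at most 0.05), and all identifiers are assumptions chosen for illustration; how significance should really be defined is exactly the open question raised above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    probe: str
    gene: str          # gene the probe maps to ("" if unmapped)
    condition: str     # sample annotation, ideally a controlled-vocabulary term
    log2_ratio: float  # expression relative to the experiment's reference
    p_value: float


def genes_upregulated(measurements, condition, min_log2=1.0, max_p=0.05):
    """Return sorted gene IDs with at least one probe up-regulated under `condition`."""
    return sorted({
        m.gene for m in measurements
        if m.gene
        and m.condition == condition
        and m.log2_ratio >= min_log2
        and m.p_value <= max_p
    })


if __name__ == "__main__":
    # Hypothetical data: identifiers and values are made up.
    data = [
        Measurement("p1", "GENE_A", "blood-fed", 2.1, 0.001),
        Measurement("p2", "GENE_B", "blood-fed", 0.3, 0.400),
        Measurement("p3", "", "blood-fed", 3.0, 0.001),
        Measurement("p4", "GENE_C", "sugar-fed", 2.5, 0.010),
    ]
    print(genes_upregulated(data, "blood-fed"))
```

The condition is matched as an exact string here, which only works across experiments if sample annotations use shared terms.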
[Diagram, extended: bulk loader, expression data, storage & analysis, ArrayExpress (‘public’ storage), genomic data, automatic annotation, genome browser, data summaries, probe mapping, data mining, CV / ontology; users: vector biologists, array biologists, genome biologists]
[Diagram: same components, with access routes marked: Array API?, AE API?, e! API, MartJ / MQL]
Array API
Perl / Java objects for retrieval / handling of array data
– Dual purpose:
  • Consistency & efficiency of VB expression website
  • Computational access to VB data for all
– Objects must be:
  • General, DB-independent
  • Compatible with pre-existing Bio API (BioPerl / BioJava)
– NB: there may be a pre-existing solution:
  • ArrayExpress API?
  • BioPerl-Expression?
  • MAGE-OM-stk
• http://neuron.cse.nd.edu/vectorbase/index.php/Array_API_proposal
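The "general, DB-independent" requirement can be illustrated with a small interface sketch. The proposal targets Perl and Java; Python is used here only to keep the illustration short, and every class and method name below is hypothetical rather than part of any proposed or existing API. The point is that calling code depends only on an abstract data source, so the backing store (BASE, flat files, or something else) can be swapped without changing it.

```python
from abc import ABC, abstractmethod


class ArrayDataSource(ABC):
    """Hypothetical storage-independent interface; each backend subclasses it."""

    @abstractmethod
    def probes_for_gene(self, gene_id):
        """Return the probe IDs mapped to a gene."""

    @abstractmethod
    def expression(self, probe_id, experiment_id):
        """Return the expression value, or None if not measured."""


class InMemorySource(ArrayDataSource):
    """Toy backend used for the example; a database backend would look the same to callers."""

    def __init__(self, gene_to_probes, values):
        self._gene_to_probes = gene_to_probes
        self._values = values

    def probes_for_gene(self, gene_id):
        return list(self._gene_to_probes.get(gene_id, []))

    def expression(self, probe_id, experiment_id):
        return self._values.get((probe_id, experiment_id))


def gene_expression(source, gene_id, experiment_id):
    """Mean value over a gene's measured probes, using only the interface."""
    values = [source.expression(p, experiment_id)
              for p in source.probes_for_gene(gene_id)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```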
Community data submission
• Carrot?
  – Help with ArrayExpress submission
  – Analysis tools
  – Dissemination
• Stick?
  – Outreach (courses, conferences)
  – Networking
GE data manual annotators
• Gene-build designed arrays
  – Negative evidence less compelling
• EST clone-based arrays
  – http://tinyurl.com/vlkwo
Longer term plans
• Host-parasite GE data integration & analysis
• GE clusters, “upstream” regions, regulatory elements, upstream TFs
• RNAi phenotypes
• Images
CVs & ontologies
• Integrate MGED and specialist ontologies for
  – Body parts
  – Developmental stages
  – Disease processes
  – …
• Allows comparison across experiments with similar experimental conditions
BioMart
Most biomarts:
• Gene-based
• Mostly ‘binary’ data
  – e.g. a gene either has a signal domain or doesn’t
• Easily linked with other (gene-based) biomarts
VB Biomart:
• Probe-based
  – Many probes not aligned
• Exp data less clear
  – e.g. define ‘differential expression’
• Exports gene/transcript IDs for linking to other Marts
Clustering
• A priority?
• Easy to do on reporter level within experiments
• Harder to do at gene level across all experiments
  – Binary gene profile: “yes/no differentially expressed in experiment”?
• Amazon-style links to “genes which may have similar expression profiles”?
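One way the binary gene profile idea could work is sketched below: each gene is represented by the set of experiments in which it was called differentially expressed, and genes are ranked by Jaccard similarity of those sets. The choice of Jaccard similarity, the 0.5 cut-off, and the identifiers are assumptions for illustration only; the slides leave the method open.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of experiment IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_genes(profiles, gene, top_n=3, min_score=0.5):
    """Rank other genes by overlap of their binary expression profiles.

    `profiles` maps a gene ID to the set of experiments in which that gene
    was called differentially expressed (the "yes" entries of its profile).
    """
    target = profiles[gene]
    scored = [(jaccard(target, profile), other)
              for other, profile in profiles.items() if other != gene]
    scored = [(score, other) for score, other in scored if score >= min_score]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by ID
    return [other for _, other in scored[:top_n]]


if __name__ == "__main__":
    # Hypothetical profiles: gene and experiment IDs are made up.
    profiles = {
        "GENE_A": {"E1", "E2", "E3"},
        "GENE_B": {"E1", "E2", "E3"},
        "GENE_C": {"E1", "E2"},
        "GENE_D": {"E4"},
    }
    print(similar_genes(profiles, "GENE_A"))
```

A set-based profile sidesteps the problem of comparing raw values across array designs, at the cost of discarding direction and magnitude of change.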
BASE 2.x
• Adoption delayed, now in progress
• Brings Affymetrix support
• Cleaner/modern interface
• Better API (Java)