EnsMart: A Generic System for Fast and Flexible Access to Biological Data
Arek Kasprzyk et al. (2004), Genome Research 14:160-169
EBI, Wellcome Trust
Objectives
• Understand the idea of a “Data Mart”
• Understand why this idea is useful to biology
• Have an idea of how EnsMart works
• Assess the significance of the EnsMart system. Will it last?
Data Mart defined
• A database, potentially derived from many other databases, whose primary purpose is query processing and report generation for non-technical users.
• Similar to a “Data Warehouse”
• Marts and warehouses are important components of “decision support systems” in business.
Data Mart in EnsMart
• Data collected
• Standardized
• Query Optimized
• Presented to Users
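The four steps above can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not EnsMart's actual build process: the two source record sets, their field names, and every function name here are hypothetical.

```python
from collections import defaultdict

# Hypothetical records from two upstream sources with different conventions.
SOURCE_A = [{"GeneID": "g1", "Chr": "chr1", "Disease": "asthma"}]
SOURCE_B = [
    {"gene": "G1", "chromosome": "1", "expressed_in": "lung"},
    {"gene": "G2", "chromosome": "X", "expressed_in": "liver"},
]


def standardize(record):
    """Map source-specific field names and spellings onto one shared schema."""
    gene = record.get("GeneID") or record.get("gene")
    chrom = record.get("Chr") or record.get("chromosome")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return {
        "gene": gene.upper(),
        "chromosome": chrom,
        "disease": record.get("Disease"),
        "tissue": record.get("expressed_in"),
    }


def build_mart(*sources):
    """Collect and standardize records, merging them into one row per gene."""
    mart = {}
    for source in sources:
        for record in source:
            row = standardize(record)
            merged = mart.setdefault(row["gene"], {})
            for key, value in row.items():
                if value is not None:
                    merged[key] = value
    return mart


def build_index(mart, field):
    """Query optimization: precompute a lookup from a field value to genes."""
    index = defaultdict(list)
    for gene, row in mart.items():
        if field in row:
            index[row[field]].append(gene)
    return dict(index)


def present(mart, genes, columns):
    """Render the selected rows as tab-separated text for the user."""
    lines = ["\t".join(columns)]
    for gene in genes:
        lines.append("\t".join(str(mart[gene].get(c, "")) for c in columns))
    return "\n".join(lines)


mart = build_mart(SOURCE_A, SOURCE_B)
by_chromosome = build_index(mart, "chromosome")
report = present(mart, by_chromosome["1"], ["gene", "disease", "tissue"])
print(report)
```

The index is built once, when the mart is constructed, so that later queries do not have to scan every row.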
Marts – benefits
• Allows good division of labor
– Computers for transactions separate from computers for queries
– Interface development separate from database development.
– Biologists (can be) separated from computer scientists as a result of good interface design.
– Produces faster, more stable system for users
Costs
• Construction of the Mart is a challenging and continuous process.
• New sources of data need to be incorporated and validated constantly
• Trust
The case for EnsMart, why now?
• Growing number of different databases and opportunities. Genomes, expression, protein, disease…
• Assembled, high quality genomes available.
– “Finished” genomes can be used as references to link data from different databases consistently.
• EnsMart built to take advantage of the opportunities for cross-database queries.
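As a rough illustration of using a reference genome to link databases: if records from independent sources are placed on the same coordinate system, they can be joined by position alone. The sketch below assumes hypothetical gene and SNP records and a simple interval-overlap test. It is not taken from the paper.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Hypothetical records, as if loaded from two independent databases that
# share nothing except coordinates on the same reference assembly.
GENES = [Feature("geneA", "1", 100, 500), Feature("geneB", "2", 100, 500)]
SNPS = [
    Feature("snp1", "1", 250, 250),
    Feature("snp2", "1", 900, 900),
    Feature("snp3", "2", 500, 500),
]


def overlaps(a, b):
    """True when both features lie on the same chromosome and intersect."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def link(genes, snps):
    """Cross-database join: list the SNPs that fall inside each gene."""
    return {g.name: [s.name for s in snps if overlaps(g, s)] for g in genes}


links = link(GENES, SNPS)
print(links)
```

Neither record set refers to the other's identifiers. The shared coordinates are the only link, which is why a stable, finished assembly matters.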
Inside EnsMart
• 9 organisms
• At least 17 different primary sources of data, many with multiple databases.
• 2 kinds of “Foci”
– Genes: Ensembl, EST, Vega
– SNPs
EnsMart schema
[Diagram: central table “Focus 1” linked to surrounding tables; relationship labels “One” and “Many”]
EnsMart schema: another focus
Schema -> Query Speed
• “Central” tables or foci contain binary values for each satellite indicating existence. First step in query generation limits the range of satellite tables accessed.
• These values are only useful in the query process (take extra space and time for transactions).
• Many queries may not require access to satellite tables as a result.
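A minimal sketch of the flag idea, using an in-memory SQLite database. The table and column names (gene_main, has_disease, has_snp and so on) are made up for illustration and are not EnsMart's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central (focus) table: one row per gene, plus one flag per satellite.
    CREATE TABLE gene_main (
        gene_id     INTEGER PRIMARY KEY,
        name        TEXT,
        chromosome  TEXT,
        has_disease INTEGER NOT NULL,  -- 1 if the gene has rows in gene_disease
        has_snp     INTEGER NOT NULL   -- 1 if the gene has rows in gene_snp
    );
    -- Satellite tables: possibly many rows per gene.
    CREATE TABLE gene_disease (gene_id INTEGER, disease TEXT);
    CREATE TABLE gene_snp (gene_id INTEGER, snp_name TEXT);

    INSERT INTO gene_main VALUES
        (1, 'geneA', '1', 1, 1),
        (2, 'geneB', '1', 0, 1),
        (3, 'geneC', '2', 1, 0);
    INSERT INTO gene_disease VALUES (1, 'asthma'), (3, 'diabetes');
    INSERT INTO gene_snp VALUES (1, 'snp1'), (1, 'snp2'), (2, 'snp3');
""")

# Existence question ("which chromosome 1 genes have a disease link?"):
# answered from the central table alone, with no satellite access.
genes_with_disease = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM gene_main "
        "WHERE has_disease = 1 AND chromosome = '1' ORDER BY name"
    )
]

# Detail question: the flag narrows the candidate genes before the
# satellite table is joined in.
disease_details = conn.execute("""
    SELECT m.name, d.disease
    FROM gene_main AS m
    JOIN gene_disease AS d ON d.gene_id = m.gene_id
    WHERE m.has_disease = 1
    ORDER BY m.name
""").fetchall()

print(genes_with_disease)
print(disease_details)
```

The flags are redundant with the satellite tables, so they must be recomputed whenever the mart is rebuilt. That is the extra space and transaction cost noted above.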
User Interfaces
Supposedly Confucian quote
– "What I hear I forget.
– What I see I remember.
– What I do I understand."
User Interfaces
• MartView: website, “wizard” query construction.
• MartExplorer: stand-alone tool, tree-based query construction.
• MartShell: text-based application that utilizes an SQL-like query language. Can be used interactively or in batch processes.
• Write your own! – using MartLib Java library
MartView 1: Choose organism and focus
MartView 2: Design query
MartView 3: Specify output
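The three wizard steps map naturally onto a three-part query: a dataset, a set of filters, and a list of output attributes. The sketch below is a hypothetical model of that structure in Python. It is not the MartLib API or MartShell syntax, and the MartQuery class, the human_gene table and its columns are invented for the example.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class MartQuery:
    dataset: str                                    # step 1: organism and focus
    filters: dict = field(default_factory=dict)     # step 2: restrictions
    attributes: list = field(default_factory=list)  # step 3: output columns

    def to_sql(self):
        """Compile to a parameterized SELECT statement.

        Table and column names are interpolated directly, so they must come
        from a fixed menu (as in a wizard), never from free text. Filter
        values are passed as bound parameters.
        """
        columns = ", ".join(self.attributes) or "*"
        sql = f"SELECT {columns} FROM {self.dataset}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{name} = ?" for name in self.filters)
        return sql, list(self.filters.values())


# Toy stand-in for a mart table (hypothetical contents).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE human_gene (name TEXT, chromosome TEXT, biotype TEXT)")
conn.executemany(
    "INSERT INTO human_gene VALUES (?, ?, ?)",
    [
        ("geneA", "1", "protein_coding"),
        ("geneB", "1", "pseudogene"),
        ("geneC", "2", "protein_coding"),
    ],
)

query = MartQuery(
    dataset="human_gene",
    filters={"chromosome": "1", "biotype": "protein_coding"},
    attributes=["name", "chromosome"],
)
sql, params = query.to_sql()
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Keeping the query as plain data means a website, a tree-based tool, and a text shell could all build the same object and share one execution path.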
MartExplorer
MartShell
Conclusions
• Powerful query system for biologists.
• Useful framework for software engineers.
– All open source!
• What about other loci such as repetitive elements?
• Data validation?
• Annotation updates?
EnsMart Discussion
• What, if any, are the problems with the foci system?
• What alternatives to this system exist?
• Describe a task that EnsMart could be used to accomplish.
• Describe any personal experiences with EnsMart.