
Page 1: EnsMart: A Generic System for Fast and Flexible Access to Biological Data Arek Kasprzyk et al (2004) 14:160-169, Genome research EBI, Wellcome Trust

EnsMart: A Generic System for Fast and Flexible Access to Biological Data

Arek Kasprzyk et al. (2004) Genome Research 14:160-169

EBI, Wellcome Trust

Page 2: Objectives

• Understand the idea of a “Data Mart”.
• Understand why this idea is useful to biology.
• Have an idea of how EnsMart works.
• Assess the significance of the EnsMart system. Will it last?

Page 3: Data Mart defined

• A database, potentially derived from many other databases, whose primary purpose is query processing and report generation for non-technical users.
• Similar to a “Data Warehouse”.
• Marts and warehouses are important components of “decision support systems” in business.

Page 4: Data Mart in EnsMart

• Data collected

• Standardized

• Query Optimized

• Presented to Users
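
The four steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source record lists, their field names, and the `mart_gene` table are invented for the example and are not taken from EnsMart itself.

```python
import sqlite3

# Hypothetical records from two source databases with different conventions.
SOURCE_A = [{"gene": "BRCA2", "chromosome": "chr13"},
            {"gene": "TP53", "chromosome": "chr17"}]
SOURCE_B = [{"symbol": "brca2", "disease": "breast cancer"}]


def strip_chr(name):
    """Standardize chromosome names by dropping a leading 'chr'."""
    return name[3:] if name.startswith("chr") else name


def build_mart(source_a, source_b):
    """Collect, standardize, and load records into one query-oriented table."""
    # Steps 1 and 2: collect both sources and bring them to one convention.
    diseases = {r["symbol"].upper(): r["disease"] for r in source_b}
    rows = [(r["gene"].upper(),
             strip_chr(r["chromosome"]),
             diseases.get(r["gene"].upper()))
            for r in source_a]
    # Step 3: one denormalized table plus an index on the filter column.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mart_gene (symbol TEXT, chrom TEXT, disease TEXT)")
    conn.executemany("INSERT INTO mart_gene VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_mart_gene_chrom ON mart_gene (chrom)")
    return conn


def genes_on(conn, chrom):
    """Step 4: answer a simple question without exposing the sources."""
    cur = conn.execute(
        "SELECT symbol, disease FROM mart_gene WHERE chrom = ? ORDER BY symbol",
        (chrom,))
    return cur.fetchall()


mart = build_mart(SOURCE_A, SOURCE_B)
print(genes_on(mart, "13"))  # [('BRCA2', 'breast cancer')]
```

The user-facing function never touches the source records directly, which is the separation the mart idea relies on.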

Page 5: Marts – benefits

• Allows good division of labor
– Computers for transactions separate from computers for queries.
– Interface development separate from database development.
– Biologists (can be) separated from computer scientists as a result of good interface design.
– Produces a faster, more stable system for users.

Page 6: Costs

• Construction of the Mart is a challenging and continuous process.
• New sources of data need to be incorporated and validated constantly.
• Trust

Page 7: The case for EnsMart, why now?

• Growing number of different databases and opportunities: genomes, expression, protein, disease…
• Assembled, high-quality genomes available.
– “Finished” genomes can be used as references to link data from different databases consistently.
• EnsMart built to take advantage of the opportunities for cross-database queries.
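
To make the point about reference genomes concrete, here is a minimal sketch of linking records from two independent sources purely through shared genome coordinates. The `Gene` and `Snp` types, the identifiers, and the coordinates are all hypothetical; the nested loop is chosen for clarity, not speed.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int


def snps_in_genes(genes, snps):
    """Pair each SNP with every gene whose interval contains its position."""
    hits = []
    for snp in snps:
        for gene in genes:
            same_chrom = snp.chrom == gene.chrom
            if same_chrom and gene.start <= snp.pos <= gene.end:
                hits.append((snp.snp_id, gene.symbol))
    return sorted(hits)


# Made-up records, as if they came from a gene source and a SNP source.
genes = [Gene("GENE_A", "1", 100, 500), Gene("GENE_B", "2", 100, 500)]
snps = [Snp("snp1", "1", 250), Snp("snp2", "2", 900), Snp("snp3", "2", 100)]

print(snps_in_genes(genes, snps))  # [('snp1', 'GENE_A'), ('snp3', 'GENE_B')]
```

Neither source needs to know the other's identifiers; the assembled genome supplies the common coordinate system.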

Page 8: Inside EnsMart

• 9 organisms
• At least 17 different primary sources of data, many with multiple databases.
• 2 kinds of “Foci”
– Genes: Ensembl, EST, Vega
– SNPs

Page 9: EnsMart schema

[Figure: schema diagram for Focus 1, with relationships labeled “One” and “Many”.]
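
The one-to-many layout around a focus can be sketched with SQLite. The table names, column names, and rows below are assumptions made for illustration; they are not EnsMart's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central ("focus") table: exactly one row per gene.
    CREATE TABLE gene_main (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT,
        chrom   TEXT
    );
    -- Satellite tables: zero or more rows per gene, keyed by gene_id.
    CREATE TABLE gene_xref (
        gene_id   INTEGER REFERENCES gene_main (gene_id),
        source    TEXT,
        accession TEXT
    );
    CREATE TABLE gene_disease (
        gene_id INTEGER REFERENCES gene_main (gene_id),
        disease TEXT
    );
    INSERT INTO gene_main VALUES (1, 'GENE_A', '1'), (2, 'GENE_B', '2');
    INSERT INTO gene_xref VALUES (1, 'db_x', 'X001'), (1, 'db_y', 'Y001'),
                                 (2, 'db_x', 'X002');
    INSERT INTO gene_disease VALUES (1, 'example disease');
""")


def xrefs_for(symbol):
    """Follow the one-to-many link from the focus table to one satellite."""
    cur = conn.execute(
        """SELECT x.source, x.accession
           FROM gene_main AS g
           JOIN gene_xref AS x ON x.gene_id = g.gene_id
           WHERE g.symbol = ?
           ORDER BY x.source""",
        (symbol,))
    return cur.fetchall()


print(xrefs_for("GENE_A"))  # [('db_x', 'X001'), ('db_y', 'Y001')]
```

Every satellite joins back to the focus through the same key, so a query only needs to touch the satellites its filters or outputs mention.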

Page 10: EnsMart schema: another focus

Page 11: Schema -> Query Speed

• “Central” tables, or foci, contain a binary value for each satellite table indicating whether matching rows exist. The first step in query generation uses these values to limit the range of satellite tables accessed.
• These values are useful only in the query process (they take extra space and time for transactions).
• As a result, many queries may not require access to satellite tables at all.
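
A small sketch of the existence-flag idea, again using SQLite with invented table and column names (`has_disease_bool` is a hypothetical flag, not a documented EnsMart column). The flag is filled in once when the mart is built, after which an existence question can be answered from the central table alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene_main (
        gene_id          INTEGER PRIMARY KEY,
        symbol           TEXT,
        has_disease_bool INTEGER   -- 1 if gene_disease holds a row for this gene
    );
    CREATE TABLE gene_disease (gene_id INTEGER, disease TEXT);

    INSERT INTO gene_main (gene_id, symbol)
        VALUES (1, 'GENE_A'), (2, 'GENE_B'), (3, 'GENE_C');
    INSERT INTO gene_disease
        VALUES (1, 'example disease 1'),
               (3, 'example disease 2'),
               (3, 'example disease 3');

    -- Computed once at build time: extra space and load cost, paid up front.
    UPDATE gene_main SET has_disease_bool = EXISTS (
        SELECT 1 FROM gene_disease AS d WHERE d.gene_id = gene_main.gene_id
    );
""")


def with_disease_via_flag():
    """Answer from the central table only; no satellite access."""
    cur = conn.execute(
        "SELECT symbol FROM gene_main WHERE has_disease_bool = 1 ORDER BY symbol")
    return [row[0] for row in cur]


def with_disease_via_join():
    """Same question answered the slower way, by joining the satellite."""
    cur = conn.execute(
        """SELECT DISTINCT g.symbol
           FROM gene_main AS g
           JOIN gene_disease AS d ON d.gene_id = g.gene_id
           ORDER BY g.symbol""")
    return [row[0] for row in cur]


print(with_disease_via_flag())  # ['GENE_A', 'GENE_C']
```

Both functions return the same genes; the flag version simply avoids reading `gene_disease` at query time.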

Page 12: User Interfaces

Supposedly Confucian quote:
– "What I hear I forget.
– What I see I remember.
– What I do I understand."

Page 13: User Interfaces

• MartView: website, “wizard” query construction.
• MartExplorer: stand-alone tool, tree-based query construction.
• MartShell: text-based application that uses an SQL-like query language. Can be used interactively or in batch processes.
• Write your own, using the MartLib Java library!

Page 14: MartView 1

Choose organism and focus

Page 15: MartView 2

Design query

Page 16: MartView 3

Specify output

Page 17: MartExplorer

Page 18: MartShell

Page 19: Conclusions

• Powerful query system for biologists.
• Useful framework for software engineers.
– All open source!
• What about other loci such as repetitive elements?
• Data validation?
• Annotation updates?

Page 20: EnsMart Discussion

• What, if any, are the problems with the foci system?
• What alternatives to this system exist?
• Describe a task that EnsMart could be used to accomplish.
• Describe any personal experiences with EnsMart.