Page 1
Integrated Microbial Genomes (IMG) System
Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto
Biological Data Management & Technology Center, Lawrence Berkeley National Lab
Nikos C. Kyrpides, Natalia N. Ivanova
Microbial Genome Analysis Program, Joint Genome Institute
A Case Study in Biological Data Management
Different views on biological data management (VLDB 2004 Panel on Biological Data Management)
Computer Scientists: source of problems for database research
• Publication in database papers
• Prototypes
Biologists: vehicle for rapid data analysis
• Publication in biology papers
• Immediate solutions
Page 2
Biological Data Management Problem
Effective data analysis involves combining data from multiple sources
• single data type: data generation & collection
• multiple data types: data association
in the context of inherently imprecise data
Page 3
Background: Microbial Genomes
© http://www.genomesonline.org
Jan 04: 532 microbial genome projects
WORLD 58%, TIGR 18%, JGI 13%, SANGER 11%
Mar 05: 847 microbial genome projects
WORLD 47%, JGI 20%, TIGR 15%, JCVI 10%, SANGER 8%
Applications: healthcare, environmental cleanup, agriculture, industrial processes, alternative energy production
Page 4
Microbial Genome Data Analysis Context
Page 5
Data Analysis Example: Occurrence Profiles
Key Challenges
o Representing abstract concepts with experimental data
o Specifying individual and composite operations
o Data coherence, completeness, integration
Figure: Pathway with reactions R4 (e4), R3 (e3), R4 (e2), R1 (e1); Genome X with genes x1 x2 x3 x4; Genome Y with genes y4 y3 y2 y1; unknowns marked "??"
Proteins from the same cellular pathway are expected to co-occur in the majority of organisms from a phylogenetic branch
Functionally related genes tend to cluster on the chromosome
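The co-occurrence expectation above can be illustrated with a small sketch. It compares presence/absence profiles of enzymes across genomes using the Jaccard index. The genome contents, the function names `occurrence_profile` and `jaccard`, and the choice of the Jaccard index are all assumptions made for illustration; the slides do not say how IMG scores co-occurrence.

```python
# Hypothetical toy data: which enzymes (e1..e4) are found in which genome.
genomes = {
    "G1": {"e1", "e2", "e3", "e4"},
    "G2": {"e1", "e2", "e3"},
    "G3": {"e4"},
    "G4": {"e1", "e2", "e3", "e4"},
}


def occurrence_profile(enzyme, genomes):
    """Presence/absence of one enzyme across all genomes, in dict order."""
    return tuple(enzyme in content for content in genomes.values())


def jaccard(profile_a, profile_b):
    """Share of genomes holding both items among genomes holding either."""
    both = sum(a and b for a, b in zip(profile_a, profile_b))
    either = sum(a or b for a, b in zip(profile_a, profile_b))
    return both / either if either else 0.0


p1 = occurrence_profile("e1", genomes)
p2 = occurrence_profile("e2", genomes)
p4 = occurrence_profile("e4", genomes)

print(jaccard(p1, p2))  # e1 and e2 always appear together
print(jaccard(p1, p4))  # e1 and e4 only partly overlap
```

In this toy data e1 and e2 have identical profiles, which is the pattern one would expect for proteins from the same pathway, while e4 diverges from them.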
Page 6
Microbial Genomes: Data Generation & Collection
Process
o Raw data
• Small DNA sequence fragments
• Assembled sequence fragments (contigs)
• Complete (one contiguous) sequence
o Interpreted data
• Gene prediction (models)
• Functional prediction (annotations)
• Expert data validation (cleaning)
• Expert annotations
Key Challenges
o Diversity of data sources
• Differences in models, depth/breadth of annotations
o Consistency of the data transformation process
Evolution & diversity of
• Technology platforms
• Algorithms & parameters
• Experimental, data collection conditions
Data Processing & Refinement
Page 7
Data Transformation Process Example
Microbial Genome Annotation Pipeline (ORNL)
Diagram steps: ORF Calling, Preliminary Functional Annotation, Post, Fetch
Diagram data: Sequence Data Files, Annotation Data Files
IMG Loading
Diagram steps: Load, Replace, Report
Microbial Genome Annotation Review & Correction (JGI)
Diagram sources: Reference Genes, NR, IMG
Diagram steps: Download Data For Review, Data Review, Data Cleansing, Final Review & Lock
Diagram data: Annotation Data Files, Revised Annotation Data Files
Page 8
Microbial Genomes: Data Association
Figure: Organisms, Predicted Genes, Functions
Key Challenges
o Data quality/precision for different types of data, sources
o Transience of identifiers, relationships
Page 9
Biological Data Management Problem Revisited
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data, while addressing
• Data quality
– Data semantics, precision, integrity, provenance
• System quality
– Comprehensibility, performance, reliability, scalability
• Development strategy
– Choice of technologies
– Devising (cost, time) effective solutions
Challenging in academic settings
Page 10
Needed: System Development Framework
Figure: system development framework, under time/cost constraints
Stages: Requirements Specification, Requirements Analysis, Data Model Abstraction, Design & Planning, Develop System*, Deploy System
Docs: Requirement Examples; Use Scenarios, Case Studies; Definitions; Plans & Schedules; Development Documents
Tools: Prototype Database, Tools; System
* System Development: Program, Test, Revise & Refine, Document; Preliminary Release, Final Release
Page 11
Requirement Analysis Example: IMG Data Analysis
Find “unique” genes in a genome of interest Ψ0 with respect to related genomes Ψ1, …, Ψk
Steps (iterated):
o Query construction
o Query results
o Collect genes of interest
o “Similar” gene analysis
o Chromosomal neighborhood analysis
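The "unique" gene requirement can be sketched as a set operation. The sketch assumes that "similar" genes have already been grouped into families, so a gene of Ψ0 counts as unique when no gene from Ψ1, …, Ψk belongs to its family. The `family_of` mapping, the gene names, and the function `unique_genes` are hypothetical; the slides do not specify how IMG decides similarity.

```python
def unique_genes(target_genes, related_genomes, family_of):
    """Genes of the target genome whose family occurs in no related genome."""
    related_families = {
        family_of[gene] for genome in related_genomes for gene in genome
    }
    return sorted(
        gene for gene in target_genes
        if family_of[gene] not in related_families
    )


# Hypothetical toy data: gene -> family of "similar" genes.
family_of = {
    "a1": "F1", "a2": "F2", "a3": "F3",  # genes of the genome of interest
    "b1": "F1", "b2": "F4",              # genes of related genome 1
    "c1": "F2",                          # genes of related genome 2
}
psi_0 = {"a1", "a2", "a3"}
psi_1 = {"b1", "b2"}
psi_2 = {"c1"}

print(unique_genes(psi_0, [psi_1, psi_2], family_of))
```

Adding or removing genomes from the related list and rerunning corresponds to the iteration step of the workflow: a narrower comparison set generally yields more "unique" genes.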
Page 12
Data Model Abstraction
Motivation
o Adds precision
o Allows reasoning in an established framework
• Analogies to traditional data domain
Biological data modeling
o Data warehouse concepts
• Proven technology for large scale biological data management applications
o Data structure
• Multidimensional data space
– Gene, genome, function/pathway
o Operations
• Multidimensional space selections, projections, aggregations
– Slice & dice, roll up, drill down… analogies
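The warehouse analogy can be made concrete with a minimal sketch that treats each (gene, genome, pathway) association as one cell of a three-dimensional data space. Slicing fixes a value on one dimension, and rolling up aggregates gene counts over the dimensions that are kept. The fact list and the helper names `slice_by` and `roll_up` are illustrative assumptions, not IMG's schema or API.

```python
from collections import Counter

# Hypothetical facts: (gene, genome, pathway) associations.
facts = [
    ("g1", "G1", "P1"),
    ("g2", "G1", "P1"),
    ("g3", "G1", "P2"),
    ("g4", "G2", "P1"),
    ("g5", "G2", "P2"),
    ("g6", "G2", "P2"),
]

DIMENSIONS = {"gene": 0, "genome": 1, "pathway": 2}


def slice_by(facts, genome=None, pathway=None):
    """Slice: keep only the facts matching the fixed dimension values."""
    return [
        fact for fact in facts
        if (genome is None or fact[1] == genome)
        and (pathway is None or fact[2] == pathway)
    ]


def roll_up(facts, dims):
    """Roll up: count facts per combination of the kept dimensions."""
    positions = [DIMENSIONS[name] for name in dims]
    return Counter(tuple(fact[i] for i in positions) for fact in facts)


print(slice_by(facts, genome="G1"))
print(roll_up(facts, ("genome", "pathway")))
print(roll_up(facts, ("pathway",)))
```

Going from the per-pathway counts back to the per-genome, per-pathway counts is the drill-down direction of the same operation.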
Page 14
Data Model Abstraction Example: IMG Operations
Figure: data space with dimensions Genes, Genomes, Functions/Pathways
o Gene occurrence profile across genomes
o Gene occurrence profiles across pathways
o Pathways shared by genomes
Example selection: genes “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5
     G1 G2 G3 G4 G5
g3   +  +  +  +  +
g2   +  +  -  +  +
g1   +  -  -  -  -
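A selection such as “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5 can be sketched as a filter over gene occurrence sets. The occurrence data below is hypothetical toy data, and `select_by_profile` is an illustrative name rather than an IMG function.

```python
# Hypothetical toy data: gene -> genomes in which the gene occurs.
occurrence = {
    "g1": {"G1"},
    "g2": {"G1", "G2", "G4", "G5"},
    "g3": {"G1", "G2", "G3", "G4", "G5"},
}


def select_by_profile(occurrence, required, excluded):
    """Genes found in every required genome and in no excluded genome."""
    return sorted(
        gene for gene, present in occurrence.items()
        if required <= present and not (excluded & present)
    )


# "in" G1, G2, G4, G5 and "not in" G3
print(select_by_profile(occurrence, {"G1", "G2", "G4", "G5"}, {"G3"}))
```

Genomes named in neither set are left unconstrained, so the same function also covers partial profiles.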
Page 15
Data Analysis Example: Searching for Unique Genes
Screenshot annotations:
• Parasite in horses
• Causes human disease in tropical areas (melioidosis)
Page 16
Identifying Unique Genes of Interest
Genes involved in adherence and invasion
Page 17
Exploring Unique Gene Details
Page 18
Summary
Needed: effective solutions for academic biological data management
o Employing appropriate technologies and methods
o Developed within (time, cost) constraints
IMG Case Study
o System development process framework essential for
• Continuously evolving content
– aiming at coherence, completeness
• Developing meaningful data analysis tools
• Clarity of methods, parameters, results
o Metric for success
• Community adoption and support
• Increase in analysis productivity and value