Post on 15-Jan-2016

Page 1

Integrated Microbial Genomes (IMG) System

Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto
Biological Data Management & Technology Center, Lawrence Berkeley National Lab

Nikos C. Kyrpides, Natalia N. Ivanova
Microbial Genome Analysis Program, Joint Genome Institute

A Case Study in Biological Data Management

Different views on biological data management (VLDB 2004 Panel on Biological Data Management)

Computer Scientists: Source of problems for database research
• Publication in database papers
• Prototypes

Biologists: Vehicle for rapid data analysis
• Publication in biology papers
• Immediate solutions

Page 2

Biological Data Management Problem

Effective data analysis involves combining data from multiple sources
• single data type: data generation & collection
• multiple data types: data association
in the context of inherently imprecise data

Page 3

Background: Microbial Genomes

Jan 04: 532 microbial genome projects
• World 58%
• TIGR 18%
• JGI 13%
• Sanger 11%

Mar 05: 847 microbial genome projects
• World 47%
• JGI 20%
• TIGR 15%
• JCVI 10%
• Sanger 8%

Source: © http://www.genomesonline.org

Applications: Healthcare, environmental cleanup, agriculture, industrial processes, alternative energy production

Page 4

Microbial Genome Data Analysis Context

Page 5

Data Analysis Example: Occurrence Profiles

Key Challenges
o Representing abstract concepts with experimental data
o Specifying individual and composite operations
o Data coherence, completeness, integration

[Figure: genes x1 x2 x3 x4 of Genome X and genes y1 y2 y3 y4 of Genome Y, related to the pathway reactions R1 (e1), R4 (e2), R3 (e3), R4 (e4); some associations are marked "??"]

Proteins from the same cellular pathway are expected to co-occur in the majority of organisms from a phylogenetic branch.

Functionally related genes tend to cluster on the chromosome.
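As a rough illustration of the co-occurrence expectation on this slide, the sketch below scores how often two protein families appear together across a set of genomes. The genome and family names are hypothetical, and the Jaccard score is only one plausible measure chosen for illustration; the slides do not say how IMG scores co-occurrence.

```python
from itertools import combinations

# Hypothetical presence data: genome -> set of protein families found in it.
presence = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2"},
    "genome_C": {"fam1", "fam2", "fam4"},
    "genome_D": {"fam3"},
}

def occurrence_profile(family, presence):
    """Return the set of genomes in which the family occurs."""
    return {genome for genome, fams in presence.items() if family in fams}

def cooccurrence(fam_a, fam_b, presence):
    """Jaccard similarity of two occurrence profiles (0.0 if both are empty)."""
    a = occurrence_profile(fam_a, presence)
    b = occurrence_profile(fam_b, presence)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Score every pair of families seen in any genome.
families = sorted(set().union(*presence.values()))
scores = {
    (a, b): cooccurrence(a, b, presence)
    for a, b in combinations(families, 2)
}
```

Pairs with a score close to 1 occur in nearly the same genomes, which is the pattern expected of proteins from the same pathway.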

Page 6

Microbial Genomes: Data Generation & Collection

Process
o Raw data
  • Small DNA sequence fragments
  • Assembled sequence fragments (contigs)
  • Complete (one contiguous) sequence
o Interpreted data
  • Gene prediction (models)
  • Functional prediction (annotations)

  • Expert data validation (cleaning)
  • Expert annotations

Key Challenges
o Diversity of data sources
  • Differences in models, depth/breadth of annotations
o Consistency of the data transformation process

Evolution & diversity of
• Technology platforms
• Algorithms & parameters
• Experimental, data collection conditions

[Figure label: Data Processing & Refinement]

Page 7

Data Transformation Process Example

[Figure: data flow across three components. Labels shown in each:]

Microbial Genome Annotation Pipeline (ORNL)
• ORF Calling
• Preliminary Functional Annotation
• Post
• Fetch
• Sequence Data Files
• Annotation Data Files

IMG Loading
• IMG
• Load
• Report
• Replace

Microbial Genome Annotation Review & Correction (JGI)
• Reference Genes (NR, IMG)
• Download Data For Review
• Download
• Annotation Data Files
• Data Review
• Data Cleansing
• Final Review & Lock
• Revised Annotation Data Files

Page 8

Microbial Genomes: Data Association

[Figure: associations among Organisms, Functions, and Predicted Genes]

Key Challenges
o Data quality/precision for different types of data, sources
o Transience of identifiers, relationships

Page 9

Biological Data Management Problem Revisited

Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data, while addressing
• Data quality
  – Data semantics, precision, integrity, provenance
• System quality
  – Comprehensibility, performance, reliability, scalability
• Development strategy
  – Choice of technologies
  – Devising (cost, time) effective solutions

Challenging in academic settings

Page 10

Needed: System Development Framework

[Figure: system development framework, organized in rows labeled Stages, Docs, and Tools, under Time/Cost Constraints. Labels shown:]

Stages
• Requirements Specification
• Requirements Analysis
• Data Model Abstraction
• Design & Planning
• Develop System*
• Deploy System

Docs
• Requirement Examples
• Use Scenarios, Case Studies
• Definitions
• Plans & Schedules
• Development Documents

Tools
• Prototype Database, Tools
• System

* System Development
• Program
• Test
• Revise & Refine
• Document
• Preliminary Release
• Final Release

Page 11

Requirement Analysis Example: IMG Data Analysis

Find “unique” genes in a genome of interest Ψ0 wrt related genomes: Ψ1, …, Ψk

[Figure: analysis workflow with the steps below, marked "Iterate"]
• Query construction
• Query results
• Collect genes of interest
• “Similar” gene analysis
• Chromosomal neighborhood analysis
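The "unique" gene requirement can be sketched as a set difference over similarity hits. This is a minimal sketch under stated assumptions: the gene and genome names are hypothetical, and `hits` stands in for whatever "similar" gene analysis is used (the slide does not specify a similarity method or cutoff).

```python
def unique_genes(target_genes, hits, related_genomes):
    """Genes of the target genome with no similar gene in any related genome.

    target_genes: genes of the genome of interest (Psi_0).
    hits: hypothetical mapping from a target gene to the set of genomes
          that contain a gene judged "similar" to it.
    related_genomes: the genomes Psi_1 .. Psi_k to compare against.
    """
    related = set(related_genomes)
    return sorted(
        gene for gene in target_genes
        if not (hits.get(gene, set()) & related)
    )

# Illustrative data only.
target_genes = ["gene1", "gene2", "gene3", "gene4"]
hits = {
    "gene1": {"psi1", "psi2"},
    "gene2": {"psi2"},
    "gene4": {"other"},  # similar gene exists, but not in a related genome
}
result = unique_genes(target_genes, hits, ["psi1", "psi2"])
```

Changing the list of related genomes and rerunning corresponds to the "Iterate" step: the candidate set shrinks or grows as genomes are added or removed.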

Page 12

Data Model Abstraction

Motivation
o Adds precision
o Allows reasoning in an established framework
  • Analogies to traditional data domain

Biological data modeling
o Data warehouse concepts
  • Proven technology for large scale biological data management applications
o Data Structure
  • Multidimensional data space
    – Gene, genome, function/pathway
o Operations
  • Multidimensional space selections, projections, aggregations
    – Slice & dice, roll up, drill down… analogies
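To make the data warehouse analogy concrete, the following sketch treats each (gene, genome, pathway) association as a point in a three-dimensional space and applies slice, roll-up, and drill-down to it. The tuple layout, names, and data are assumptions for illustration and do not reflect the actual IMG schema.

```python
from collections import Counter

# Hypothetical fact tuples: (gene, genome, pathway).
facts = [
    ("g1", "G1", "glycolysis"),
    ("g2", "G1", "glycolysis"),
    ("g3", "G1", "TCA cycle"),
    ("g4", "G2", "glycolysis"),
    ("g5", "G2", "TCA cycle"),
]

def slice_by_genome(facts, genome):
    """Slice: fix the genome dimension and keep the matching facts."""
    return [fact for fact in facts if fact[1] == genome]

def roll_up(facts):
    """Roll up: aggregate individual genes into counts per (genome, pathway)."""
    return Counter((genome, pathway) for _, genome, pathway in facts)

def drill_down(facts, genome, pathway):
    """Drill down: list the genes behind one (genome, pathway) cell."""
    return sorted(
        gene for gene, g, p in facts if g == genome and p == pathway
    )
```

For example, `roll_up(facts)` gives a per-genome pathway summary, and `drill_down(facts, "G1", "glycolysis")` returns the genes that produced one of its counts.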

Page 14

Data Model Abstraction Example: IMG Operations

[Figure: data space with dimensions Genes, Functions/Pathways, Genomes. Operations shown:]
• Gene occurrence profile across genomes
• Gene occurrence profiles across pathways
• Pathways shared by genomes

Genes
• “in” G1
• “in” G2
• “not in” G3
• “in” G4
• “in” G5

     G1  G2  G3  G4  G5
g3   +   +   +   +   +
g2   +   +   -   +   +
g1   +   -   -   -   -
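A selection such as “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5 can be sketched as a filter over per-gene occurrence sets. The gene names and occurrence data below are hypothetical, and the function is an illustrative assumption rather than an IMG operation.

```python
# Hypothetical occurrence data: gene -> set of genomes in which it occurs.
occurrence = {
    "gA": {"G1", "G2", "G3", "G4", "G5"},
    "gB": {"G1", "G2", "G4", "G5"},
    "gC": {"G1"},
}

def match_profile(occurrence, present_in, absent_from):
    """Select genes that occur in every required genome and in no excluded one."""
    required = set(present_in)
    excluded = set(absent_from)
    return sorted(
        gene for gene, genomes in occurrence.items()
        if required <= genomes and not (excluded & genomes)
    )

# Genes "in" G1, G2, G4, G5 and "not in" G3.
selected = match_profile(occurrence, ["G1", "G2", "G4", "G5"], ["G3"])
```

Genomes named in neither list are unconstrained, so the same function also covers partial profiles.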

Page 15

Data Analysis Example: Searching for Unique Genes

parasite in horses

Causes human disease in tropical areas (melioidosis)

Page 16

Identifying Unique Genes of Interest

Genes involved in adherence and invasion

Page 17

Exploring Unique Gene Details

Page 18

Summary

Needed
Effective solutions for academic biological data management
o Employing appropriate technologies and methods
o Developed within (time, cost) constraints

IMG Case Study
o System development process framework essential for
  • Continuously evolving content
    – aiming at coherence, completeness
  • Developing meaningful data analysis tools
  • Clarity of methods, parameters, results
o Metric for success
  • Community adoption and support
  • Increase in analysis productivity and value