Trends in Genomics: An Engineer’s Perspective

Saul A. Kravitz, PhD
December 2009

Biggest Change: Sequencing is free

2000: Factory, AB3700 @ Celera
- 1k 500 bp reads/sequencer/day = 0.5 Mbp/day
- Human genome = ~190 sequencer-years, ~$200M

2002: Factory, AB3730 @ JCVI
- 10k 500 bp reads/sequencer/day = 5 Mbp/day
- Human genome = ~19 sequencer-years, ~$10M

2010: Benchtop, 454 GS Junior
- 70M 500 bp reads/day = 35 Gbp/day
- Human genome = ~1 sequencer-day, ~$10k

2010: Service, Complete Genomics
- Human genome = ~1 day, ~$1k
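The sequencer-time figures above follow from simple arithmetic. A minimal sketch, assuming the roughly 35 Gbp of raw sequence per human genome that the 2010 benchtop entry implies (one sequencer-day at 35 Gbp/day). The function and constant names are illustrative and do not come from the talk.

```python
# Raw bases needed per human genome. ASSUMPTION: derived from the 2010
# benchtop entry on the slide (1 sequencer-day at 35 Gbp/day).
BASES_PER_GENOME = 35e9


def daily_throughput_bp(reads_per_day: float, read_length_bp: float) -> float:
    """Bases produced by one sequencer in one day."""
    return reads_per_day * read_length_bp


def sequencer_years(reads_per_day: float, read_length_bp: float,
                    bases_needed: float = BASES_PER_GENOME) -> float:
    """Sequencer-years needed to produce `bases_needed` raw bases."""
    days = bases_needed / daily_throughput_bp(reads_per_day, read_length_bp)
    return days / 365.0


# Reads per sequencer per day as stated on the slide; all reads are 500 bp.
platforms = {
    "2000 AB3700": 1_000,
    "2002 AB3730": 10_000,
    "2010 GS Junior": 70_000_000,
}

for name, reads in platforms.items():
    years = sequencer_years(reads, 500)
    print(f"{name}: {daily_throughput_bp(reads, 500) / 1e6:,.1f} Mbp/day, "
          f"{years:,.3f} sequencer-years per genome")
```

Under that assumption the 2000 and 2002 platforms come out near 190 and 19 sequencer-years, matching the slide's rounded figures.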

New Bottlenecks

• Generating sequence data – free
• Data Management
• Data Query
• Data Analysis
- Breadth: Communities
- Depth: Populations (e.g., flu, human)
• Thinking is very pricey!

Same Thinking $, More Data

Figure: chart with axis label "Project Cost".

The Crux of the Problem

• Genomic data interpreted in context
- How does my genome compare to all others?
- Which other proteins are similar to mine?
• Size of context is growing exponentially
- Growth is faster than Moore’s law
• Hard to fight an exponential
- BLASTP against NCBI NR
- All-against-all BLASTP of microbial proteins
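Why an exponential is hard to fight can be sketched numerically. The doubling times used in the usage note are illustrative assumptions, not figures from the talk; the only claim taken from the slide is that the data grows faster than Moore's law. An all-against-all comparison scales with the number of sequence pairs, so it falls behind faster than a single query against a growing database.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs in an all-against-all comparison of n sequences."""
    return n * (n - 1) // 2


def relative_runtime(years: float, data_doubling_yr: float,
                     compute_doubling_yr: float, all_vs_all: bool = False) -> float:
    """Runtime after `years`, relative to today (1.0 means unchanged).

    Data and compute both grow exponentially with the given doubling times.
    A query against the database does work proportional to the data size;
    an all-against-all run does work roughly proportional to its square.
    """
    data_growth = 2.0 ** (years / data_doubling_yr)
    compute_growth = 2.0 ** (years / compute_doubling_yr)
    work_growth = data_growth ** 2 if all_vs_all else data_growth
    return work_growth / compute_growth
```

With hypothetical doubling times of 1 year for data and 1.5 years for compute, `relative_runtime(3, 1.0, 1.5)` gives a 2x slowdown for a single search after three years, and `relative_runtime(3, 1.0, 1.5, all_vs_all=True)` gives 16x.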

Bioinformatics Isn’t High Energy Physics

• Data inputs are changing rapidly
- CE Chromatograms, 454 Flowgrams, Color Space
• Error models and read lengths are changing rapidly
• Tools evolving rapidly
- Difficult to track many academic tools
- High quality commercial platforms emerge
• Even when “cooks” use shared “ingredients”, “recipes” vary widely
- Faith-based science
• My dataset alone has limited value
• Computations are (relatively) I/O intensive

Some Solutions and Directions

• Repeated processes must be automated
- Even if labor is free, deviations from SOP are costly
• Commercial Tools
- Market has expanded, quality improved
• Tools for exploring Human Variation
- The HuRef Browser
• Metagenomics Tools and Challenges
- Global Ocean Sampling Expedition
- Visualization tools
- Metagenomic Annotation
- Genomic Standards Consortium and M5
• Clouds and Grids
- ScaaS: Science as a Service

Personal Genomics: The future is now (ca 2008)

HuRef Browser: Accelerate thinking

• Compare 2 published genomes
- Craig Venter’s Diploid Genome
- Composite NCBI-36
• Are differences real?
- Noisy data?
- Assembly errors?
- Analysis errors?
• Methods development requires curation by biologists
• As genomes accumulate, the challenge becomes more acute

HuRef Browser: http://huref.jcvi.org

Figure: HuRef Browser view of a zinc finger protein, Chr19:57564487-57581356. Tracks: Assembly Structure, Variations, Transcript, Gene, Haplotype Blocks, NCBI-36, HuRef, Assembly-Assembly Mapping. Marked features: Homozygous SNP, Heterozygous SNP.

Figure: Protein Truncated by 476 bp Insertion. Tracks: Assembly Structure, Insertion.

Genomics vs Metagenomics

• Genomics – ‘Old School’
- Study of a single organism's genome
- Genome sequence determined using shotgun sequencing and assembly
- >1300 microbes sequenced, first in 1995 (at TIGR)
- DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
• Metagenomics
- Use genomics tricks on communities – no culturing
- Environmental shotgun sequencing of DNA or RNA
- Metadata provides context

Metagenomic Questions

• Within an environment
- What biological functions are present (absent)?
- What organisms are present (absent)?
• Compare data from (dis)similar environments
- What are the fundamental rules of microbial ecology?
• Adapting to environmental conditions?
- How do communities respond to stimuli?
- How does community structure change?
• Search for novel proteins and protein families
- And diversity within known families

Global Ocean Sampling Expedition

• 178 Total Sampling Locations
- Pilot: 2.0M reads (4/04)
- Phase 1: 7.7M reads, >6M proteins (3/07)
- Phase 2-IO: 2.2M reads (3/08)
- Phase 2: ~30M reads (2010?)
• Diverse Environments
- Open ocean, estuary, embayment, upwelling, fringing reef, atoll…

GOS: Sequence Diversity in the Ocean

Rusch et al. (PLoS Biology 2007)

• Most sequence reads are unique
• Very limited assembly
• Most sequences not taxonomically anchored
• Reference genomes a basis set? Not really.
- Several hundred isolates
• Challenges
- Relating shotgun data to reference genomes
- Structural and Functional Annotation

Browsing Large Data Collections: Fragment Recruitment Viewer

• Microbial Communities vs Reference Genomes
- Millions of sequence reads vs thousands of genomes
• Definition: A read is recruited to a sequence if:
- End-to-end blastN alignment exists
• Rapid Hypothesis Generation and Exploration
- How do cultured and wild-type genomes differ?
- Insertions, deletions, translocations
- Correlation with environmental factors
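The recruitment definition above can be sketched as a filter over BLAST tabular output. This is a minimal illustration, not the viewer's implementation: it assumes blastn was run with `-outfmt "6 std qlen"` (the twelve standard columns plus query length), and the function name `recruit_reads` and its `end_slack` and `min_identity` parameters are hypothetical. The published analysis may apply additional criteria.

```python
from typing import Iterable, List, Tuple


def recruit_reads(blast_lines: Iterable[str], end_slack: int = 0,
                  min_identity: float = 0.0) -> List[Tuple[str, str, int, float]]:
    """Return (read_id, genome_id, genome_position, percent_identity) tuples.

    A read counts as recruited when its alignment spans the read end to end,
    allowing `end_slack` unaligned bases at either end (an assumed tolerance).
    """
    recruited = []
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, genome_id = fields[0], fields[1]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        sstart, send = int(fields[8]), int(fields[9])
        qlen = int(fields[12])  # extra column requested via "6 std qlen"
        covers_read = qstart <= 1 + end_slack and qend >= qlen - end_slack
        if covers_read and pident >= min_identity:
            # Minus-strand hits report sstart > send, so take the leftmost one.
            recruited.append((read_id, genome_id, min(sstart, send), pident))
    return recruited


example = [
    "read1\tgenomeA\t98.5\t500\t7\t0\t1\t500\t1000\t1499\t1e-100\t900\t500",
    "read2\tgenomeA\t90.0\t300\t30\t0\t1\t300\t5000\t5299\t1e-50\t400\t500",
    "read3\tgenomeA\t75.0\t450\t112\t0\t1\t450\t8449\t8000\t1e-40\t300\t450",
]
hits = recruit_reads(example)
```

Each returned tuple is one point on a recruitment plot: genome position on one axis and percent identity on the other. In the example, `read2` aligns over only 300 of its 500 bases and is not recruited.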

Fragment Recruitment Viewer

Figure: fragment recruitment plot. Axes: Sequence Similarity vs Genomic Position.

Doug Rusch, JCVI

Doug Rusch and Michael Press

GOS Protein Analysis

Yooseph et al. (PLoS Biology 2007)

• Novel clustering process
- Sequence similarity based
- Predict putative proteins and group into related clusters
- Include GOS and all known proteins
• Findings
- GOS proteins cover ~all existing prokaryotic families
- GOS proteins expand diversity of known protein families
- ~10% of large clusters are novel
- Many are of viral origin
- No saturation in the rate of novel protein family discovery

Added Protein Family Diversity

Figure: Rubisco homologs, with new groups marked. Legend: GOS prokaryotes, known eukaryotes, known prokaryotes. Yooseph et al. (PLoS Biology 2007)

Annotation of Environmental Shotgun Data

• Challenges:
- Lack of context
- Protein fragments
• Gene Finding
- Yooseph’s Protein Clusters + MetaGene
• Functional Assignment
- Variation of JCVI prok annotation pipeline*
- Leverages protein cluster annotation (soon)
• Result:
- Quality nearly comparable to prokaryotic genomic annotation

Protein Clusters: Advantages and Disadvantages

• Weaknesses
- Homology-based
- Stateful (also a strength)
- Less sensitive (for now)
• Strengths
- Exponential → Linear?
- Learns over time
- Easy to maintain
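The "stateful" and "learns over time" points can be illustrated with a toy incremental clustering store. This is a hedged sketch, not the clustering method of Yooseph et al.: it swaps in greedy assignment to cluster representatives, with k-mer Jaccard similarity standing in for real sequence alignment. The class name `ClusterStore`, the threshold, and the k-mer size are all hypothetical. The point is only that each new protein is compared against one representative per cluster rather than against every sequence seen so far.

```python
def kmers(seq: str, k: int = 3) -> frozenset:
    """Set of overlapping k-mers in a sequence."""
    return frozenset(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ClusterStore:
    """Keeps cluster state between runs so new data is added incrementally."""

    def __init__(self, threshold: float = 0.5, k: int = 3):
        self.threshold = threshold
        self.k = k
        self.representatives = []  # one k-mer set per cluster
        self.members = []          # sequence ids per cluster

    def add(self, seq_id: str, seq: str) -> int:
        """Assign a sequence to the first similar cluster, or start a new one."""
        profile = kmers(seq, self.k)
        for idx, rep in enumerate(self.representatives):
            if jaccard(profile, rep) >= self.threshold:
                self.members[idx].append(seq_id)
                return idx
        self.representatives.append(profile)
        self.members.append([seq_id])
        return len(self.representatives) - 1


store = ClusterStore(threshold=0.5)
first = store.add("p1", "MKTAYIAKQRQISFVK")
second = store.add("p2", "MKTAYIAKQRQISFVR")   # differs from p1 in one residue
third = store.add("p3", "GGSGGSGGSGGSGGSG")    # unrelated toy sequence
```

Because assignment depends on which sequences arrived first, the result is order-dependent, which is one way to read "stateful" as both a weakness and a strength.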

Increasing the pressure

• Nextgen + Metagenomics
- Deeper collections
- Short sequences less informative
• How should we annotate?
- When in doubt, use BLAST against NRAA and other large, fast-growing collections
• Annotation needs growing dramatically
- 24x7 quality software
- Special hardware: FPGA? Graphics/CUDA? SIMD/SSE?
- New algorithms?
- Back to supercomputers?
• Sharing data and computes
- Standardization of data, metadata, and computes

Folker Meyer, ANL

Science as a Service (ScaaS)

• Standard tools as services
- Service-Oriented Architecture
- Supported by HPC as necessary
- Grid workflow for integration
• Maintain tools & data in scalable compute environment
- Celera Assembler in the clouds

Vision for High Throughput Science

Today: Scientist

Figure: Construction of the Ark. Nuremberg Chronicle (1493). http://en.wikipedia.org/wiki/Nuremberg_Chronicle

Vision for High Throughput Science

Scientist + Engineers

Figure: Rodin’s Thinker. http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html

Credits

• JCVI Informatics Team
• Support
- DOE
- Gordon and Betty Moore Foundation
- NIAID
