Trends in Genomics: An Engineer’s Perspective Saul A. Kravitz, PhD December 2009

Upload: saul-kravitz

Post on 26-Jan-2015


DESCRIPTION

The flood of nextgen sequencing data is changing the landscape of computational biology, pushing the need for more robust infrastructures, tools, and visualization techniques.

TRANSCRIPT

Page 1: Trends In Genomics

Trends in Genomics: An Engineer’s Perspective

Saul A. Kravitz, PhD
December 2009

Page 2: Trends In Genomics

Biggest Change: Sequencing is free

• 2010: Benchtop, 454 GS Junior
- 70M 500 bp reads/day = 35 Gbp/day
- Human genome = ~1 sequencer day, ~$10k
• 2002: Factory, AB3730 @ JCVI
- 10k 500 bp reads/sequencer/day = 5 Mbp/day
- Human genome = ~19 sequencer years, ~$10M
• 2000: Factory, AB3700 @ Celera
- 1k 500 bp reads/sequencer/day = 0.5 Mbp/day
- Human genome = ~190 sequencer years, ~$200M
• 2010: Service, Complete Genomics
- Human genome = ~1 day, ~$1k
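The sequencer-time figures on this slide follow from simple arithmetic: reads per day times read length gives daily throughput, and dividing a target amount of raw sequence by that throughput gives instrument time. The sketch below is only a sanity check under one assumption: the target of roughly 35 Gbp of raw reads per human genome is inferred from the slide's own numbers (35 Gbp/day for about one sequencer day) and is not stated in the talk. Function and variable names are illustrative.

```python
# Back-of-the-envelope check of the throughput figures quoted on the slide.
# ASSUMPTION: ~35 Gbp of raw reads per human genome, inferred from the
# "35 Gbp/day, ~1 sequencer day" line; all names here are illustrative.

TARGET_BP = 35e9  # assumed raw sequence needed for one human genome


def daily_throughput_bp(reads_per_day: float, read_length_bp: float) -> float:
    """Bases produced by one sequencer in one day."""
    return reads_per_day * read_length_bp


def sequencer_years(reads_per_day: float, read_length_bp: float,
                    target_bp: float = TARGET_BP) -> float:
    """Sequencer-years needed to produce target_bp on one instrument."""
    days = target_bp / daily_throughput_bp(reads_per_day, read_length_bp)
    return days / 365.0


# (reads per sequencer per day, read length in bp), as quoted on the slide.
platforms = {
    "2000 AB3700": (1e3, 500),
    "2002 AB3730": (1e4, 500),
    "2010 benchtop": (7e7, 500),
}

for name, (reads, length) in platforms.items():
    print(f"{name}: {daily_throughput_bp(reads, length):.3g} bp/day, "
          f"{sequencer_years(reads, length):.3g} sequencer-years")
```

Under that assumed target, the three rows come out near 190 sequencer-years, 19 sequencer-years, and one sequencer-day, which matches the orders of magnitude on the slide.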

Page 3: Trends In Genomics

New Bottlenecks

• Generating sequence data – free
• Data Management
• Data Query
• Data Analysis
- Breadth: Communities
- Depth: Populations (e.g., flu, human)
• Thinking is very pricy!

Page 4: Trends In Genomics

Same Thinking $, More Data

[Figure: chart; vertical axis: Project Cost]

Page 5: Trends In Genomics

The Crux of the Problem

• Genomic data interpreted in context
- How does my genome compare to all others
- Which other proteins are similar to mine
• Size of context is growing exponentially
- Growth is faster than Moore’s law
• Hard to fight an exponential
- BLASTP against NCBI NR
- All against all BLASTP of microbial proteins
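A small worked example may help show why an exponential is hard to fight. If the database grows faster than compute, a single query against the whole database gets slower every year, and an all-against-all comparison, whose work grows with the square of the database size, gets slower much faster. The doubling times below (12 months for data, 18 months for compute) are illustrative assumptions, not figures from the talk.

```python
# Relative cost of database searches when data outgrows compute.
# ASSUMPTIONS (illustrative, not from the talk): the database doubles every
# 12 months and compute per dollar doubles every 18 months.

def growth(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` for a given doubling time."""
    return 2.0 ** (years * 12.0 / doubling_months)


def relative_search_time(years: float, data_doubling: float = 12.0,
                         compute_doubling: float = 18.0) -> float:
    """One query against the whole database: work scales with database size."""
    return growth(years, data_doubling) / growth(years, compute_doubling)


def relative_all_vs_all_time(years: float, data_doubling: float = 12.0,
                             compute_doubling: float = 18.0) -> float:
    """All-against-all comparison: work scales with database size squared."""
    return growth(years, data_doubling) ** 2 / growth(years, compute_doubling)


for y in (0, 3, 6):
    print(y, round(relative_search_time(y), 1),
          round(relative_all_vs_all_time(y), 1))
```

With these assumed rates, a single search takes twice as long after three years despite faster hardware, while the all-against-all job takes sixteen times as long.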

Page 6: Trends In Genomics

Bioinformatics Isn’t High Energy Physics

• Data inputs are changing rapidly
- CE Chromatograms, 454 Flowgrams, Color Space
• Error models and read lengths are changing rapidly
• Tools evolving rapidly
- Difficult to track many academic tools
- High quality commercial platforms emerge
• Even when “cooks” use shared “ingredients”, “recipes” vary widely
- Faith based science
• My dataset alone has limited value
• Computations are (relatively) IO Intensive

Page 7: Trends In Genomics

Some Solutions and Directions

• Repeated processes must be automated
- Even if labor is free, deviations from SOP are costly
• Commercial Tools
- Market has expanded, quality improved
• Tools for exploring Human Variation
- The HuRef Browser
• Metagenomics Tools and Challenges
- Global Ocean Sampling Expedition
- Visualization tools
- Metagenomic Annotation
- Genome Standards Consortium and M5
• Clouds and Grids
- ScaaS: Science as a Service

Page 8: Trends In Genomics

Personal Genomics: The future is now (ca 2008)

Page 9: Trends In Genomics

HuRef Browser: Accelerate thinking

• Compare 2 published genomes
- Craig Venter’s Diploid Genome
- Composite NCBI-36
• Are differences real?
- Noisy data?
- Assembly errors?
- Analysis errors?

• Methods development requires curation by biologists

• As genomes accumulate, more acute challenge

Page 10: Trends In Genomics

HuRef Browser: http://huref.jcvi.org

Page 11: Trends In Genomics

Zinc Finger Protein, Chr19:57564487-57581356

[Figure: HuRef Browser view; tracks: Assembly Structure, Variations, Transcript, Gene, Haplotype Blocks, NCBI-36, HuRef, Assembly-Assembly Mapping]

Page 12: Trends In Genomics

Protein Truncated by 476 bp Insertion

[Figure: HuRef Browser view; labels: Homozygous SNP, Heterozygous SNP, Insertion]

Page 13: Trends In Genomics

Assembly Structure

Insertion

Page 14: Trends In Genomics

Genomics vs Metagenomics

• Genomics – ‘Old School’
- Study of a single organism's genome
- Genome sequence determined using shotgun sequencing and assembly
- >1300 microbes sequenced, first in 1995 (at TIGR)
- DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
• Metagenomics
- Use genomics tricks on communities – no culturing
- Environmental shotgun sequencing of DNA or RNA
- Metadata provides context

Page 15: Trends In Genomics

Metagenomic Questions

• Within an environment
- What biological functions are present (absent)?
- What organisms are present (absent)?
• Compare data from (dis)similar environments
- What are the fundamental rules of microbial ecology?
• Adapting to environmental conditions?
- How do communities respond to stimuli?
- How does community structure change?
• Search for novel proteins and protein families
- And diversity within known families

Page 16: Trends In Genomics

Global Ocean Sampling Expedition

Page 17: Trends In Genomics

Global Ocean Sampling Expedition

• 178 Total Sampling Locations
- Pilot: 2.0M reads, 4/04
- Phase 1: 7.7M reads, >6M proteins, 3/07
- Phase 2-IO: 2.2M reads, 3/08
- Phase 2: ~30M reads, 2010?
• Diverse Environments
- Open ocean, estuary, embayment, upwelling, fringing reef, atoll…

Page 18: Trends In Genomics

GOS: Sequence Diversity in the Ocean

Rusch et al. (PLoS Biology 2007)

• Most sequence reads are unique
• Very limited assembly
• Most sequences not taxonomically anchored
• Reference genomes a basis set? Not really.
- Several hundred isolates
• Challenges
- Relating shotgun data to reference genomes
- Structural and Functional Annotation

Page 19: Trends In Genomics

Browsing Large Data Collections: Fragment Recruitment Viewer

• Microbial Communities vs Reference Genomes
- Millions of sequence reads vs Thousands of genomes
• Definition: A read is recruited to a sequence if:
- End-to-end blastN alignment exists
• Rapid Hypothesis Generation and Exploration
- How do cultured and wildtype genomes differ?
- Insertions, deletions, translocations
- Correlation with environmental factors
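The recruitment rule above can be sketched in a few lines. The example assumes each alignment is already available as a record with the read length and the aligned span on the read; the `Hit` fields, the `slack` tolerance, and the sample values are hypothetical and only illustrate the "end-to-end" test, not the viewer's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One read-vs-reference alignment (field names are illustrative)."""
    read_id: str
    read_length: int
    q_start: int         # 1-based start of the alignment on the read
    q_end: int           # 1-based inclusive end of the alignment on the read
    ref_start: int       # position of the alignment on the reference genome
    pct_identity: float  # percent identity of the alignment


def is_end_to_end(hit: Hit, slack: int = 0) -> bool:
    """True if the alignment spans the whole read, allowing `slack` unaligned
    bases at each end (slack is an assumed tolerance, not from the talk)."""
    return hit.q_start <= 1 + slack and hit.q_end >= hit.read_length - slack


def recruit(hits: list[Hit], slack: int = 0) -> list[tuple[int, float]]:
    """Return (genomic position, % identity) points for recruited reads."""
    return [(h.ref_start, h.pct_identity)
            for h in hits if is_end_to_end(h, slack)]


# Made-up example data.
hits = [
    Hit("r1", 500, 1, 500, 10_000, 98.2),   # full-length alignment
    Hit("r2", 500, 40, 500, 52_000, 91.0),  # first 39 bases unaligned
    Hit("r3", 480, 3, 479, 75_500, 76.5),   # nearly full length
]
print(recruit(hits))           # strict rule: only r1 is recruited
print(recruit(hits, slack=5))  # with 5 bases of slack: r1 and r3
```

Each returned point pairs a genomic position with a similarity value, so the output can be plotted directly to look for gaps and identity shifts along a reference genome.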

Page 20: Trends In Genomics

Fragment Recruitment Viewer

[Figure: recruitment plot; vertical axis: Sequence Similarity; horizontal axis: Genomic Position. Credit: Doug Rusch, JCVI]

Page 21: Trends In Genomics

Doug Rusch and Michael Press

Page 22: Trends In Genomics

Doug Rusch and Michael Press

Page 23: Trends In Genomics

GOS Protein Analysis

Yooseph et al. (PLoS Biology 2007)

• Novel clustering process
- Sequence similarity based
- Predict putative proteins and group into related clusters
- Include GOS and all known proteins
• Findings
- GOS proteins cover ~all existing prokaryotic families and expand diversity of known protein families
- ~10% of large clusters are novel; many are of viral origin
- No saturation in the rate of novel protein family discovery
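One way to picture similarity-based clustering is to treat each pairwise similarity hit as a link and group every protein reachable through links. The sketch below uses single-linkage grouping with a union-find structure. It is a generic stand-in, not the clustering procedure of Yooseph et al., and the protein IDs and similarity pairs are made up.

```python
# Single-linkage grouping of proteins from pairwise similarity hits.
# ASSUMPTION: this union-find sketch is a generic illustration, not the
# clustering method used in the GOS analysis; IDs and pairs are made up.

def cluster(proteins, similar_pairs):
    """Group proteins connected by any chain of similarity hits."""
    parent = {p: p for p in proteins}

    def find(p):
        # Walk to the root, halving the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), []).append(p)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(g) for g in groups.values())


proteins = ["known_1", "known_2", "gos_1", "gos_2", "gos_3"]
pairs = [("known_1", "gos_1"), ("gos_1", "known_2"), ("gos_2", "gos_3")]

clusters = cluster(proteins, pairs)
print(clusters)

# A cluster with no previously known member is a candidate novel family.
novel = [c for c in clusters if all(p.startswith("gos_") for p in c)]
print(novel)
```

In this toy input, one cluster mixes known and GOS proteins (added diversity within a known family) and one contains only GOS proteins (a candidate novel family).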

Page 24: Trends In Genomics

Added Protein Family Diversity

Yooseph et al. (PLoS Biology 2007)

[Figure: Rubisco homologs; legend: New Groups, GOS prokaryotes, Known eukaryotes, Known prokaryotes]

Page 25: Trends In Genomics

Annotation of Environmental Shotgun Data

• Challenges:
- Lack of context
- Protein fragments
• Gene Finding
- Yooseph’s Protein Clusters + Metagene
• Functional Assignment
- Variation of JCVI prok annotation pipeline*
- Leverages protein cluster annotation (soon)
• Result:
- Quality Nearly Comparable to Prokaryotic Genomic Annotation

Page 26: Trends In Genomics

Protein Clusters: Advantages and Disadvantages

• Weaknesses
- Homology-based
- Stateful (also a strength)
- Less sensitive (for now)
• Strengths
- Exponential → Linear?
- Learns over time
- Easy to maintain

Page 27: Trends In Genomics

Increasing the pressure

• Nextgen + Metagenomics
- Deeper collections
- Short sequences less informative
• How should we annotate?
- When in doubt, use BLAST against NRAA, and other large and fast-growing collections
• Annotation needs growing dramatically
- 24x7 quality software
- Special Hardware: FPGA? Graphics/CUDA? SIMD/SSE?
- New algorithms?
- Back to supercomputers?
• Sharing data and computes
- Standardization of data, metadata, and computes

Folker Meyer, ANL

Page 28: Trends In Genomics

Science as a Service (ScaaS)

• Standard tools as services
- Service-Oriented Architecture
- Supported by HPC as necessary
- Grid workflow for integration
• Maintain tools & data in scalable compute environment
- Celera Assembler in the clouds

Page 29: Trends In Genomics

Vision for High Throughput Science

Construction of the Ark. Nuremberg Chronicle (1493).

Today:

Scientist

Page 30: Trends In Genomics

Vision for High Throughput Science

Scientist + Engineers

Image credits: http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html; Rodin’s Thinker

Page 31: Trends In Genomics

Credits

• JCVI Informatics Team
• Support
- DOE
- Gordon and Betty Moore Foundation
- NIAID