trends in genomics
DESCRIPTION
The flood of nextgen sequencing data is changing the landscape of computational biology, pushing the need for more robust infrastructures, tools, and visualization techniques.
TRANSCRIPT
Trends in Genomics: An Engineer’s Perspective
Saul A. Kravitz, PhD
December 2009
Biggest Change: Sequencing is free
2010: Benchtop, 454 GS Junior
- 70M 500bp reads/day = 35Gbp/day
- Human genome = ~1 sequencer day, ~10k$
2002: Factory, AB3730 @ JCVI
- 10k 500bp reads/sequencer/day = 5Mbp/day
- Human genome = ~19 sequencer yr, ~10M$
2000: Factory, AB3700 @ Celera
- 1k 500bp reads/sequencer/day = 0.5Mbp/day
- Human genome = ~190 sequencer yr, ~200M$
2010: Service, Complete Genomics
- Human genome = ~1 day, ~1k$
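The sequencer-time figures above follow from simple throughput arithmetic. The sketch below reproduces them; the reads-per-day values come from the slide, while the ~35 Gbp target per human genome is an assumption inferred from "1 sequencer day" at 35Gbp/day, and the function name is illustrative only.

```python
# Throughput arithmetic behind the "sequencer time per human genome" figures.
# READ_LENGTH_BP and the reads/day values are from the slide; TARGET_BP is an
# assumption inferred from "1 sequencer day" at 35 Gbp/day.
READ_LENGTH_BP = 500
TARGET_BP = 35e9

PLATFORMS = {
    "AB3700 @ Celera (2000)": 1_000,       # reads per sequencer per day
    "AB3730 @ JCVI (2002)": 10_000,
    "454 GS Junior (2010)": 70_000_000,
}


def sequencer_days(reads_per_day, read_length_bp=READ_LENGTH_BP, target_bp=TARGET_BP):
    """Days one sequencer needs to produce target_bp of raw sequence."""
    return target_bp / (reads_per_day * read_length_bp)


if __name__ == "__main__":
    for name, reads_per_day in PLATFORMS.items():
        days = sequencer_days(reads_per_day)
        print(f"{name}: {days:,.0f} sequencer days (~{days / 365:.1f} sequencer years)")
```

Under these assumptions the AB3700 and AB3730 come out at roughly 190 and 19 sequencer years, matching the slide.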
New Bottlenecks
• Generating sequence data – free
• Data Management
• Data Query
• Data Analysis
- Breadth: Communities
- Depth: Populations (e.g., flu, human)
• Thinking is very pricy!
Same Thinking $, More Data
[Figure: axis label "Project Cost"]
The Crux of the Problem
• Genomic data interpreted in context
- How does my genome compare to all others
- Which other proteins are similar to mine
• Size of context is growing exponentially
- Growth is faster than Moore’s law
• Hard to fight an exponential
- BLASTP against NCBI NR
- All against all BLASTP of microbial proteins
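A rough way to see why the exponential is hard to fight: if the database doubles faster than compute does, wall time for the same analysis keeps growing, and an all-against-all comparison grows with the square of the database size. The sketch below is illustrative only; the doubling times are hypothetical parameters, not measured values, and the function name is an assumption.

```python
def relative_wall_time(years, db_doubling_years, compute_doubling_years, all_vs_all=False):
    """Wall time after `years`, relative to today, for a fixed analysis.

    Work scales linearly with database size for a one-query search and
    quadratically for an all-against-all comparison. Both doubling times
    are hypothetical inputs supplied by the caller.
    """
    db_growth = 2 ** (years / db_doubling_years)
    work = db_growth ** 2 if all_vs_all else db_growth
    compute_growth = 2 ** (years / compute_doubling_years)
    return work / compute_growth


if __name__ == "__main__":
    # Assumed for illustration: database doubles yearly, compute every 18 months.
    for years in (1, 3, 5):
        search = relative_wall_time(years, 1.0, 1.5)
        pairwise = relative_wall_time(years, 1.0, 1.5, all_vs_all=True)
        print(f"after {years} yr: search x{search:.1f}, all-vs-all x{pairwise:.1f}")
```

With these assumed rates, a single search gets slower over time despite faster hardware, and the all-against-all case degrades much faster.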
Bioinformatics Isn’t High Energy Physics
• Data inputs are changing rapidly
- CE Chromatograms, 454 Flowgrams, Color Space
• Error models and read lengths are changing rapidly
• Tools evolving rapidly
- Difficult to track many academic tools
- High quality commercial platforms emerge
• Even when “cooks” use shared “ingredients”, “recipes” vary widely
- Faith based science
• My dataset alone has limited value
• Computations are (relatively) IO Intensive
Some Solutions and Directions
• Repeated processes must be automated
- Even if labor is free, deviations from SOP are costly
• Commercial Tools
- Market has expanded, quality improved
• Tools for exploring Human Variation
- The HuRef Browser
• Metagenomics Tools and Challenges
- Global Ocean Sampling Expedition
- Visualization tools
- Metagenomic Annotation
- Genome Standards Consortium and M5
• Clouds and Grids
- ScaaS: Science as a Service
Personal Genomics: The future is now (ca 2008)
HuRef Browser: Accelerate thinking
• Compare 2 published genomes
- Craig Venter’s Diploid Genome
- Composite NCBI-36
• Are differences real?
- Noisy data?
- Assembly errors?
- Analysis errors?
• Methods development requires curation by biologists
• As genomes accumulate, more acute challenge
HuRef Browser: http://huref.jcvi.org
Zinc Finger Protein
Chr19:57564487-57581356
[Figure: HuRef Browser view. Tracks and labels: Assembly Structure, Variations, Transcript, Gene, Haplotype Blocks, NCBI-36, HuRef, Assembly-Assembly Mapping, Homozygous SNP, Heterozygous SNP]
Protein Truncated by 476 bp Insertion
[Figure labels: Insertion, Assembly Structure]
Genomics vs Metagenomics
• Genomics – ‘Old School’
- Study of a single organism's genome
- Genome sequence determined using shotgun sequencing and assembly
- >1300 microbes sequenced, first in 1995 (at TIGR)
- DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
• Metagenomics
- Use genomics tricks on communities – no culturing
- Environmental shotgun sequencing of DNA or RNA
- Metadata provides context
Metagenomic Questions
• Within an environment
- What biological functions are present (absent)?
- What organisms are present (absent)?
• Compare data from (dis)similar environments
- What are the fundamental rules of microbial ecology?
• Adapting to environmental conditions?
- How do communities respond to stimuli?
- How does community structure change?
• Search for novel proteins and protein families
- And diversity within known families
Global Ocean Sampling Expedition
• 178 Total Sampling Locations
- Pilot: 2.0M reads 4/04
- Phase 1: 7.7M reads, >6M proteins 3/07
- Phase 2-IO: 2.2M reads 3/08
- Phase 2: ~30M reads 2010?
• Diverse Environments
- Open ocean, estuary, embayment, upwelling, fringing reef, atoll…
GOS: Sequence Diversity in the Ocean
Rusch et al. (PLoS Biology 2007)
• Most sequence reads are unique
• Very limited assembly
• Most sequences not taxonomically anchored
• Reference genomes a basis set? Not really.
- Several hundred isolates
• Challenges
- Relating shotgun data to reference genomes
- Structural and Functional Annotation
Browsing Large Data Collections: Fragment Recruitment Viewer
• Microbial Communities vs Reference Genomes
- Millions of sequence reads vs Thousands of genomes
• Definition: A read is recruited to a sequence if:
- End-to-end blastN alignment exists
• Rapid Hypothesis Generation and Exploration
- How do cultured and wildtype genomes differ?
- Insertions, deletions, translocations
- Correlation with environmental factors
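The recruitment definition above can be sketched as a filter over alignment hits. This is a minimal illustration, not the viewer's actual implementation: the `Hit` record, its field names, and the `max_unaligned_bp` tolerance are assumptions, and hits are taken as already parsed from an aligner's output with 1-based inclusive query coordinates where `q_start <= q_end`.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One read-vs-reference alignment (hypothetical record layout)."""
    read_id: str
    ref_id: str
    percent_identity: float
    q_start: int      # 1-based start of the alignment on the read
    q_end: int        # 1-based inclusive end of the alignment on the read
    read_length: int


def is_recruited(hit, max_unaligned_bp=0):
    """True if the alignment spans the read end to end.

    max_unaligned_bp is an assumed tolerance for unaligned bases at the
    read ends; 0 means a strict end-to-end alignment.
    """
    unaligned = (hit.q_start - 1) + (hit.read_length - hit.q_end)
    return unaligned <= max_unaligned_bp


def recruit(hits, max_unaligned_bp=0):
    """Group recruited reads by reference as (read_id, percent_identity)."""
    recruited = {}
    for hit in hits:
        if is_recruited(hit, max_unaligned_bp):
            recruited.setdefault(hit.ref_id, []).append((hit.read_id, hit.percent_identity))
    return recruited


if __name__ == "__main__":
    hits = [
        Hit("read1", "genomeA", 98.5, 1, 800, 800),   # full-length alignment
        Hit("read2", "genomeA", 91.0, 150, 800, 800), # partial alignment
    ]
    print(recruit(hits))
```

Plotting each recruited read's percent identity against its position on the reference gives the similarity-versus-position view the viewer is built around.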
Fragment Recruitment Viewer
[Figure: Sequence Similarity vs Genomic Position]
Doug Rusch, JCVI
Doug Rusch and Michael Press
GOS Protein Analysis
Yooseph et al. (PLoS Biology 2007)
• Novel clustering process
- Sequence similarity based
- Predict putative proteins and group into related clusters
- Include GOS and all known proteins
• Findings
- GOS proteins cover ~all existing prokaryotic families
- GOS proteins expand diversity of known protein families
- ~10% of large clusters are novel
- Many are of viral origin
- No saturation in the rate of novel protein family discovery
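To make "sequence similarity based" clustering concrete, here is a deliberately simple stand-in: single-linkage clustering with union-find over precomputed similar pairs. It is not the clustering process from Yooseph et al., which is considerably more involved; the function name, the input format, and the "gos" ID prefix used to flag novel clusters are assumptions made for illustration.

```python
def cluster_by_similarity(protein_ids, similar_pairs):
    """Single-linkage clusters: proteins joined by any chain of similar pairs.

    similar_pairs is assumed to hold (id_a, id_b) tuples that already passed
    some similarity threshold, e.g. from an all-against-all search.
    """
    parent = {p: p for p in protein_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for p in protein_ids:
        clusters.setdefault(find(p), set()).add(p)
    # Largest clusters first; ties broken by smallest member ID.
    return sorted(clusters.values(), key=lambda c: (-len(c), min(c)))


if __name__ == "__main__":
    ids = ["known1", "known2", "gos1", "gos2", "gos3"]
    pairs = [("known1", "gos1"), ("gos1", "known2"), ("gos2", "gos3")]
    clusters = cluster_by_similarity(ids, pairs)
    # A cluster with no previously known member is "novel" in this toy setup.
    novel = [c for c in clusters if all(p.startswith("gos") for p in c)]
    print(clusters, novel)
```

In this toy input, the cluster containing only GOS proteins plays the role of a novel family, while the mixed cluster shows GOS sequences adding diversity to a known one.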
Added Protein Family Diversity
Yooseph et al. (PLoS Biology 2007)
[Figure: Rubisco homologs. Legend: New Groups, GOS prokaryotes, Known eukaryotes, Known prokaryotes]
Annotation of Environmental Shotgun Data
• Challenges:
- Lack of context
- Protein fragments
• Gene Finding
- Yooseph’s Protein Clusters + Metagene
• Functional Assignment
- Variation of JCVI prok annotation pipeline*
- Leverages protein cluster annotation -- soon
• Result:
- Quality Nearly Comparable to Prokaryotic Genomic Annotation
Protein Clusters: Advantages and Disadvantages
• Weaknesses
- Homology-based
- Stateful (also a strength)
- Less sensitive (for now)
• Strengths
- Exponential → Linear?
- Learns over time
- Easy to maintain
Increasing the pressure
• Nextgen + Metagenomics
- Deeper collections
- Short sequences less informative
• How should we annotate?
- When in doubt, use BLAST against NRAA, and other large and fast-growing collections
• Annotation needs growing dramatically
- 24x7 quality software
- Special Hardware: FPGA? Graphics/CUDA? SIMD/SSE?
- New algorithms?
- Back to supercomputers?
• Sharing data and computes
- Standardization of data, metadata, and computes
Folker Meyer, ANL
Science as a Service (ScaaS)
• Standard tools as services
- Service-Oriented Architecture
- Supported by HPC as necessary
- Grid workflow for integration
• Maintain tools & data in scalable compute environment
- Celera Assembler in the clouds
Vision for High Throughput Science
Construction of the Ark. Nuremberg Chronicle (1493).
Today: Scientist
Scientist + Engineers
http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html
Rodin’s Thinker
Credits
• JCVI Informatics Team
• Support
- DOE
- Gordon and Betty Moore Foundation
- NIAID