big data nebraska
RIDING THE DATA
TIDAL WAVE IN
MICROBIOLOGY
Future of Big Data, Lincoln, NE 11/6/2014
Adina Howe
germslab.org (Genomics and Environmental Research in Microbial Systems)
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
Slides available at www.slideshare.com/adinachuanghowe
Microbes are critical
Climate Change
Energy Supply
USGCRP 2009
www.alutiiq.com
http://guardianlv.com/
Human & Animal
Health
An understanding
of microbial ecology
Global Food
Security
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy
(e.g., pathogenic E. coli)
Function
(e.g., degrades cellulose)
Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome: 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990-2012]
Rapidly decreasing costs with
NGS Sequencing
Stein, Genome Biology, 2010
Next Generation Sequencing
4,500,000 bp (E. coli, $200, presently)
[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990-2012]
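The two price points quoted on these slides (a 4,500,000 bp E. coli genome for $4.5M in 1992, and for about $200 at the time of the talk) can be turned into a cost per base pair and a fold change. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and not from the talk.

```python
GENOME_BP = 4_500_000  # E. coli genome size quoted on the slide


def cost_per_bp(total_cost_usd, genome_bp=GENOME_BP):
    """Dollars spent per sequenced base pair."""
    return total_cost_usd / genome_bp


cost_1992 = cost_per_bp(4_500_000)  # $4.5M in 1992
cost_now = cost_per_bp(200)         # $200 at the time of the talk
fold_drop = cost_1992 / cost_now    # how many times cheaper per base

print(f"1992: ${cost_1992:.2f} per bp")
print(f"Talk: ${cost_now:.6f} per bp")
print(f"Fold decrease: {fold_drop:,.0f}x")
```

With the slide's numbers this works out to one dollar per base in 1992 and a roughly 22,500-fold drop since.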
Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
Personal genome can be
mapped in a few days for
hundreds to a few thousand
dollars
The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed Cultures
Natural systems
The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware
(doubling time 14 months)
Sanger Sequencing
(doubling time 19 months)
NGS (Shotgun) Sequencing
(doubling time 5 months)
[Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $) on log scales, by year, 1990-2012]
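The doubling times in the legend (14 months for computational hardware, 19 for Sanger sequencing, 5 for NGS) compound very differently over a few years. The sketch below assumes simple exponential growth, factor = 2^(months / doubling time), which is an idealization of the plotted trends; the helper names are illustrative.

```python
import math

# Doubling times in months, as quoted on the slide
DOUBLING_MONTHS = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}


def growth_factor(months_elapsed, doubling_months):
    """Fold increase after months_elapsed, assuming steady doubling."""
    return 2 ** (months_elapsed / doubling_months)


def months_to_outpace(fast_doubling, slow_doubling, fold_gap):
    """Months until the faster process leads the slower one by fold_gap.

    Solves 2^(t/fast) / 2^(t/slow) = fold_gap for t.
    """
    return math.log2(fold_gap) / (1 / fast_doubling - 1 / slow_doubling)


for name, doubling in DOUBLING_MONTHS.items():
    print(f"{name}: {growth_factor(60, doubling):,.0f}x in 5 years")

gap = months_to_outpace(5, 14, 1000)
print(f"NGS output outpaces hardware 1000-fold after about {gap:.0f} months")
```

Under these assumptions, sequencing output per dollar pulls away from computing capacity per dollar within a few years, which is the gap the rest of the talk addresses.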
Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
2014 = 50 Tbp
2015 = 500 Tbp budgeted
THE DIRT ON SOIL
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
MAGNIFICENT BIODIVERSITY
THE DIRT ON SOIL
SPATIAL HETEROGENEITY
http://www.fao.org/ www.cnr.uidaho.edu
THE DIRT ON SOIL
DYNAMIC
THE DIRT ON SOIL
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES
Philippot, 2013, Nature Reviews Microbiology
I. Technical side of microbial big data in
biology
II. Future of big data in soil microbial
communities
III. Bottlenecks for microbiologists
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
Lesson #1: Accessing information in
data
http://siliconangle.com/files/2010/09/image_thumb69.png
de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genomes
Metagenome assembly…a scaling
problem.
Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
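The toy reads above illustrate the core step of de novo assembly: finding where the end of one read overlaps the start of another and merging them. The sketch below is a deliberately naive exact-match version, not the algorithm used by any particular assembler; the function names are illustrative. The first two reads are error-free versions of slide reads, and the third is taken verbatim from the slide with its error ("Gas").

```python
def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that exactly equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), max(min_len, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge(left, right, min_len=3):
    """Join two reads on their longest exact overlap, or return None if there is none."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


read_a = "It was the best of times, it was the wor"
read_b = ", it was the worst of times, it was the"
read_with_error = "It was the best of times, it Gas the wor"

print(longest_overlap(read_a, read_b))
print(merge(read_a, read_b))
print(merge(read_with_error, read_b))  # the error sits inside the overlap
```

A single wrong character inside the overlap is enough to defeat exact matching, which is one reason sequencing errors and repeated phrases ("it was the") make assembly hard at scale.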
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
super computer
Assembly of 300 Gbp (70,000
genomes worth) can be done with
any assembly program in less
than 14 GB RAM and less than
24 hours.
50 Gbp = 10,000 genomes
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
Overkill
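The "1x / 10x / overkill" point can be made concrete with a back-of-the-envelope coverage calculation. The sketch below assumes reads are drawn in proportion to each organism's relative abundance and that all genomes are the same size; both are simplifications, and the community values are hypothetical, not data from the talk.

```python
def expected_coverage(total_bp, rel_abundance, genome_bp):
    """Average number of times each base of a genome is sampled."""
    return total_bp * rel_abundance / genome_bp


def bp_needed(target_coverage, rel_abundance, genome_bp):
    """Total sequencing required for one organism to reach target_coverage."""
    return target_coverage * genome_bp / rel_abundance


GENOME_BP = 5_000_000    # hypothetical genome size
ABUNDANT, RARE = 0.5, 0.001  # hypothetical relative abundances

# Sequence deeply enough that the rare organism reaches 10x...
depth = bp_needed(10, RARE, GENOME_BP)
# ...and see how heavily the abundant organism is oversampled at that depth.
abundant_cov = expected_coverage(depth, ABUNDANT, GENOME_BP)

print(f"Sequencing needed: {depth / 1e9:.0f} Gbp")
print(f"Rare organism coverage: {expected_coverage(depth, RARE, GENOME_BP):.0f}x")
print(f"Abundant organism coverage: {abundant_cov:.0f}x")
```

In this hypothetical community, reaching 10x on the rare organism samples the abundant one thousands of times over, and every extra read of the abundant genome brings additional sequencing errors with it.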
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Reduces dataset size for assembly by up to 95%, with the same assembly
outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
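The idea behind digital normalization, as described in Brown et al. (2012), is to stream through the reads and keep a read only if the median abundance of its k-mers, among reads kept so far, is still below a cutoff. The sketch below is a toy version under stated assumptions: it uses an exact in-memory counter, whereas memory-efficient implementations rely on approximate counting, and the reads and parameter values are made up for illustration.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Counter returns 0 for k-mers that have not been seen yet
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Hypothetical data: one heavily oversampled read and one rare read
reads = ["ACGTTGCAAG"] * 100 + ["TTTTGGGGCC"]
kept = digital_normalization(reads, k=4, cutoff=5)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The oversampled read is capped at the cutoff while the rare read is retained, so redundant data is discarded without losing low-abundance sequence.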
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
The reality?
More like…
Source: Chuck Haney
Howe et al., 2014, PNAS
What we learned from deeply sequencing
soil
Grand Challenge effort –
10% of soil biodiversity
sampled
Incredible soil biodiversity
(estimate required 10
Tbp/sample)
“To boldly go where no man
has gone before”: >60%
Unknown
[Figure: total KO count (0-400) by functional category: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility. Series: corn and prairie, corn only, prairie only]
Howe et al, 2014, PNAS
Managed agricultural soils exhibit less
diversity, likely because of their history of
cultivation.
If soil is so diverse, what are the most
consistent signals we can see at the plot
level?
Is there an identifiable “soil
functional core” (carbon cycling
focus)?
Kirsten Hofmockel, Iowa State University
Ames, Iowa, COBS Field Site, Fertilized Prairie Whole Soil Samples
4 deeply sampled whole soil metagenomes (16S rRNA 5000 reads, Shotgun 20-50 million reads)
How much genetic sequence do
you think is shared in 4 replicates?
More than 1%? 10%? 50%?
What kind of genes do you expect in this core?
Minimal critical genes will be abundant & diverse
How much genetic sequence do
you think is shared in 4 replicates?
More than 1%? 10%? 50%?
What kind of genes do you expect in this core?
Minimal critical genes have varying abundances &
are diverse
Core genes: soil-specific
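One simple way to frame the "how much is shared in 4 replicates?" question is as a set intersection: the core is whatever appears in every replicate, expressed as a fraction of everything observed. This is a sketch of that framing only, not the analysis used in the study; the gene identifiers are hypothetical placeholders.

```python
def core_fraction(replicates):
    """Return (fraction shared by all replicates, the shared set)."""
    sets = [set(r) for r in replicates]
    if not sets:
        return 0.0, set()
    shared = set.intersection(*sets)  # present in every replicate
    pooled = set.union(*sets)         # present in at least one replicate
    if not pooled:
        return 0.0, set()
    return len(shared) / len(pooled), shared


# Hypothetical gene identifiers observed in four replicate metagenomes
replicates = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g5"},
    {"g1", "g6"},
]
fraction, core = core_fraction(replicates)
print(f"Core: {sorted(core)} ({fraction:.0%} of all observed genes)")
```

Requiring presence in every replicate is a strict definition; a looser threshold (for example, presence in most replicates) would give a larger core.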
My vision (and future research)
Microbial markers for ecosystem services
Nutrient cycling
Pathogens
Antibiotic resistance
Biodiversity
Capabilities exist already…
http://vimeo.com/90059732
Indoor Microbiome Project Animation
Is more data better?
Bottlenecks for the emerging
microbiologists
Technical obstacles in the big data
deluge
Access to the data
Access to the resources
Democratization of both data and resource access
“80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)
Data volume and velocity
Previous efforts are difficult to integrate
Innovation is necessary
Software Developers
Computer Scientists
Clinicians
PIs
Data generators
Microbiologists
Data Analyzers
Statisticians
Bioinformaticians
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Data intensive microbiology
Social obstacles – the main
challenge
A shift in costs does not mean a shift in
expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than
the time it took you to do
your experiment to
analyze the data. Please
do not write me for
results within 24 hours of
your sequences
becoming available.
- Adina
Culture of sharing
http://www.heathershumaker.com/
Metagenomic Datasets
Training / Incentives
Emails between collaborators don’t contain as
much “science” as I’d like:
All analysis: accessible,
reproducible, and automated
To reproduce analysis in a publication,
1. Rent Amazon EC2 computer
2. Clone the GitHub repository containing data and scripts
3. Open IPython notebook and execute
To run same analysis on different dataset,
1. Replace data files with your own data, execute notebook.
2. Tweak scripts as needed.
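Steps 2 and 3 of the reproduction recipe can be scripted. The sketch below is one possible automation, with several assumptions: the repository URL, working directory, and notebook name are hypothetical placeholders, and it executes the notebook with `jupyter nbconvert`, the present-day successor to the IPython notebook tooling mentioned on the slide. Renting the cloud machine (step 1) is left out.

```python
import subprocess


def reproduction_commands(repo_url, workdir, notebook):
    """Build the commands for cloning a repository and executing its notebook."""
    return [
        ["git", "clone", repo_url, workdir],
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", notebook],
    ]


def reproduce(repo_url, workdir, notebook):
    """Clone the repository, then run the notebook from inside the clone."""
    clone, execute = reproduction_commands(repo_url, workdir, notebook)
    subprocess.run(clone, check=True)
    subprocess.run(execute, check=True, cwd=workdir)


# Hypothetical values; substitute the repository named in the publication
REPO_URL = "https://github.com/example-user/example-analysis.git"
commands = reproduction_commands(REPO_URL, "example-analysis", "analysis.ipynb")
for command in commands:
    print(" ".join(command))
```

To run the same analysis on a different dataset, the data files inside the cloned directory would be replaced before the notebook is executed.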
The journey in summary
RIDING THE BIG DATA
TIDAL WAVE OF MODERN
MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
Acknowledgements
C. Titus Brown (MSU)
James Tiedje (MSU)
Daina Ringus (UC)
Folker Meyer (ANL)
Eugene Chang (UC)
NSF Biology Postdoc Fellowship
DOE Great Lakes Bioenergy Research Center