big data nebraska

RIDING THE DATA

TIDAL WAVE IN

MICROBIOLOGY

Future of Big Data, Lincoln, NE 11/6/2014

Adina Howe

germslab.org (Genomics and Environmental Research in Microbial Systems)

Argonne National Laboratory / Michigan State University

Iowa State University, Ag & Biosystems Engr (January)

Slides available at www.slideshare.com/adinachuanghowe

Microbes are critical

Climate Change

Energy Supply

USGCRP 2009

www.alutiiq.com

http://guardianlv.com/

Human & Animal

Health

An understanding

of microbial ecology

Global Food

Security

Understanding community

dynamics

Who is there?

What are they doing?

How are they doing it?

Understanding community

dynamics

Who is there?

What are they doing?

How are they doing it?

Kim Lewis, 2010

Gene / Genome Sequencing

Collect samples

Extract DNA

Sequence DNA

“Analyze” DNA to identify its content and origin

Taxonomy

(e.g., pathogenic E. Coli)

Function

(e.g., degrades cellulose)

Cost of Sequencing

Stein, Genome Biology, 2010

E. Coli genome 4,500,000 bp ($4.5M, 1992)

1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012

Year

0.1

1

10

100

1,000

10,000

100,000

1,000,000

DN

A S

equencin

g, M

bp

per $

10,000,000

100,000,000

Rapidly decreasing costs with

NGS Sequencing


Next Generation Sequencing

4,500,000 bp (E. Coli, $200, presently)

1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012

Year

0.1

1

10

100

1,000

10,000

100,000

1,000,000

DN

A S

equencin

g, M

bp

per $

10,000,000

100,000,000

Effects of low cost

sequencing…

First free-living bacterium sequenced

for billions of dollars and years of

analysis

Personal genome can be

mapped in a few days and

hundreds to few thousand

dollars

The experimental continuum

Single Isolate

Pure Culture

Enrichment

Mixed CulturesNatural systems

The era of big data in biology


Computational Hardware

(doubling time 14 months)

Sanger Sequencing


NGS (Shotgun) Sequencing


1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012

Year

0

1

10

100

1,000

10,000

100,000

1,000,000

Dis

k S

tora

ge,

Mb/$

0.1

1

10

100

1,000

10,000

100,000

1,000,000

DN

A S

equencin

g, M

bp

per $

10,000,000

100,000,000

0.1

1

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

Postdoc experience with data

2003-2008 Cumulative sequencing in PhD = 2000 bp

2008-2009 Postdoc Year 1 = 50 Gbp

2009-2010 Postdoc Year 2 = 450 Gbp

2014 = 50 Tbp

2015 = 500 Tbp budgeted

THE DIRT ON SOIL

Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress

MAGNIFICENT BIODIVERSITY

THE DIRT ON SOIL

SPATIAL HETEROGENEITY

http://www.fao.org/ www.cnr.uidaho.edu

THE DIRT ON SOIL

DYNAMIC

THE DIRT ON SOIL

INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES

Philippot, 2013, Nature Reviews Microbiology

I. Technical side of microbial big data in

biology

II. Future of big data in soil microbial

communities

III. Bottlenecks for microbiologists

Tackling Soil Biodiversity

Source: Chuck Haney

C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)

Janet Jansson, Susannah Tringe (JGI)

Lesson #1: Accessing information in

data

http://siliconangle.com/files/2010/09/image_thumb69.png

de novo assembly

Compresses dataset size significantly

Improved data quality (longer sequences, gene order)

Reference not necessary (novelty)

Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes

Metagenome assembly…a scaling

problem.

Shotgun sequencing and de novo

assembly

It was the Gest of times, it was the wor

, it was the worst of timZs, it was the

isdom, it was the age of foolisXness

, it was the worVt of times, it was the

mes, it was Ahe age of wisdom, it was th

It was the best of times, it Gas the wor

mes, it was the age of witdom, it was th

isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

Practical Challenges – Intensive

computing

Howe et al, 2014, PNAS

Months of

“computer

crunching” on a

super computer

Practical Challenges – Intensive

computing


Months of

“computer

crunching” on a

super computerAssembly of 300 Gbp (70,000

genomes worth) can be done with

any assembly program in less

than 14 GB RAM and less than

24 hours.

50 Gbp = 10,000 genomes

Natural community characteristics

Diverse

Many organisms

(genomes)


Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x


Diverse

Many organisms

(genomes)

Variable abundance


more often


of sampling


Sample 1x Sample 10x


Diverse

Many organisms

(genomes)

Variable abundance


more often


of sampling


Sample 1x Sample 10x

Overkill

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Scales datasets for assembly up to 95% - same assembly

outputs.

Genomes, mRNA-seq, metagenomes (soils, gut, water)

Tackling Soil Biodiversity

Source: Chuck Haney

C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)

Janet Jansson, Susannah Tringe (JGI)

The reality?

More like…

Source: Chuck HaneyHowe et. al, 2014, PNAS

What we learned from deeply sequencing

soil

Grand Challenge effort –

10% of soil biodiversity

sampled

Incredible soil biodiversity

(estimate required 10

Tbp/sample)

“To boldly go where no man

has gone before”: >60%

Unknown

0

100

200

300

400

am

ino a

cid

meta

bolis

m

carb

ohydra

te m

eta

bo

lism

mem

bra

ne tra

nspo

rt

sig

nal tr

ansdu

ction

transla

tion

fold

ing

, sort

ing a

nd d

egra

da

tion

meta

bolis

m o

f co

facto

rs a

nd v

itam

ins

energ

y m

eta

bolis

m

transp

ort

and

cata

bolis

m

lipid

meta

bolis

m

tra

nscri

ption

ce

ll g

row

th a

nd

dea

th

replic

ation

and

rep

air

xen

obio

tics b

iod

egra

datio

n a

nd m

eta

bo

lism

nucle

otide m

eta

bolis

m

gly

can b

iosynth

esis

and m

eta

bolis

m

meta

bolis

m o

f te

rpenoid

s a

nd

poly

ke

tides

cell

motilit

y

Tota

l C

ount

KO

corn and prairie

corn only

prairie only


Managed agriculture soils exhibit less

diversity, likely from its history of

cultivation.

If soil is so diverse, what are the most

consistent signals we can see at the plot

levels?

Is there an identifiable “soil

functional core” (carbon cycling

focus)?

Kirsten Hofmockel, Iowa State University

Ames, Iowa, COBS Field Site, Fertilized Prairie Whole Soil Samples

4 deeply sampled whole soil metagenomes (16S rRNA 5000 reads, Shotgun 20-50 million reads)

How much genetic sequence do

you think is shared in 4 replicates?

More than 1%? 10%? 50%?

What kind of genes do you expect in this core?

Minimal critical genes will be abundant & diverse

How much genetic sequence do

you think is shared in 4 replicates?

More than 1%? 10%? 50%?

What kind of genes do you expect in this core?

Minimal critical genes are varying abundances &

diverse

Core genes: soil-specific

My vision (and future research)

Microbial markers for ecosystem services

Nutrient cycling

Pathogens

Antibiotic resistance

Biodiversity

Capabilities exist already…

http://vimeo.com/90059732

Indoor Microbiome Project Animation

Is more data better?

Bottlenecks for the emerging

microbiologists

Technical obstacles in the big data

deluge

Access to the data

Access to the resources

Democratization of both data and resource access

“80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)

Data volume and velocity

Previous efforts are difficult to integrate

Innovation is necessary

Software Developers

Computer Scientists

Clinicians

PIs

Data generators

Microbiologists

Data Analyzers

Statisticians

Bioinformaticians

http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html

Data intensive microbiology

Social obstacles – the main

challenge

Shift of costs do not mean shift of

expectations

http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/

Dear PI,

It will take longer than

the time it took you to do

your experiment to

analyze the data. Please

do not write me for

results within 24 hours of

your sequences

becoming available.

- Adina

Culture of sharing

http://www.heathershumaker.com/

Metagenomic Datasets

Training / Incentives

Emails between collaborators don’t contain as

much “science” as I’d like:

All analysis: accessible,

reproducible, and automated

All analysis: accessible,

reproducible, and automated

To reproduce analysis in a publication,

1. Rent Amazon EC2 computer

2. Clone github repository containing data and scripts

3. Open IPython notebook and execute

To run same analysis on different dataset,

1. Replace data files with your own data, execute notebook.

2. Tweak scripts as needed.

The journey in summary

RIDING THE BIG DATA

TIDAL WAVE OF MODERN

MICROBIOLOGY

Adina Howe



“”

RIDING THE BIG DATA

TIDAL WAVE OF MODERN

MICROBIOLOGY

Adina Howe



Acknowledgements

C. Titus Brown (MSU)

James Tiedje (MSU)

Daina Ringus (UC)

Folker Meyer (ANL)

Eugene Chang (UC)

NSF Biology Postdoc Fellowship

DOE Great Lakes Bioenergy Research Center

big data nebraska

Science

genome biology

coli genome

understanding microbes

effects of low cost

cost of sequencingstein

simple questions

microbiologyfuture of

organizersbig data new