2014 nyu-bio-talk

SCALABLE APPROACHES TO EXPLORING MICROBIAL DIVERSITY C. Titus Brown [email protected] Asst Professor, MMG / CSE; Michigan State University 1/15: Population Health & Reproduction, VetMed, UC Davis Talk slides on slideshare.net/c.titus.brown

Upload: ctitusbrown

Post on 02-Jul-2015


Category:

Science



DESCRIPTION

Talk on sequence analysis at NYU CGSB

TRANSCRIPT

Page 1: 2014 nyu-bio-talk

SCALABLE APPROACHES

TO EXPLORING

MICROBIAL DIVERSITY

C. Titus Brown

[email protected]

Asst Professor, MMG / CSE; Michigan State University

1/15: Population Health & Reproduction, VetMed, UC Davis

Talk slides on slideshare.net/c.titus.brown

Page 2: 2014 nyu-bio-talk

Funding and motivation:

Page 3: 2014 nyu-bio-talk

The central question of my lab --

How can we most effectively use computation to extract

information from large sequence data sets, for the purpose

of better understanding non- and semi-model organisms?

Focus on environmental microbes, marine animals,

& agricultural and veterinary animals.

Page 4: 2014 nyu-bio-talk

Biology is becoming data rich – and a

rising tide lifts all boats!

http://susieinfrance.blogspot.com/2010/06/rising-tide-lifts-all-boats.html

Page 5: 2014 nyu-bio-talk

…but sometimes the tide comes in a bit

fast.

Page 6: 2014 nyu-bio-talk

Our foil for today:

Investigating soil microbial communities

Life on earth depends on soil microbes, but:

• 95% or more of soil microbes cannot be cultured in lab.

• Very little transport in soil and sediment =>

slow mixing rates.

• Estimates of immense diversity:

• Billions of microbial cells per gram of soil.

• Million+ microbial species per gram of soil (Gans et al, 2005)

• One observed lower bound for genomic sequence complexity =>

26 Gbp (Amazon Rain Forest Microbial Observatory)

Page 7: 2014 nyu-bio-talk

N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS

http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html

“By 'soil' we understand (Vil'yams, 1931) a loose surface

layer of earth capable of yielding plant crops. In the physical

sense the soil represents a complex disperse system

consisting of three phases: solid, liquid, and gaseous.”

Microbes live in & on:

• Surfaces of

aggregate particles;

• Pores within

microaggregates;

Page 8: 2014 nyu-bio-talk

Specific questions to address:

• Role of soil microbes in nutrient cycling?

• How does agricultural soil differ from native soil?

• How do soil microbial communities respond to climate

perturbation?

• Genome-level questions:

• What kind of strain-level heterogeneity is present in the population?

• What are the phage and viral populations & dynamics thereof?

• What species are where, and how much is shared between

different geographical locations?

Page 9: 2014 nyu-bio-talk

Must use culture-independent and metagenomic approaches

• Many reasons why you can’t or don’t want to culture: cross-feeding, niche specificity, dormancy, etc.

• If you want to get at underlying function, 16S analysis alone is not sufficient.

Single-cell sequencing & shotgun metagenomics are two

common ways to investigate complex microbial communities.

Page 10: 2014 nyu-bio-talk

Shotgun metagenomics

• Collect samples;

• Extract DNA;

• Feed into sequencer;

• Computationally analyze.

Wikipedia: Environmental shotgun

sequencing.png

“Sequence it all and let the

bioinformaticians sort it

out”

Page 11: 2014 nyu-bio-talk

Computational reconstruction of

(meta)genomic content.

http://eofdreams.com/library.html;

http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;

http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

Page 12: 2014 nyu-bio-talk

Points:

• Lots of fragments needed! (Deep sampling.)

• Having read and understood some books will help quite a bit

(Reference genomes.)

• Rare books will be harder to reconstruct than common books.

• Errors in OCR process matter quite a bit. (Sequencing error)

• The more, different specialized libraries you sample, the more

likely you are to discover valid correlations between topics and

books. (We don’t understand most microbial function.)

• A categorization system would be an invaluable but not

infallible guide to book topics. (Phylogeny can guide

interpretation.)

• Understanding the language would help you validate &

understand the books.

Page 13: 2014 nyu-bio-talk

Great Prairie Grand Challenge -- SAMPLING LOCATIONS

2008

Page 14: 2014 nyu-bio-talk

A “Grand Challenge” dataset (DOE/JGI)

[Figure: bar chart of basepairs of sequencing (Gbp), y-axis 0-600, split by platform (GAII, HiSeq), for eight samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass.]

Reference points marked on the chart:

• Rumen (Hess et al., 2011), 268 Gbp

• Rumen k-mer filtered, 111 Gbp

• MetaHIT (Qin et al., 2011), 578 Gbp

• NCBI nr database, 37 Gbp

Total: 1,846 Gbp soil metagenome

Page 16: 2014 nyu-bio-talk

My algorithm research: 3 methods.

1. Adaptation of a suite of probabilistic data structures for

representing set membership and counting (Bloom filters

and CountMin Sketch). (Zhang et al., PLoS One, 2014.)

2. An online streaming approach to lossy compression of

sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.)

3. Compressible de Bruijn graph representation for

assembly. (Pell et al., PNAS, 2012.)
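
To make the first method more concrete, here is a minimal Count-Min Sketch used for approximate k-mer counting. It is only a sketch under stated assumptions, not the implementation from the cited papers: the class name, the table sizes, and the use of `hashlib` for hashing are all illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates it."""

    def __init__(self, width=1000, depth=4):
        # width and depth are illustrative; real uses size them from memory budgets.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.tables[row][col] += 1

    def count(self, item):
        # The minimum over rows is the estimate least inflated by collisions.
        return min(self.tables[row][col] for row, col in self._indexes(item))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


sketch = CountMinSketch()
for kmer in kmers("ATGGCGTGCAATGGCGTGCA", 5):
    sketch.add(kmer)
```

The memory used is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the property that makes this family of structures attractive for very large data sets.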

Page 17: 2014 nyu-bio-talk

Method #2 - Digital normalization (a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
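
The discarding step can be sketched roughly as follows. This follows the median k-mer abundance idea described in Brown et al. (arXiv, 2012): a read is kept only if the median count of its k-mers, among reads kept so far, is still below a coverage cutoff. The function names, the tiny `k`, the cutoff, and the use of an exact `Counter` in place of a probabilistic counting structure are simplifying assumptions for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=3):
    """Yield reads in a single pass, keeping a read only while its region looks under-sampled."""
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            yield read


# "A" is sampled 10 times and "B" once, mimicking the 10:1 ratio described above.
read_a = "ACGTTGCAAGCT"
read_b = "GGATCCTTAGGA"
reads = [read_a] * 10 + [read_b]
kept = list(digital_normalize(reads, k=4, cutoff=3))
```

In this toy input the redundant copies of A are dropped once A reaches the cutoff, while the single copy of B is retained, so the abundant genome no longer dominates the data handed to the assembler.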

Page 18: 2014 nyu-bio-talk

Digital normalization

Page 24: 2014 nyu-bio-talk

Putting it in perspective:

Total equivalent of ~1200 bacterial genomes

Human genome ~3 billion bp

Assembling Iowa prairie and Iowa corn:

Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding

2.5 bill         4.5 mill                   19%                 5.3 mill

3.5 bill         5.9 mill                   22%                 6.8 mill

Adina Howe

Page 25: 2014 nyu-bio-talk

Resulting contigs are all low coverage.

Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.

Howe et al., 2014

Page 26: 2014 nyu-bio-talk

Iowa prairie & corn DNA abundances are very even.

[Figure panels: Corn; Prairie]

Howe et al., 2014

Page 27: 2014 nyu-bio-talk

Assembly is a good idea:

Howe et al., 2014

Page 28: 2014 nyu-bio-talk

Howe et al., 2014

Analyses of

metabolic potential

begin to illuminate

differences.

Page 29: 2014 nyu-bio-talk

We see little strain variation in sample.

[Figure: top two allele frequencies (y-axis) vs. position within contig (x-axis)]

Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.

Can measure by read mapping.
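
One way such a measurement could be made from mapped reads is sketched below: for each contig position, take the bases that mapped reads show there, compute the top two allele frequencies, and report the fraction of positions where the second allele is common. The input format, the function names, the `minor_cutoff` parameter, and this particular definition of "polymorphism rate" are assumptions for illustration, not the analysis used in the talk.

```python
from collections import Counter


def top_two_allele_frequencies(column):
    """Return (major, minor) allele frequencies for one non-empty pileup column."""
    counts = Counter(column).most_common(2)
    total = len(column)
    major = counts[0][1] / total
    minor = counts[1][1] / total if len(counts) > 1 else 0.0
    return major, minor


def polymorphism_rate(pileup, minor_cutoff=0.2):
    """Fraction of positions whose second allele is at or above minor_cutoff (assumed definition)."""
    variable = sum(
        1 for column in pileup
        if top_two_allele_frequencies(column)[1] >= minor_cutoff
    )
    return variable / len(pileup)


# Hypothetical pileup: each string holds the bases that mapped reads show at one position.
pileup = ["AAAAAAAAAA", "CCCCCCCCCT", "GGGGGTTTTT", "TTTTTTTTTT"]
rate = polymorphism_rate(pileup)
```

A single discordant base, as in the second column, stays below the cutoff and is treated as likely sequencing error rather than strain variation.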

Page 30: 2014 nyu-bio-talk

Biogeography: Iowa sample overlap?

Corn and prairie content graphs have 51% nucleotide

overlap.

Corn Prairie

Suggests that at greater depth, samples may have similar

genomic content.

Page 31: 2014 nyu-bio-talk

Biogeography of genomic DNA in soil

How much genomic richness is shared

between different sites?

Qingpeng Zhang

Page 32: 2014 nyu-bio-talk

So, for soil:

• We really do need more data;

• But at least now we can assemble what we already have.

• Estimate required sequencing depth at 50 Tbp;

• Now also have 2-8 Tbp from Amazon Rain Forest

Microbial Observatory.

• …still not saturated coverage, but getting closer.

Iowa soil work has been published:

Howe et al., 2014, PNAS.

Page 33: 2014 nyu-bio-talk

So, for soil:

Note! There are now much faster assembly approaches…!

See: Megahit, http://arxiv.org/abs/1409.7208

(Technology marches on!)

Page 34: 2014 nyu-bio-talk

So, for soil:

• We really do need more data;

• But at least now we can assemble what we already have.

• Estimate required sequencing depth at 50 Tbp;

• Now also have 2-8 Tbp from Amazon Rain Forest

Microbial Observatory.

• …still not saturated coverage, but getting closer.

But, diginorm approach turns out to also be widely

useful.

Page 35: 2014 nyu-bio-talk

Digital normalization is popular…

Estimated ~1000 users of our software.

Diginorm algorithm now included in Trinity

software from Broad Institute (~10,000 users)

Illumina TruSeq long-read technology now

incorporates our approach (~100,000 users)

Page 36: 2014 nyu-bio-talk

The data problem: Looking forward 5

years…

Navin et al., 2011

Page 37: 2014 nyu-bio-talk

Some basic math:

• 1000 single cells from a tumor…

• …sequenced to 40x haploid coverage with Illumina…

• …yields 120 Gbp each cell…

• …or 120 Tbp of data.

• HiSeq X10 can do the sequencing in ~3 weeks.

• The variant calling will require 2,000 CPU weeks…

• …so, given ~2,000 computers, can do this all in one

month.
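
The arithmetic behind these bullets can be checked in a few lines. The genome size of ~3 billion bp is the figure quoted earlier in the talk; treating each computer as one CPU with perfect parallelism is an assumption made here to reproduce the "one month" estimate.

```python
HUMAN_HAPLOID_GENOME_BP = 3e9   # ~3 billion bp
COVERAGE = 40                   # 40x haploid coverage
CELLS = 1000                    # single cells from one tumor

bp_per_cell = COVERAGE * HUMAN_HAPLOID_GENOME_BP   # 1.2e11 bp = 120 Gbp
total_bp = bp_per_cell * CELLS                     # 1.2e14 bp = 120 Tbp

SEQUENCING_WEEKS = 3            # HiSeq X10 estimate from the slide
CPU_WEEKS = 2000                # variant calling estimate from the slide
COMPUTERS = 2000

# Assumes one CPU per computer and perfect parallelism.
calling_weeks = CPU_WEEKS / COMPUTERS
total_weeks = SEQUENCING_WEEKS + calling_weeks

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell, {total_bp / 1e12:.0f} Tbp total")
print(f"~{total_weeks:.0f} weeks end to end")
```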

Page 38: 2014 nyu-bio-talk

Similar math applies:

• Pathogen detection in blood;

• Environmental sequencing;

• Sequencing rare DNA from circulating blood.

• Two issues:

• Volume of data & compute

infrastructure;

• Latency for clinical applications.

Page 39: 2014 nyu-bio-talk

We face an infinite data problem.

• For all intents and purposes

• For example, Illumina estimates that 228,000 human

genomes will be resequenced this year, primarily by

researchers; this is only going to grow.

• Similar stories across all of biology (although #s lower :)

Page 40: 2014 nyu-bio-talk

Current analysis approaches are multipass,

e.g. variant calling:

Data → Mapping → Sorting → Calling → Answer

On infinite data, you really only want to look at the data once…

Page 41: 2014 nyu-bio-talk

Streaming algorithms can be very efficient

Data → 1-pass → Answer

See also eXpress, Roberts et al., 2013.

Page 42: 2014 nyu-bio-talk

Some key points --

• Digital normalization is streaming.

• Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass)

• Currently, primarily used for prefiltering for assembly, but

relies on underlying abstraction (De Bruijn graph) that is

also used in variant calling.

Page 43: 2014 nyu-bio-talk

Digital normalization

Page 49: 2014 nyu-bio-talk

Error correction as the solution for our ills

Current work: error correction (??)

Errors in sequencing data are at the root of many

problems:

• Assembly is 100x lower memory in the absence of errors.

• Mapping is computationally trivial when there are no

errors.

• Variant calling and genotyping become simple, as does

species detection.
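
A small simulation can illustrate why errors cost memory in k-mer based assembly: each substitution error can introduce up to k novel k-mers that have to be stored. The genome length, read length, error rate, and k below are illustrative assumptions, and the simulation is not meant to reproduce the 100x figure.

```python
import random


def distinct_kmers(reads, k):
    """Collect the set of distinct k-mers across all reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly from the genome, adding substitution errors at error_rate."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute a different base to model a sequencing error.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
k = 20

clean = simulate_reads(genome, n_reads=2000, read_len=100, error_rate=0.0, rng=rng)
noisy = simulate_reads(genome, n_reads=2000, read_len=100, error_rate=0.01, rng=rng)

n_clean = len(distinct_kmers(clean, k))
n_noisy = len(distinct_kmers(noisy, k))
print(f"distinct {k}-mers without errors: {n_clean}, with errors: {n_noisy}")
```

Without errors, the number of distinct k-mers is bounded by the genome size no matter how deep the sampling; with errors, it keeps growing with the amount of data.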

Page 50: 2014 nyu-bio-talk

We can error correct high-coverage shotgun data

with k-mer spectra:

Chaisson et al., 2009

Erroneous k-mers

True k-mers
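
The intuition behind the k-mer spectrum is that, at high coverage, true k-mers are seen many times while k-mers containing an error are seen once or only a few times. The sketch below builds the abundance spectrum and splits k-mers with a fixed cutoff. The function names and the cutoff rule are simplifying assumptions, not the method of Chaisson et al.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_spectrum(counts):
    """Map abundance -> number of distinct k-mers with that abundance."""
    return Counter(counts.values())


def split_by_abundance(counts, cutoff):
    """Treat k-mers below the cutoff as likely erroneous (an assumed, simple rule)."""
    trusted = {km for km, c in counts.items() if c >= cutoff}
    suspect = {km for km, c in counts.items() if c < cutoff}
    return trusted, suspect


# Ten error-free copies of a read plus one copy with a single substitution (G -> T).
reads = ["ACGTACGGTCA"] * 10 + ["ACGTACTGTCA"]
counts = kmer_counts(reads, k=5)
spectrum = abundance_spectrum(counts)
trusted, suspect = split_by_abundance(counts, cutoff=3)
```

Here the one substitution produces five k-mers that each occur once, which is what places them in the low-abundance peak of the spectrum.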

Page 51: 2014 nyu-bio-talk

Streaming error correction on E. coli data

1% error rate, 100x coverage.

Michael Crusoe, Jordan Fish, Jason Pell

                   TP            FP           TN            FN
                   (corrected)   (mistakes)   (OK)          (missed)

Error correction   3,494,631     3,865        460,601,171   5,533

(Early days…)
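
For a sense of scale, the counts in the table above can be turned into standard rates. The choice of metrics here is mine, added for illustration; only the four counts come from the slide.

```python
# Counts taken from the table above.
TP, FP, TN, FN = 3_494_631, 3_865, 460_601_171, 5_533

sensitivity = TP / (TP + FN)   # fraction of real errors that were corrected
precision = TP / (TP + FP)     # fraction of corrections that were right
specificity = TN / (TN + FP)   # fraction of correct bases left alone

print(f"sensitivity={sensitivity:.4f} "
      f"precision={precision:.4f} "
      f"specificity={specificity:.6f}")
```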

Page 54: 2014 nyu-bio-talk

Single pass, reference free, tunable, streaming

online variant calling.

Error correction variant calling

Page 55: 2014 nyu-bio-talk

Streaming with reads…

[Diagram: reads (Sequence...) stream one at a time into a Graph, which yields Variants.]

Page 56: 2014 nyu-bio-talk

Analysis is done after sequencing.

Sequencing → Analysis

Page 57: 2014 nyu-bio-talk

Streaming with bases

[Diagram: for each read, the first k bases and then bases k+1, k+2, ... stream into a Graph as they are sequenced; the Graph yields Variants.]

Page 58: 2014 nyu-bio-talk

Integrate sequencing and analysis

Sequencing

Analysis

Are we done yet?

Page 59: 2014 nyu-bio-talk

What does the future hold?

• More emphasis on training and infrastructure.

• Data integration!

• Identifying the function of unknown genes…

Page 60: 2014 nyu-bio-talk

Summer NGS workshop (2010-2017)

Page 61: 2014 nyu-bio-talk

The infrastructure challenge

In 5-10 years, we will have nigh-infinite data.

(Genomic, transcriptomic, proteomic, metabolomic,

…?)

We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.

Page 62: 2014 nyu-bio-talk

Distributed graph database server

[Diagram components: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Graph query layer; Public servers; "Walled garden" server; Private server; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]

Page 63: 2014 nyu-bio-talk

Data integration?

Once you have all the data, what do you do?

"Business as usual simply cannot work."

Looking at millions to billions of genomes.

(David Haussler, 2014)

Page 64: 2014 nyu-bio-talk

Putting it in perspective:

Total equivalent of ~1200 bacterial genomes

Human genome ~3 billion bp

My charge: We don’t know what most genes do.

Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding

2.5 bill         4.5 mill                   19%                 5.3 mill

3.5 bill         5.9 mill                   22%                 6.8 mill

Howe et al, 2014; pmid 24632729

Page 65: 2014 nyu-bio-talk

Data Intensive Biology

Opportunities & challenges; how can we best support the

biology?

"I have traveled the length and breadth of this

country and talked with the best people, and I can

assure you that data processing is a fad that won't

last out the year." --The editor in charge of business

books for Prentice Hall, 1957

Page 66: 2014 nyu-bio-talk

Thanks!

Key points:

• Facing nigh-infinite data situation;

• The first stages of sequence analysis, assembly and variant

calling, are computationally intensive (but we’re hoping to fix

that);

• Training in data intensive biology is critical to the future of

biology.

• Data sharing and data integration infrastructure is also critical.

Page 67: 2014 nyu-bio-talk

Graph alignment can detect read saturation

Page 68: 2014 nyu-bio-talk

Proposal: distributed graph database server

[Diagram components: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Graph query layer; Public servers; "Walled garden" server; Private server; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]

Page 72: 2014 nyu-bio-talk

Graph queries across public & walled-garden data sets:

[Diagram: nodes "assembled sequence", "nitrite reductase", "ppaZ", and "raw sequence", connected by edges labeled SIMILARITY TO and ALSO CONTAINS.]

See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.