Transcript
Page 1: Why is Bioinformatics a Good Fit for Spark?

Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark?

Timothy Danford

AMPLab

Page 2

A One-Slide Introduction to Genomics

Page 3

Bioinformatics computation is batch processing and workflows

● Bioinformatics has a lot of “workflow engines”
  ○ Galaxy, Taverna, Firehose, Zamboni, Queue, Luigi, bPipe
  ○ bash scripts
  ○ even make, fer cryin’ out loud
  ○ a new one every day

● Bioinformatics software development is still largely a research activity

Page 4

State-of-the-Art infrastructure: shared filesystems, handwritten parallelism

● Hand-written task creation

● File formats instead of APIs or data models
  ○ formats are poorly defined
  ○ contain optional or redundant fields
  ○ semantics are unclear

● Workflow engines can’t take advantage of common parallelism between stages

Page 5
Page 6

So, why Spark?

Page 7

Most of Genomics is 1-D Geometry
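To make the slide's claim concrete: reads, variants, and genes are all intervals on a reference sequence, so most relationships between them reduce to 1-D interval geometry. A minimal plain-Python sketch (the `ReferenceRegion` name and fields are illustrative, not ADAM's actual API):

```python
# Hypothetical sketch: genomic features as 1-D intervals on a contig.
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferenceRegion:
    contig: str  # e.g. "chr1"
    start: int   # 0-based, inclusive
    end: int     # exclusive

    def overlaps(self, other: "ReferenceRegion") -> bool:
        # Two intervals on the same contig overlap iff each starts
        # before the other ends -- plain 1-D interval geometry.
        return (self.contig == other.contig
                and self.start < other.end
                and other.start < self.end)

read = ReferenceRegion("chr1", 100, 150)
gene = ReferenceRegion("chr1", 140, 5000)
print(read.overlaps(gene))  # True: they share bases 140..149
```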

Page 8

Most of Genomics is 1-D Geometry

Page 9

The rest is iterative evaluation of probabilistic models!
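As a toy instance of such iterative model evaluation, here is a minimal EM loop estimating a population allele frequency from per-sample genotype likelihoods. The function, its inputs, and the numbers are made up for illustration; this is not taken from any real variant caller.

```python
# Toy EM: estimate alternate-allele frequency q from per-sample
# genotype likelihoods (L0, L1, L2) = P(reads | 0, 1, or 2 alt copies).
def em_allele_frequency(geno_likelihoods, iters=20):
    q = 0.5  # initial alt-allele frequency
    for _ in range(iters):
        expected_alt = 0.0
        for l0, l1, l2 in geno_likelihoods:
            # E-step: Hardy-Weinberg prior on genotypes given q.
            p0 = (1 - q) ** 2 * l0
            p1 = 2 * q * (1 - q) * l1
            p2 = q * q * l2
            z = p0 + p1 + p2
            expected_alt += (p1 + 2 * p2) / z  # E[# alt alleles | data]
        # M-step: new frequency from expected allele counts.
        q = expected_alt / (2 * len(geno_likelihoods))
    return q

# Three samples that look hom-ref, het, hom-alt -- by symmetry the
# estimate should sit near 0.5.
freq = em_allele_frequency([(0.9, 0.1, 0.0),
                            (0.1, 0.8, 0.1),
                            (0.0, 0.1, 0.9)])
print(round(freq, 2))
```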

Page 10

Spark RDDs and Partitioners allow declarative parallelization for genomics

● Genomics computation is parallelized in a small, standard number of ways
  ○ by position
  ○ by sample

● Declarative, flexible partitioning schemes are useful
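A hedged sketch of what "parallelize by position" means in practice: a function playing the role of a custom Spark Partitioner, mapping a genomic coordinate to a partition index. The bin size and names are assumptions for illustration, written in plain Python rather than against Spark's API.

```python
# Assumed bin size: 1 Mb of reference per bin, for illustration only.
BIN_SIZE = 1_000_000

def partition_for(contig_offset: int, position: int, num_partitions: int) -> int:
    """Map an absolute genomic coordinate (contig offset + position on
    that contig) to a partition index, binning nearby records together."""
    return ((contig_offset + position) // BIN_SIZE) % num_partitions

# Nearby reads land in the same partition, so per-region work
# (pileups, local joins) needs no shuffle across partitions.
print(partition_for(0, 10_500, 8) == partition_for(0, 10_900, 8))  # True
```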

Page 11

Spark can easily express genomics primitives: join by genomic overlap

1. Calculate disjoint regions based on left (blue) set

2. Partition both sets by disjoint regions

3. Merge-join within each partition

4. (Optional) aggregation across joined pairs
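The four steps above can be sketched in plain Python (what ADAM expresses over partitioned Spark RDDs; the interval representation and function names here are illustrative):

```python
# Intervals are (start, end) pairs on a single contig, end exclusive.

def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into disjoint regions."""
    regions = []
    for start, end in sorted(left):
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

def overlap_join(left, right):
    regions = disjoint_regions(left)

    # Step 2: bucket both sets by the disjoint regions they overlap.
    def bucket(intervals):
        parts = {i: [] for i in range(len(regions))}
        for iv in intervals:
            for i, (rs, re) in enumerate(regions):
                if iv[0] < re and rs < iv[1]:
                    parts[i].append(iv)
        return parts

    lp, rp = bucket(left), bucket(right)

    # Step 3: join within each partition, pairing overlapping intervals.
    pairs = []
    for i in range(len(regions)):
        for l in lp[i]:
            for r in rp[i]:
                if l[0] < r[1] and r[0] < l[1]:
                    pairs.append((l, r))
    return pairs  # Step 4 (aggregation across pairs) would go here.

print(overlap_join([(0, 10), (5, 20)], [(8, 12), (100, 110)]))
```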

Page 12

ADAM is Genomics + Spark

● Combines three technologies
  ○ Spark
  ○ Parquet
  ○ Avro

● Apache 2-licensed

● Started at the AMPLab

● A rewrite of core bioinformatics tools and algorithms in Spark

http://bdgenomics.org/

Page 13

Avro and Parquet are just as critical to ADAM as Spark

● Avro to define data models

● Parquet for serialization format

● Still need to answer design questions
  ○ how wide are the schemas?
  ○ how much do we follow existing formats?
  ○ how do we carry through projections?

Page 14

Still need to convince bioinformaticians to rewrite their software!

Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)

Page 15

Still need to convince bioinformaticians to rewrite their software!

● A single piece of a single filtering stage for a somatic variant caller

● “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order

Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
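To make the contrast concrete, that windowed filter can be stated against a data model rather than a file format's sort order: "reads overlapping an 11-base-pair window centered on a candidate mutation" is just an interval query. This is an illustrative plain-Python sketch, not the actual caller's code.

```python
# Reads are (start, end) alignments on one contig, end exclusive.
WINDOW = 11  # the 11-bp window from the slide

def reads_in_window(reads, candidate_pos):
    """Return the reads overlapping an 11-bp window centered on a
    candidate mutation, independent of any file sort order."""
    half = WINDOW // 2
    w_start, w_end = candidate_pos - half, candidate_pos + half + 1
    return [r for r in reads if r[0] < w_end and w_start < r[1]]

reads = [(90, 101), (95, 106), (120, 131)]
print(reads_in_window(reads, 100))  # the first two overlap bases 95..105
```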

Page 16

The Future: Distributed and Incremental?

● Today: 5k samples x 20 Gb / sample

● Tomorrow: 1m+ samples @ 200+ Gb / sample?

● More and more analysis is aggregative
  ○ joint variant calling
  ○ panels of normal samples
  ○ collective variant annotation

● And “data collection” will never be finished

Page 17

Thank you! (questions?)

Acknowledgements

Matt Massie (AMPLab)
Frank Nothaft (AMPLab)
Carl Yeksigian (DataStax)
Anthony Philippakis (Broad Institute)
Jeff Hammerbacher (Cloudera / Mt. Sinai)

