
Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark?

Timothy Danford

AMPLab

A One-Slide Introduction to Genomics

Bioinformatics computation is batch processing and workflows

● Bioinformatics has a lot of “workflow engines”
  ○ Galaxy, Taverna, Firehose, Zamboni, Queue, Luigi, bPipe
  ○ bash scripts
  ○ even make, fer cryin’ out loud
  ○ a new one every day

● Bioinformatics software development is still largely a research activity

State-of-the-Art infrastructure: shared filesystems, handwritten parallelism

● Hand-written task creation

● File formats instead of APIs or data models
  ○ formats are poorly defined
  ○ contain optional or redundant fields
  ○ semantics are unclear

● Workflow engines can’t take advantage of common parallelism between stages

So, why Spark?

Most of Genomics is 1-D Geometry

The rest is iterative evaluation of probabilistic models!

Spark RDDs and Partitioners allow declarative parallelization for genomics

● Genomics computation is parallelized in a small number of standard ways
  ○ by position
  ○ by sample

● Declarative, flexible partitioning schemes are useful
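The two partitioning schemes can be sketched in plain Python as key functions handed to a generic grouping step. This is only an illustration of the idea, not Spark's or ADAM's API: the record layout, `BIN_SIZE`, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical record layout: (sample, contig, start, end). Not ADAM's schema.
reads = [
    ("sampleA", "chr1", 100, 150),
    ("sampleA", "chr1", 1_200_000, 1_200_050),
    ("sampleB", "chr1", 120, 170),
    ("sampleB", "chr2", 500, 550),
]

BIN_SIZE = 1_000_000  # assumed bin width in base pairs


def by_position(read):
    """Partition key: contig plus the fixed-width bin of the start coordinate."""
    _sample, contig, start, _end = read
    return (contig, start // BIN_SIZE)


def by_sample(read):
    """Partition key: the sample the read came from."""
    return read[0]


def partition(records, key_fn):
    """Group records by a key function; the scheme is chosen by swapping key_fn."""
    parts = defaultdict(list)
    for record in records:
        parts[key_fn(record)].append(record)
    return dict(parts)


position_parts = partition(reads, by_position)
sample_parts = partition(reads, by_sample)
```

The point of the sketch is that the grouping code never changes; only the key function does. A real position partitioner would also have to decide what to do with records that span a bin boundary, which this example ignores.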

Spark can easily express genomics primitives: join by genomic overlap

1. Calculate disjoint regions based on left (blue) set

2. Partition both sets by disjoint regions

3. Merge-join within each partition

4. (Optional) aggregation across joined pairs
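The four steps above can be sketched on a single machine in plain Python. This is a hedged illustration, not ADAM's implementation: it assumes half-open `(start, end)` intervals on one contig, uses a sorted scan with early exit as a simplified stand-in for the merge-join, and all function names are invented for the example.

```python
from collections import Counter


def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into disjoint regions."""
    regions = []
    for start, end in sorted(left):
        if regions and start < regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def partition_by_region(intervals, regions):
    """Step 2: assign each interval to every region it overlaps."""
    parts = {region: [] for region in regions}
    for interval in intervals:
        for region in regions:
            if overlaps(interval, region):
                parts[region].append(interval)
    return parts


def sorted_join(lefts, rights):
    """Step 3: join within one partition, scanning start-sorted lists."""
    pairs = []
    rights = sorted(rights)
    for l in sorted(lefts):
        for r in rights:
            if r[0] >= l[1]:
                break  # rights are sorted by start, so no later one overlaps l
            if overlaps(l, r):
                pairs.append((l, r))
    return pairs


def region_join(left, right):
    """Run steps 1-3 and return all overlapping (left, right) pairs."""
    regions = disjoint_regions(left)
    left_parts = partition_by_region(left, regions)
    right_parts = partition_by_region(right, regions)
    pairs = []
    for region in regions:
        pairs.extend(sorted_join(left_parts[region], right_parts[region]))
    return pairs


left = [(0, 10), (5, 15), (30, 40)]
right = [(8, 12), (14, 31), (50, 60)]
pairs = region_join(left, right)

# Step 4 (optional): aggregate, here by counting overlaps per left interval.
overlap_counts = Counter(l for l, _r in pairs)
```

Because the regions are built from the left set, each left interval lands in exactly one partition, while a right interval that spans two regions is copied into both. Each partition can then be joined independently, which is what makes the scheme parallelizable.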

ADAM is Genomics + Spark

● Combines three technologies
  ○ Spark
  ○ Parquet
  ○ Avro

● Apache 2-licensed

● Started at the AMPLab

● A rewrite of core bioinformatics tools and algorithms in Spark

http://bdgenomics.org/



Avro and Parquet are just as critical to ADAM as Spark

● Avro to define data models

● Parquet for serialization format

● Still need to answer design questions
  ○ how wide are the schemas?
  ○ how much do we follow existing formats?
  ○ how do we carry through projections?
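To make the schema-width and projection questions concrete, here is a small sketch in plain Python. Avro schemas are written as JSON, so the schema is shown as a dictionary in that shape. The record `ExampleRead` and its fields are hypothetical and deliberately narrow; they are not ADAM's actual schema, and `project` and `apply_projection` are illustrative helpers rather than Avro or Parquet calls.

```python
# Hypothetical, deliberately narrow record in Avro's JSON schema shape.
# This is NOT ADAM's actual schema.
READ_SCHEMA = {
    "type": "record",
    "name": "ExampleRead",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "sequence", "type": ["null", "string"], "default": None},
        {"name": "mapq", "type": ["null", "int"], "default": None},
    ],
}


def project(schema, wanted):
    """Return a copy of the schema restricted to the named fields."""
    known = {field["name"] for field in schema["fields"]}
    missing = set(wanted) - known
    if missing:
        raise KeyError(f"unknown fields: {sorted(missing)}")
    fields = [field for field in schema["fields"] if field["name"] in wanted]
    return {**schema, "fields": fields}


def apply_projection(record, projected_schema):
    """Keep only the fields of a record that the projected schema names."""
    names = [field["name"] for field in projected_schema["fields"]]
    return {name: record.get(name) for name in names}


position_only = project(READ_SCHEMA, {"contig", "start", "end"})
record = {"contig": "chr1", "start": 100, "end": 150,
          "sequence": "ACGT", "mapq": 60}
projected = apply_projection(record, position_only)
```

With a columnar format such as Parquet, a projection like `position_only` means the unselected columns need not be read at all. The open design question is how a stage that only needs positions passes that narrower schema along to the stages around it.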

Still need to convince bioinformaticians to rewrite their software!

Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)

● A single piece of a single filtering stage for a somatic variant caller

● “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order

The Future: Distributed and Incremental?

● Today: 5k samples x 20 Gb / sample

● Tomorrow: 1m+ samples @ 200+ Gb / sample?

● More and more analysis is aggregative
  ○ joint variant calling
  ○ panels of normal samples
  ○ collective variant annotation

● And “data collection” will never be finished
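The scale implied by the "today" and "tomorrow" figures can be worked out directly. The sketch below keeps the slide's unit ("Gb") as written and treats the "tomorrow" numbers as lower bounds, since the slide gives them as "1m+" and "200+"; the variable names are illustrative.

```python
# Figures taken from the slide; the unit "Gb" is kept exactly as written there.
samples_today, gb_per_sample_today = 5_000, 20
samples_tomorrow, gb_per_sample_tomorrow = 1_000_000, 200  # lower bounds

total_today = samples_today * gb_per_sample_today            # in Gb
total_tomorrow = samples_tomorrow * gb_per_sample_tomorrow   # in Gb
growth = total_tomorrow // total_today

print(f"today:    {total_today:,} Gb")
print(f"tomorrow: {total_tomorrow:,} Gb (at least)")
print(f"growth:   {growth:,}x")
```

That is 100,000 Gb today against at least 200,000,000 Gb tomorrow, a growth factor of at least 2,000.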

Thank you! (questions?)

Acknowledgements

Matt Massie (AMPLab)
Frank Nothaft (AMPLab)
Carl Yeksigian (DataStax)
Anthony Philippakis (Broad Institute)
Jeff Hammerbacher (Cloudera / Mt. Sinai)
