Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark? Timothy Danford AMPLab

Upload: timothy-danford

Post on 26-Jun-2015

DESCRIPTION

DNA sequencing is producing a wave of data that will change how drugs are developed, how patients are diagnosed, and how we understand human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics." ADAM is an open-source genomics processing engine built using Spark, Apache Avro, and Parquet. This talk discusses some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics. These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.

TRANSCRIPT

Page 1: Why is Bioinformatics a Good Fit for Spark?

Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark?

Timothy Danford

AMPLab

Page 2: Why is Bioinformatics a Good Fit for Spark?

A One-Slide Introduction to Genomics

Page 3: Why is Bioinformatics a Good Fit for Spark?

Bioinformatics computation is batch processing and workflows

● Bioinformatics has a lot of “workflow engines”
  ○ Galaxy, Taverna, Firehose, Zamboni, Queue, Luigi, bPipe
  ○ bash scripts
  ○ even make, fer cryin’ out loud
  ○ a new one every day

● Bioinformatics software development is still largely a research activity

Page 4: Why is Bioinformatics a Good Fit for Spark?

State-of-the-Art infrastructure: shared filesystems, handwritten parallelism

● Hand-written task creation

● File formats instead of APIs or data models
  ○ formats are poorly defined
  ○ contain optional or redundant fields
  ○ semantics are unclear

● Workflow engines can’t take advantage of common parallelism between stages

Page 5: Why is Bioinformatics a Good Fit for Spark?

Page 6: Why is Bioinformatics a Good Fit for Spark?

So, why Spark?

Page 7: Why is Bioinformatics a Good Fit for Spark?

Most of Genomics is 1-D Geometry

Page 8: Why is Bioinformatics a Good Fit for Spark?

Most of Genomics is 1-D Geometry

Page 9: Why is Bioinformatics a Good Fit for Spark?

The rest is iterative evaluation of probabilistic models!

Page 10: Why is Bioinformatics a Good Fit for Spark?

Spark RDDs and Partitioners allow declarative parallelization for genomics

● Genomics computation is parallelized in a small, standard number of ways
  ○ by position
  ○ by sample

● Declarative, flexible partitioning schemes are useful
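
The two partitioning schemes named on this slide can be sketched in plain Python. This is only an illustration of the idea, not Spark or ADAM code: the bin width, the record layout, and every function name below are assumptions made for the example. In Spark itself, logic like this would typically sit inside a custom Partitioner.

```python
import zlib
from collections import defaultdict

BIN_SIZE = 1_000_000  # hypothetical bin width in base pairs


def stable_hash(text):
    """Deterministic string hash (Python's built-in hash() varies per process)."""
    return zlib.crc32(text.encode("utf-8"))


def position_key(contig, start, bin_size=BIN_SIZE):
    """Key a record by contig and fixed-width genomic bin."""
    return (contig, start // bin_size)


def partition_by_position(key, num_partitions):
    """Map a (contig, bin) key to a partition index."""
    contig, bin_index = key
    return (stable_hash(contig) + bin_index) % num_partitions


def partition_by_sample(sample_id, num_partitions):
    """Map a sample identifier to a partition index."""
    return stable_hash(sample_id) % num_partitions


def group_records(records, key_fn, partition_fn, num_partitions):
    """Distribute records into partitions using the chosen scheme."""
    partitions = defaultdict(list)
    for record in records:
        index = partition_fn(key_fn(record), num_partitions)
        partitions[index].append(record)
    return dict(partitions)


# Hypothetical reads; only the fields needed for partitioning are shown.
reads = [
    {"sample": "s1", "contig": "chr1", "start": 100},
    {"sample": "s2", "contig": "chr1", "start": 999_999},
    {"sample": "s1", "contig": "chr1", "start": 1_500_000},
    {"sample": "s2", "contig": "chr2", "start": 42},
]

by_position = group_records(
    reads, lambda r: position_key(r["contig"], r["start"]), partition_by_position, 4
)
by_sample = group_records(reads, lambda r: r["sample"], partition_by_sample, 4)
```

Switching from one scheme to the other changes only the key function and the partition function, which is the sense in which the partitioning is declarative.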

Page 11: Why is Bioinformatics a Good Fit for Spark?

Spark can easily express genomics primitives: join by genomic overlap

1. Calculate disjoint regions based on left (blue) set

2. Partition both sets by disjoint regions

3. Merge-join within each partition

4. (Optional) aggregation across joined pairs
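
The four steps above can be sketched on a single machine in plain Python. This is a minimal illustration under stated assumptions, not ADAM's implementation: intervals are half-open (start, end) tuples on one contig, and all function names are made up for the example. In a Spark version, each disjoint region would correspond to a partition key.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into sorted, disjoint regions."""
    regions = []
    for start, end in sorted(left):
        if regions and start < regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def overlapping_region_ids(starts, ends, interval):
    """Indices of the disjoint regions that overlap a non-empty interval."""
    lo = bisect_right(ends, interval[0])   # first region ending after interval start
    hi = bisect_left(starts, interval[1])  # first region starting at/after interval end
    return range(lo, hi)


def merge_join(lefts, rights):
    """Step 3: sweep two sorted lists and emit every overlapping (left, right) pair."""
    rights = sorted(rights)
    pairs, active, j = [], [], 0
    for l in sorted(lefts):
        # Admit right intervals that start before this left interval ends.
        while j < len(rights) and rights[j][0] < l[1]:
            active.append(rights[j])
            j += 1
        # Drop right intervals that end at or before this left interval starts.
        active = [r for r in active if r[1] > l[0]]
        pairs.extend((l, r) for r in active if r[0] < l[1])
    return pairs


def overlap_join(left, right):
    regions = disjoint_regions(left)
    starts = [s for s, _ in regions]
    ends = [e for _, e in regions]
    # Step 2: partition both sets by region. A left interval falls in exactly
    # one region; a right interval is copied to every region it overlaps.
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for interval in left:
        for i in overlapping_region_ids(starts, ends, interval):
            left_parts[i].append(interval)
    for interval in right:
        for i in overlapping_region_ids(starts, ends, interval):
            right_parts[i].append(interval)
    pairs = []
    for i in sorted(left_parts):
        pairs.extend(merge_join(left_parts[i], right_parts.get(i, [])))
    return pairs


left = [(0, 10), (5, 15), (20, 30)]
right = [(8, 12), (14, 22), (40, 50)]
pairs = overlap_join(left, right)

# Step 4 (optional): aggregate, here by counting overlaps per left interval.
overlap_counts = Counter(l for l, _ in pairs)
```

Because every left interval belongs to exactly one region, each overlapping pair is produced once even though a right interval may be replicated across regions.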

Page 12: Why is Bioinformatics a Good Fit for Spark?

ADAM is Genomics + Spark

● Combines three technologies
  ○ Spark
  ○ Parquet
  ○ Avro

● Apache 2-licensed

● Started at the AMPLab

● A rewrite of core bioinformatics tools and algorithms in Spark

http://bdgenomics.org/

Page 13: Why is Bioinformatics a Good Fit for Spark?

Avro and Parquet are just as critical to ADAM as Spark

● Avro to define data models

● Parquet for serialization format

● Still need to answer design questions
  ○ how wide are the schemas?
  ○ how much do we follow existing formats?
  ○ how do we carry through projections?
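
To make the schema-width and projection questions concrete, here is a small Python sketch. The record definition follows the general JSON shape of an Avro schema, but the record name and its fields are hypothetical and are not ADAM's actual schema. The column layout is a plain dictionary standing in for a columnar format such as Parquet.

```python
import json

# Hypothetical, deliberately narrow record definition in Avro's JSON style.
READ_SCHEMA = {
    "type": "record",
    "name": "ExampleRead",  # made-up name, not an ADAM type
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "sequence", "type": "string"},
        {"name": "sampleId", "type": ["null", "string"], "default": None},
    ],
}


def to_columns(records, schema):
    """Store records column by column, one list per schema field."""
    names = [field["name"] for field in schema["fields"]]
    return {name: [record.get(name) for record in records] for name in names}


def project(columns, wanted):
    """Projection: touch only the requested columns."""
    return {name: columns[name] for name in wanted}


records = [
    {"contig": "chr1", "start": 100, "end": 200, "sequence": "ACGT", "sampleId": "s1"},
    {"contig": "chr1", "start": 150, "end": 250, "sequence": "TTGA"},
]

schema_text = json.dumps(READ_SCHEMA, indent=2)  # what a schema file would contain
columns = to_columns(records, READ_SCHEMA)
positions = project(columns, ["contig", "start", "end"])
```

A wider schema adds fields to the record definition, while a projection such as `positions` lets a position-only computation skip the sequence column entirely.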

Page 14: Why is Bioinformatics a Good Fit for Spark?

Still need to convince bioinformaticians to rewrite their software!

Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)

Page 15: Why is Bioinformatics a Good Fit for Spark?

Still need to convince bioinformaticians to rewrite their software!

● A single piece of a single filtering stage for a somatic variant caller

● “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order

Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)

Page 16: Why is Bioinformatics a Good Fit for Spark?

The Future: Distributed and Incremental?

● Today: 5k samples x 20 Gb / sample

● Tomorrow: 1m+ samples @ 200+ Gb / sample?

● More and more analysis is aggregative
  ○ joint variant calling
  ○ panels of normal samples
  ○ collective variant annotation

● And “data collection” will never be finished
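
The totals implied by the first two bullets can be worked out directly. The sketch below keeps the slide's "Gb" unit as written and assumes decimal prefixes (1 Tb = 1,000 Gb, 1 Pb = 1,000,000 Gb). Since the slide says "1m+" and "200+", the second figure is a lower bound.

```python
GB_PER_TB = 1_000      # assumes decimal prefixes
GB_PER_PB = 1_000_000


def total_gb(samples, gb_per_sample):
    """Total data volume for a cohort, in the slide's Gb unit."""
    return samples * gb_per_sample


today_gb = total_gb(5_000, 20)            # "Today" bullet
tomorrow_gb = total_gb(1_000_000, 200)    # "Tomorrow" bullet, lower bound

print(today_gb / GB_PER_TB, "Tb")         # 100.0 Tb
print(tomorrow_gb / GB_PER_PB, "Pb")      # 200.0 Pb
print(tomorrow_gb // today_gb)            # 2000
```

Under these assumptions the projected volume is at least 2,000 times today's.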

Page 17: Why is Bioinformatics a Good Fit for Spark?

Thank you! (questions?)

Acknowledgements

Matt Massie (AMPLab)
Frank Nothaft (AMPLab)
Carl Yeksigian (DataStax)
Anthony Philippakis (Broad Institute)
Jeff Hammerbacher (Cloudera / Mt. Sinai)
