building cloud-enabled genomics workflows with luigi and docker

55
Building cloud-enabled cancer genomics workflows with Luigi and Docker Jake Feala, PhD Principal Scientist, Bioinformatics @ Caperna Founder @ Outlier Bio

Upload: jacob-feala

Post on 16-Apr-2017

1.582 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Building cloud-enabled genomics workflows with Luigi and Docker

Building cloud-enabled cancer genomics workflows with Luigi and DockerJake Feala, PhDPrincipal Scientist, Bioinformatics @ CapernaFounder @ Outlier Bio

Page 2: Building cloud-enabled genomics workflows with Luigi and Docker

Building cloud-enabled cancer genomics workflows with Luigi and DockerJake Feala, PhDPrincipal Scientist, Bioinformatics @ CapernaFounder @ Outlier Bio

Page 3: Building cloud-enabled genomics workflows with Luigi and Docker

Outline

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*Application:

Confirming cancer mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS** tools

Copyright © 2016 Outlier Bio

Page 4: Building cloud-enabled genomics workflows with Luigi and Docker

data >> algorithms

Copyright © 2016 Outlier Bio

Page 5: Building cloud-enabled genomics workflows with Luigi and Docker

When in doubt, just sequence it!

Plummeting sequencing costs

+

Powerful insights

= Mounds of data

Copyright © 2016 Outlier Bio

Page 6: Building cloud-enabled genomics workflows with Luigi and Docker

managing big genomics data is hard

Copyright © 2016 Outlier Bio

Page 7: Building cloud-enabled genomics workflows with Luigi and Docker

Pipelines

What you show your boss

AlignFastQ BAM

What you actually do

Align each lane

Merge SAM

Build index

Update IGV registry

Index BAM

SAM BAM

Sort BAM

Copyright © 2016 Outlier Bio

Page 8: Building cloud-enabled genomics workflows with Luigi and Docker

Pipelines

What you show your boss

Align

And sometimes….

FastQ BAM

What you actually do

Align each lane

Merge SAM

Build index

Update IGV registry

Index BAM

SAM BAM

Sort BAM

Copyright © 2016 Outlier Bio

Page 9: Building cloud-enabled genomics workflows with Luigi and Docker

Pipelines

What you show your boss

Align

And sometimes….

FastQ BAM

What you actually do

Align each lane

Merge SAM

Build index

Update IGV registry

Index BAM

SAM BAM

Sort BAM

Copyright © 2016 Outlier Bio

And BTW these files are massive

Page 10: Building cloud-enabled genomics workflows with Luigi and Docker

Capability = automated pipeline*

• Pipelines are at the core of a bioinformatics group

• Best way to show a capability is a complete pipeline– Any suitable input quality-assured output– Sure, you may be able to do a certain analysis,

but a working, tested pipeline proves it

• Lots of tools are available, but putting them together correctly into a pipeline is hard

*”workflow” is synonymous with “pipeline” for the purposes of this talk

Lots of possible input sources

and types

Lots of possible downstream

analyses

Core capability

Copyright © 2016 Outlier Bio

Page 11: Building cloud-enabled genomics workflows with Luigi and Docker

good data science requires good data engineering

Copyright © 2016 Outlier Bio

Page 12: Building cloud-enabled genomics workflows with Luigi and Docker

Linux commands and custom file formats are the “narrow waist” of bioinformatics

Copyright © 2016 Outlier Bio

This constraint underlies the design of bioinformatics pipelines

Page 13: Building cloud-enabled genomics workflows with Luigi and Docker

This works great for tool developers…

Copyright © 2016 Outlier Bio

Page 14: Building cloud-enabled genomics workflows with Luigi and Docker

…but quickly gets complicated for users

Common homegrown solution:- Bash scripts- VMs (AMIs)- StarCluster or similar- SunGrid Engine or similar- Parallelize by sample

Copyright © 2016 Outlier Bio

B

Page 15: Building cloud-enabled genomics workflows with Luigi and Docker

Outline

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*Application:

Confirming cancer mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS tools

Copyright © 2016 Outlier Bio

Page 16: Building cloud-enabled genomics workflows with Luigi and Docker

Limitations with commonly used technology

• Compute– SGE is flat, just a job scheduler, bash only, no dependencies

• Environment– AMIs and VMs are slow and heavyweight, no code, not good for

ongoing dev• Workflow

– Makefiles have ugly syntax, static, inflexible– GUIs and domain-specific systems like Galaxy/Taverna are not

easily programmable or general-purpose

Copyright © 2016 Outlier Bio

Page 17: Building cloud-enabled genomics workflows with Luigi and Docker

Properties of the ideal pipeline system

• General purpose: familiar language, can apply to any task

• Modular: any language, well-tested components with tight APIs

• Scalable: parallelize for free, independent of components

• Integrated: LIMS, metadata, viz, versioning, reporting

• Versioned: reproduce from snapshots in time

• Idempotent: resume from failure, guarantee outputs

Copyright © 2016 Outlier Bio

Page 18: Building cloud-enabled genomics workflows with Luigi and Docker

Outline

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*Application:

Confirming cancer mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS tools

Copyright © 2016 Outlier Bio

Page 19: Building cloud-enabled genomics workflows with Luigi and Docker

Emerging best practice is to containerize and decompose into atomic steps

Copyright © 2016 Outlier Bio

Page 20: Building cloud-enabled genomics workflows with Luigi and Docker

But atomicity and scaling bring added complexity

Copyright © 2016 Outlier Bio

Page 21: Building cloud-enabled genomics workflows with Luigi and Docker

Everyone is doing this same basic architecture

Copyright © 2016 Outlier Bio

Page 22: Building cloud-enabled genomics workflows with Luigi and Docker

That is, everyone except for ADAM/Spark. They are shifting the paradigm

• Pipeline = sequence of transformations on a data model, not a string of shell commands

• Separate method from implementation• Distribute records (i.e. alignments) for horizontal scaling• Needs more community tools before adoption

Fig 1, Nothaft et al. (2015) Rethinking Data-Intensive Science Using Scalable Analytics Systems

Copyright © 2016 Outlier Bio

https://bdgenomics.org

Page 23: Building cloud-enabled genomics workflows with Luigi and Docker

Outline

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*Application:

Confirming cancer mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS tools

Copyright © 2016 Outlier Bio

Page 24: Building cloud-enabled genomics workflows with Luigi and Docker

A free alternative architecture

Components• Storage: S3• Environment: Docker• Workflows: Luigi• Compute: EC2 auto-scaling groups• Infrastructure: to put the pieces together

Benefits• Free (almost)• Huge communities, battle tested• Separate concerns, use best tool for each job

Copyright © 2016 Outlier Bio

Page 25: Building cloud-enabled genomics workflows with Luigi and Docker

S3 and object storage

Steps:1. Learn how object store is different from a filesystem2. Write a bit of extra code to manage transfers to/from S33. Never think about long-term storage again*

*within reason, if IT director is in the room.Copyright © 2016 Outlier Bio

Page 26: Building cloud-enabled genomics workflows with Luigi and Docker

Docker

Copyright © 2016 Outlier BioFrom https://www.docker.com/what-docker

• Lightweight environment management• Specify environment as code (Dockerfile)• Portable (laptop cluster with same behavior)• Wide adoption in tech• Gaining ground in bioinformatics

Page 27: Building cloud-enabled genomics workflows with Luigi and Docker

Luigi

• Like a Makefile but in pure Python*

• Executes locally => easy to debug

• OOP declarative framework with Tasks and Targets– Tasks are Python objects, they can do anything– Targets are Python objects, they can be anything

• Idempotent (resume where you left off) and atomic (no half-finished tasks)

• Batteries included (graph viz, CLI integration, S3, MySQL, Hadoop, Spark, Redshift, SGE, …

*i.e., not another DSL

What’s a Makefile?• Build DAG of tasks• Each task specifies dependencies and output• Start with what you want to build, work up

Copyright © 2016 Outlier Bio

Page 28: Building cloud-enabled genomics workflows with Luigi and Docker

Luigi is flexible enough to do pretty much anything within a workflow

S3

EC2/Docker

Python code

S3

Database

Hadoop/Spark

Local client

Local file

storage

compute

data flows

Copyright © 2016 Outlier Bio

Local file

Manual operation!

Page 29: Building cloud-enabled genomics workflows with Luigi and Docker

Luigi is flexible enough to do pretty much anything within a workflow

S3

EC2/Docker

Python code

S3

Database

Hadoop/Spark

Local client

Local file

storage

compute

data flows

Copyright © 2016 Outlier Bio

Local file

Manual operation!

luigi.Target

luigi.Task

luigi.Task.requires

API

Page 30: Building cloud-enabled genomics workflows with Luigi and Docker

Anatomy of a Luigi Task

Copyright © 2016 Outlier Bio

Page 31: Building cloud-enabled genomics workflows with Luigi and Docker

Anatomy of a Luigi Task

Copyright © 2016 Outlier Bio

Build your filepaths in Task.output

Declare your parameters (can be passed from CLI)

Do the thingRead the input

Write the output

Page 32: Building cloud-enabled genomics workflows with Luigi and Docker

Anatomy of a Luigi Task

Copyright © 2016 Outlier Bio

Now output to S3 instead

Page 33: Building cloud-enabled genomics workflows with Luigi and Docker

Anatomy of a Luigi Task

Copyright © 2016 Outlier Bio

Run a Docker container instead

Page 34: Building cloud-enabled genomics workflows with Luigi and Docker

Outline

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*Application:

Confirming cancer mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS tools

Copyright © 2016 Outlier Bio

Page 35: Building cloud-enabled genomics workflows with Luigi and Docker

Some assembly required

Copyright © 2016 Outlier Bio

ECS

EC2 auto-scaling group

Local client

Container registry

EC2

DockerECS agent

CloudWatch

Luigi

EC2

DockerECS agent

SQS

SGE

But lots of ways to do it!

Page 36: Building cloud-enabled genomics workflows with Luigi and Docker

Outline

Application: Confirming cancer

mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS tools

Copyright © 2016 Outlier Bio

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*

Page 37: Building cloud-enabled genomics workflows with Luigi and Docker

• Problem:– Calling somatic cancer mutations is difficult

• Low allele frequencies• Complex variants• Purity, ploidy issues

– RNA-seq can add confidence, but• TCGA does not provide RNA-seq variants• Manual review is slow• Automating and scaling requires a complex, custom workflow

An interesting* cancer genomics application

Copyright © 2016 Outlier Bio*to me

Page 38: Building cloud-enabled genomics workflows with Luigi and Docker

Solution: RNA-seq mutation validation pipeline

Copyright © 2016 Outlier Bio

CGHub

GDAC portal

Note: this workflow should soon be feasible with the NCI cloud pilot platforms

Download RNA-seq

Download somatic

mutations

RNA-seq(BAM)

DNA variants

(VCF)

Extract RNA-seq regions

Local region(BAM) MuTect

RNA variants

(VCF)

Merge RNA/DNA variants

for each variant

for each patient

Merge patient variantsMerge cohort

variants

Page 39: Building cloud-enabled genomics workflows with Luigi and Docker

Dockerizing the apps

Copyright © 2016 Outlier Bio

GTDownload ./bioit/apps/gtdownload/Dockerfile

Samtools ./bioit/apps/samtools/Dockerfile

See https://github.com/outlierbio/bio-it for full project

$ docker build –t outlierbio/bioit .

Base image ./Dockerfile

$ cd bioit/apps/gtdownload$ docker build –t outlierbio/gtdownload .

$ cd ../samtools$ docker build –t outlierbio/samtools .

Page 40: Building cloud-enabled genomics workflows with Luigi and Docker

Luigi-izing the workflow

Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project

Page 41: Building cloud-enabled genomics workflows with Luigi and Docker

Luigi-izing the workflow

Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project

Page 42: Building cloud-enabled genomics workflows with Luigi and Docker

Luigi-izing the workflow

Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project

Page 43: Building cloud-enabled genomics workflows with Luigi and Docker

Run it!

Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project

$ luigi --module bioit.validate_rnaseq RunSample --sample-id=TCGA-AB-2929

Page 44: Building cloud-enabled genomics workflows with Luigi and Docker

Thanks! Questions?

Pipelines

Requirements for the ideal workflow

platform

ChallengesPractical, free alternatives

Convergence of cloud bioinformatics

platforms

Luigi

Docker

S3*Application:

Confirming cancer mutations in RNA-seq

*Not technically free but nearly**FOSS = Free and Open Source Software

Some enabling infrastructure

Takeaways: 1. Engineer your workflows2. Everyone is converging on architecture3. You can build scalable pipelines with FOSS tools

Copyright © 2016 Outlier Bio

Further reading• My blog: https://medium.com/outlier-bio-blog• GitHub repo for this talk:

https://github.com/outlierbio/bioit • Luigi: https://github.com/spotify/luigi• Docker: https://docker.com• Amazon AWS: http://lmgtfy.com/?q=aws

Page 45: Building cloud-enabled genomics workflows with Luigi and Docker

Extra slides

Copyright © 2016 Outlier Bio

Page 46: Building cloud-enabled genomics workflows with Luigi and Docker

More on reproducibility: 3 major components and some possible solutions

• Code– VCS– Notebooks

• Data– Metadata store– S3– Synapse– Versioned data APIs

• Environment– Package managers (pip –r)– VM– Docker

Copyright © 2016 Outlier Bio

Page 47: Building cloud-enabled genomics workflows with Luigi and Docker

The slide about how this is kind of like functional programming

function Immutable output

• No side effects (pure)• Stateless• Same output for every input• Compose complex, scalable functionality from small components

Immutable input

containerized data

transformation

Immutable S3 object

Immutable S3 object

Copyright © 2016 Outlier BioSee http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html for more on this idea

Page 48: Building cloud-enabled genomics workflows with Luigi and Docker

• Specify environment as code (Dockerfile)

• A little like git– commits (each with uuid hash)– pull– push

• Isolated like a VM

• Run like an app– Takes arguments– CLI allows interactive use– Hop inside the container to debug

Docker (user perspective)

Copyright © 2016 Outlier Bio

Page 49: Building cloud-enabled genomics workflows with Luigi and Docker

EC2 auto-scaling groups

Copyright © 2016 Outlier BioFrom http://docs.aws.amazon.com/AutoScaling/

Page 50: Building cloud-enabled genomics workflows with Luigi and Docker

Dispatching pattern allows fan-in from data sources, fan-out to pipelines

Load sample BAM

CGHub download

S3 download

Local fileNCBI SRA download

CGHub extract FQSRA extract FQ

HTTP/FTP download

exomeRNA-seq

xenograftWGS

shRNA

Copyright © 2016 Outlier Bio

Page 51: Building cloud-enabled genomics workflows with Luigi and Docker

Dispatching pattern allows fan-in from data sources, fan-out to pipelines

Load sample

CGHub download

S3 download

Local fileNCBI SRA download

CGHub extract FQSRA extract FQ

HTTP/FTP download

exomeRNA-seq

xenograftWGS

shRNA

Copyright © 2016 Outlier Bio

Luigi makes this easy

Page 52: Building cloud-enabled genomics workflows with Luigi and Docker

• Base Dockerfile with AWS credentials and helpers

• Simple metadata DB to map sample IDs raw data locations– All other filepaths managed by pipeline code– Level-up: store pipeline versions, parameters, batches

• Small test dataset for constant integration tests (use Jenkins)

Practical implementation notes

Copyright © 2016 Outlier Bio

Page 53: Building cloud-enabled genomics workflows with Luigi and Docker

Challenges

• Filesystem: Still need a BIG, FAST local filesystem for NGS analysis

• Luigi has warts– Parameter passing– Delete files to re-run– Lots of code edits to rewire the DAG

Copyright © 2016 Outlier Bio

Page 54: Building cloud-enabled genomics workflows with Luigi and Docker

5-year timeline (wishlist?)

• More distributed algorithms with ADAM/Spark

• Integration with Jupyter notebooks

• Open source template for spinning up a secure cloud infrastructure

Copyright © 2016 Outlier Bio

Page 55: Building cloud-enabled genomics workflows with Luigi and Docker

Live demo???

Copyright © 2016 Outlier Bio