c4bio paper talk
DESCRIPTION
Presented at the C4Bio workshop, May 26th 2014, Chicago
TRANSCRIPT
From scripted HPC-based NGS pipelines to workflows on the cloud
Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier
School of Computing Science and Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
C4Bio workshop @CCGrid 2014
Chicago, May 26th, 2014
The Cloud-e-Genome project
NGS data processing: provide mechanisms to rapidly and flexibly create new exome sequence data processing pipelines, and to deploy them in a scalable way.
Cost, scalability, flexibility.
Data to insight. Human variant interpretation for clinical diagnosis: provide clinicians with a tool for analysis and interpretation of human variants.
• 2-year pilot project
• Funded by the UK's National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
• Nov. 2013: cloud resources from an Azure for Research Award
• 1 year's worth of data/network/computing resources
Challenge:
to deliver the benefits of WES/WGS technology to clinical practice
Key technical goals
• Scalability
  • In the rate and number of patient sequence submissions
  • In the density of sequence data (from whole exome to whole genome)
• Flexibility, traceability, comparability across versions
  • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
  • Trace each version and its executions
  • Ability to compare results obtained using different pipelines and reason about the differences
• Openness: simplify the process of adding
  • New variant analysis tools
  • New statistical methods for variant filtering, selection, and ranking
  • Integration with third-party databases
Approach and testbed
Technical approach: double porting
• Infrastructure: HPC cluster to cloud (IaaS)
• Implementation: NGS pipelines from scripts to workflow
• Implement user tools for clinical diagnosis as cloud apps (SaaS)
Testbed and scale:
• Neurological patients from the North-East of England, with a focus on rare diseases
• Initial testing on about 300 sequences
• 2,500-3,000 sequences expected within 12 months
Why port to workflow?
• Programming
  • Workflows provide better abstraction in the specification of pipelines
  • Workflows are directly executable by the enactment engine
  • Easier to understand, share, and maintain over time
  • Flexible: relatively easy to introduce variations
• System: minimal installation/deployment requirements
  • Fewer dedicated technical staff hours required
  • Automated dependency management, packaging, deployment
• Extensible by wrapping new tools
• Exploits available data parallelism (but not automagically)
• Reproducibility
• Execution monitoring, provenance collection
  • The persisted trace serves as evidence for the data
  • Amenable to automated analysis
Scripted pipeline
Pipeline steps (from the diagram):
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination (Picard tools)
• Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations
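In the scripted version these steps were chained from a submission script. The following minimal Python sketch shows that style of chaining; tool versions, file names, paths and command-line options are illustrative assumptions, not the project's actual script, and several steps (recalibration, coverage, VCF subsetting) are omitted for brevity.

# Illustrative sketch of a scripted NGS pipeline: each step invokes an
# external tool in sequence. Commands are simplified examples only.
import subprocess

def run(cmd):
    print(">>", cmd)
    subprocess.run(cmd, shell=True, check=True)

sample = "patient01"
ref = "hg19.fa"

# 1. Alignment against the HG19 reference with BWA
run(f"bwa mem -t 2 {ref} {sample}_R1.fq {sample}_R2.fq > {sample}.sam")

# 2. Cleaning and duplicate elimination with Picard tools
run(f"java -jar picard.jar SortSam I={sample}.sam O={sample}.bam SORT_ORDER=coordinate")
run(f"java -jar picard.jar MarkDuplicates I={sample}.bam O={sample}.dedup.bam M={sample}.metrics")

# 3. Variant calling with the GATK haplotype caller (GATK 3.x style command line)
run(f"java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R {ref} "
    f"-I {sample}.dedup.bam -o {sample}.vcf")

# 4. Functional annotation with Annovar, followed by in-house annotations
#    (real Annovar runs need additional protocol/operation options; simplified here)
run(f"table_annovar.pl {sample}.vcf humandb/ -buildver hg19 -out {sample}.annot")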
From scripts to workflows
Workflow nesting
Pipeline evolution
Pipeline: a set C = { c1 … cn } of components (tool wrappers).
Each ci has a configuration conf(ci) and a version v(ci).
What can change (illustrated in the sketch below):
1. Tool version: v(ci) → v'(ci)
2. Tool replacement / addition / removal: ci → c'i
3. Configuration parameters: conf(ci) → conf'(ci)
…and why:
• Technology / algorithm evolution, e.g. the traditional GATK variant caller replaced by the GATK haplotype caller
  • Does the interface change?
  • Do the operational assumptions change? E.g. the GATK Variant Recalibrator requires large input data and is not suitable for targeted sequencing.
(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in bioinformatics, pp. bbs086–, Jan. 2013
For sequence alignment alone, Pabinger et al. in their survey (*) list 17 aligners, while for variant annotation they reference over 70 tools.
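As a concrete illustration of the component model above, here is a minimal Python sketch (class and field names are hypothetical) that represents a pipeline as a set of versioned, configured components and shows the three kinds of change.

# Minimal sketch of the pipeline model: a pipeline is a set of components
# (tool wrappers), each with a version v(ci) and a configuration conf(ci).
# Names and version numbers are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str                                  # e.g. "BWA", "GATK-HaplotypeCaller"
    version: str                               # v(ci)
    conf: dict = field(default_factory=dict)   # conf(ci)

pipeline_v1 = {
    "align":    Component("BWA", "0.6.2", {"threads": 2}),
    "call":     Component("GATK-UnifiedGenotyper", "2.7"),
    "annotate": Component("Annovar", "2013-08", {"maf_cutoff": 0.01}),
}

pipeline_v2 = dict(pipeline_v1)

# 1 - Tool version change: v(ci) -> v'(ci)
pipeline_v2["align"] = Component("BWA", "0.7.5", {"threads": 2})

# 2 - Tool replacement: ci -> c'i (traditional caller -> haplotype caller)
pipeline_v2["call"] = Component("GATK-HaplotypeCaller", "2.7")

# 3 - Configuration parameter change: conf(ci) -> conf'(ci)
pipeline_v2["annotate"] = Component("Annovar", "2013-08", {"maf_cutoff": 0.05})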
Role of provenance
Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
• Provenance is evidence in support of clinical diagnosis
  1. Why do these variants appear in the output list?
  2. Why have you concluded they are disease-causing?
• Requires the ability to trace variants through workflow execution
  • Simple scripting lacks this functionality
“Where do these variants come from?”
“Why do these results differ?”
Comparing results across pipeline configurations
Run pipeline version V1 → variant list VL1.
V1 → V2: replace the BWA version, modify Annovar configuration parameters.
Run pipeline version V2 → variant list VL2.
How do VL1 and VL2 differ, and why?
Two complementary comparisons: DDIFF (data differencing) and PDIFF (provenance differencing).
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
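To make the data-differencing (DDIFF) side concrete, here is a naive Python sketch that compares two variant lists keyed by (chromosome, position, ref, alt). It is an illustration only, not the method from the cited paper, and ignores multi-allelic sites and variant normalisation.

# Naive sketch of data differencing (DDIFF) between two variant lists.
def variant_key(line):
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    return (chrom, pos, ref, alt)

def load_vcf(path):
    with open(path) as f:
        return {variant_key(l) for l in f if l.strip() and not l.startswith("#")}

def ddiff(vcf1, vcf2):
    vl1, vl2 = load_vcf(vcf1), load_vcf(vcf2)
    return {
        "only_in_VL1": vl1 - vl2,   # variants lost when moving from V1 to V2
        "only_in_VL2": vl2 - vl1,   # variants gained
        "common":      vl1 & vl2,
    }

# Example: compare the outputs of pipeline versions V1 and V2
# report = ddiff("sample.v1.vcf", "sample.v2.vcf")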
PDIFF - overview
[Diagram: two workflow versions, WA and WB, whose executions are to be compared]
The corresponding provenance traces
Delta graph computed by PDIFF
PDIFF helps determine the impact of variations in the pipeline
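A rough Python sketch of the intuition behind PDIFF follows: walk the two provenance traces backwards from a diverging output and collect the points where the traces differ. The trace representation and the logic are illustrative assumptions, not the published algorithm.

# Rough sketch of provenance differencing: a trace maps each data item to
# (activity descriptor, [input data items]). Starting from a diverging
# output, walk both traces backwards and record where they disagree.
def pdiff(trace_a, trace_b, output, delta=None):
    delta = delta if delta is not None else []
    step_a, step_b = trace_a.get(output), trace_b.get(output)
    if step_a is None and step_b is None:
        return delta                      # reached raw inputs on both sides
    if step_a != step_b:
        # the activity (tool, version, configuration) or its inputs differ here
        delta.append((output, step_a, step_b))
    inputs_a = step_a[1] if step_a else []
    inputs_b = step_b[1] if step_b else []
    for data_item in set(inputs_a) | set(inputs_b):
        pdiff(trace_a, trace_b, data_item, delta)
    return delta

# Example traces (hypothetical):
# trace_a = {"vcf": (("GATK-UnifiedGenotyper", "2.7"), ["bam"]),
#            "bam": (("BWA", "0.6.2"), ["fastq"])}
# trace_b = {"vcf": (("GATK-HaplotypeCaller", "2.7"), ["bam"]),
#            "bam": (("BWA", "0.6.2"), ["fastq"])}
# pdiff(trace_a, trace_b, "vcf")  -> delta recorded at "vcf"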
HPC Cluster configuration
16 compute nodes, 48/96 GB RAM and 250 GB disk each
19 TB usable storage space
Gigabit Ethernet
Shared resource for Institute-wide research
Submission script specifies node / core requirements
Computation waits until resources are available
Current config:
• BWA alignment: 2 cores
• GATK: 8 cores
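For illustration, a submission wrapper along the following lines could request these per-step core counts; the SGE-style directives and the qsub command are assumptions, not the Institute's actual submission script.

# Illustrative sketch of requesting per-step core counts from the cluster
# scheduler. Scheduler flags (SGE-style "-pe smp N") and commands are
# assumptions for illustration only.
import subprocess, textwrap

def submit(job_name, cores, command):
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #$ -N {job_name}
        #$ -pe smp {cores}
        #$ -cwd
        {command}
    """)
    with open(f"{job_name}.sh", "w") as f:
        f.write(script)
    # the job waits in the queue until the requested cores become available
    subprocess.run(["qsub", f"{job_name}.sh"], check=True)

# Current configuration from the slide:
submit("bwa_align", 2, "bwa mem -t 2 hg19.fa sample_R1.fq sample_R2.fq > sample.sam")
submit("gatk_call", 8, "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I sample.bam -o sample.vcf")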
The case for cloud in genome informatics (*)
(*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome Biology 11, no. 5 (January 2010): 207.
• Storage and computing resources co-located in a cloud
• Privacy issues
  • Public, private, or hybrid
• Fluctuating demand benefits from elasticity
• Web-based access for clinicians simplifies adoption
Workflow on Azure Cloud - configuration
[Deployment diagram]
• <<Azure VM>>: Azure Blob store, e-SC db backend
• <<Azure VM>>: e-Science Central main server (JMS queue, REST API, Web UI)
• Clients: web browser, rich client app, submitting workflow invocations
• 3 x <<worker role>>: Workflow engine, with e-SC blob store
• Data flows: workflow invocations, e-SC control data, workflow data
• Top-level workflow and sub-workflows run on the workflow engines
Test configuration: 3 nodes, 24 cores
Workflow and sub-workflows execution
[Workflow diagram: executable blocks and sub-workflow blocks; sub-workflow invocations are sent to the e-SC queue. Same deployment diagram as on the previous slide.]
Workflow invocation executing on one engine (fragment)
Multi-sample processing
Sample list [S1 … Sk] → top-level workflow → variant files [VCF1 … VCFk].
Map semantics: push K new workflow invocations to the e-SC queue, e.g. BWA (S1), BWA (S2), … (sketched below).
[Same deployment diagram as before.]
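A minimal Python sketch of this map semantics, with a plain in-process queue standing in for the e-SC JMS queue; all names are hypothetical and the "engines" here run sequentially rather than as separate worker roles.

# Minimal sketch of "map" semantics: the top-level workflow pushes one
# sub-workflow invocation per input sample onto the queue, and the worker
# engines pick them up independently.
import queue

invocation_queue = queue.Queue()

def map_block(samples, sub_workflow):
    """Push K new sub-workflow invocations, one per sample."""
    for s in samples:
        invocation_queue.put((sub_workflow, s))

def workflow_engine(engine_id):
    """Each engine repeatedly takes the next invocation off the queue."""
    while not invocation_queue.empty():
        sub_workflow, sample = invocation_queue.get()
        print(f"engine {engine_id}: running {sub_workflow} on {sample}")
        # ... run the alignment/calling sub-workflow for this sample, produce a VCF ...

map_block(["S1", "S2", "S3"], "align-and-call")
for e in range(3):          # e.g. three worker-role engines
    workflow_engine(e)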
Sub-workflows enqueued recursively
• Exec block: specifies threading; the OS maps threads to available cores
• Sub-workflow: gets added to the queue
• Sub-workflow: one instance gets added to the queue for each input sample
Preliminary cost estimates
2 samples, 1 x 8-core node, 30 h @ £0.821/h = £12.3 / sample
6 samples, 3 x 8-core nodes, 47 h @ £2.5/h = £19 / sample
Cloud deployment makes cost easy to calculate; the cost model is based on node uptime.
Trade-off:
• Better flexibility and scalability
• But loss of performance
Some tuning required:
• Better node utilization: larger sample batches
• Remove unnecessary wait time: make sub-workflows asynchronous
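The per-sample figures follow directly from node uptime; a small sketch of that calculation, using the rates and times from this slide:

# Cost model based on node uptime:
# cost per sample = (hourly rate for the reserved nodes * wall-clock hours) / batch size
def cost_per_sample(samples, hourly_rate_gbp, hours):
    return hourly_rate_gbp * hours / samples

print(cost_per_sample(2, 0.821, 30))   # ~12.3 GBP/sample on 1 x 8-core node
print(cost_per_sample(6, 2.5, 47))     # ~19.6 GBP/sample on 3 x 8-core nodes (shown as £19 on the slide)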
Summary
• Whole-exome sequence processing on a cloud infrastructure (Windows Azure, project sponsor)
• Tracking provenance as evidence and for change analysis
• Porting HPC scripted pipeline to workflow model and technology
• Scalability, Flexibility, Evolvability