c4bio paper talk
DESCRIPTION
Presented at the C4Bio workshop, May 26th 2014, Chicago
TRANSCRIPT
From scripted HPC-based NGS pipelines to workflows on the cloud
Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier
School of Computing Science and Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK
C4Bio workshop @CCGrid 2014
Chicago, May 26th, 2014
The Cloud-e-Genome project
NGS data processing: provide mechanisms to rapidly and flexibly create new exome sequence data processing pipelines, and to deploy them in a scalable way.
Cost, scalability, flexibility.
Data to insight. Human variant interpretation for clinical diagnosis: provide clinicians with a tool for analysis and interpretation of human variants.
• 2-year pilot project
• Funded by the UK's National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
• Nov. 2013: cloud resources from an Azure for Research Award
• 1 year's worth of data/network/computing resources
Challenge:
to deliver the benefits of WES/WGS technology to clinical practice
Key technical goals
• Scalability
  • In the rate and number of patient sequence submissions
  • In the density of sequence data (from whole exome to whole genome)
• Flexibility, traceability, comparability across versions
  • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
  • Trace each version and its executions
  • Ability to compare results obtained using different pipelines and reason about the differences
• Openness: simplify the process of adding
  • New variant analysis tools
  • New statistical methods for variant filtering, selection, and ranking
  • Integration with third-party databases
Approach and testbed
Technical approach: double porting
• Infrastructure: HPC cluster to cloud (IaaS)
• Implementation: NGS pipelines from scripts to workflow
• Implement user tools for clinical diagnosis as cloud apps (SaaS)
Testbed and scale:
• Neurological patients from the North-East of England, with a focus on rare diseases
• Initial testing on about 300 sequences
• 2,500-3,000 sequences expected within 12 months
Why port to workflow?
• Programming
  • Workflows provide better abstraction in the specification of pipelines
  • Workflows are directly executable by the enactment engine
  • Easier to understand, share, and maintain over time
  • Flexible: relatively easy to introduce variations
• System: minimal installation/deployment requirements
  • Fewer dedicated technical staff hours required
  • Automated dependency management, packaging, deployment
• Extensible by wrapping new tools
• Exploits available data parallelism (but not automagically)
• Reproducibility
• Execution monitoring, provenance collection
  • The persisted trace serves as evidence for the data
  • Amenable to automated analysis
Scripted pipeline
Pipeline steps (from the diagram):
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination (Picard tools)
• Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations
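In the scripted version these steps were chained from a submission script. The following minimal Python sketch shows that style of chaining; tool versions, file names, paths and command-line options are illustrative assumptions, not the project's actual script, and several steps (recalibration, coverage, VCF subsetting) are omitted for brevity.

# Illustrative sketch of a scripted NGS pipeline: each step invokes an
# external tool in sequence. Commands are simplified examples only.
import subprocess

def run(cmd):
    print(">>", cmd)
    subprocess.run(cmd, shell=True, check=True)

sample = "patient01"
ref = "hg19.fa"

# 1. Alignment against the HG19 reference with BWA
run(f"bwa mem -t 2 {ref} {sample}_R1.fq {sample}_R2.fq > {sample}.sam")

# 2. Cleaning and duplicate elimination with Picard tools
run(f"java -jar picard.jar SortSam I={sample}.sam O={sample}.bam SORT_ORDER=coordinate")
run(f"java -jar picard.jar MarkDuplicates I={sample}.bam O={sample}.dedup.bam M={sample}.metrics")

# 3. Variant calling with the GATK haplotype caller (GATK 3.x style command line)
run(f"java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R {ref} "
    f"-I {sample}.dedup.bam -o {sample}.vcf")

# 4. Functional annotation with Annovar, followed by in-house annotations
#    (real Annovar runs need additional protocol/operation options; simplified here)
run(f"table_annovar.pl {sample}.vcf humandb/ -buildver hg19 -out {sample}.annot")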
From scripts to workflows
Workflow nesting
Pipeline evolution
Pipeline: a set C = { c1 … cn } of components (tool wrappers).
Each ci has a configuration conf(ci) and a version v(ci).
What can change (illustrated in the sketch below):
1. Tool version: v(ci) → v'(ci)
2. Tool replacement / addition / removal: ci → c'i
3. Configuration parameters: conf(ci) → conf'(ci)
…and why:
• Technology / algorithm evolution, e.g. the traditional GATK variant caller replaced by the GATK haplotype caller
  • Does the interface change?
  • Do the operational assumptions change? E.g. the GATK Variant Recalibrator requires large input data and is not suitable for targeted sequencing.
(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in bioinformatics, pp. bbs086–, Jan. 2013
For sequence alignment alone, Pabinger et al. in their survey (*) list 17 aligners, while for variant annotation they reference over 70 tools.
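As a concrete illustration of the component model above, here is a minimal Python sketch (class and field names are hypothetical) that represents a pipeline as a set of versioned, configured components and shows the three kinds of change.

# Minimal sketch of the pipeline model: a pipeline is a set of components
# (tool wrappers), each with a version v(ci) and a configuration conf(ci).
# Names and version numbers are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str                                  # e.g. "BWA", "GATK-HaplotypeCaller"
    version: str                               # v(ci)
    conf: dict = field(default_factory=dict)   # conf(ci)

pipeline_v1 = {
    "align":    Component("BWA", "0.6.2", {"threads": 2}),
    "call":     Component("GATK-UnifiedGenotyper", "2.7"),
    "annotate": Component("Annovar", "2013-08", {"maf_cutoff": 0.01}),
}

pipeline_v2 = dict(pipeline_v1)

# 1 - Tool version change: v(ci) -> v'(ci)
pipeline_v2["align"] = Component("BWA", "0.7.5", {"threads": 2})

# 2 - Tool replacement: ci -> c'i (traditional caller -> haplotype caller)
pipeline_v2["call"] = Component("GATK-HaplotypeCaller", "2.7")

# 3 - Configuration parameter change: conf(ci) -> conf'(ci)
pipeline_v2["annotate"] = Component("Annovar", "2013-08", {"maf_cutoff": 0.05})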
Role of provenance
Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
• Provenance is evidence in support of clinical diagnosis
  1. Why do these variants appear in the output list?
  2. Why have you concluded they are disease-causing?
• Requires the ability to trace variants through workflow execution
  • Simple scripting lacks this functionality
“Where do these variants come from?”
“Why do these results differ?”
Comparing results across pipeline configurations
Run pipeline version V1 → variant list VL1.
V1 → V2: replace the BWA version, modify Annovar configuration parameters.
Run pipeline version V2 → variant list VL2.
How do VL1 and VL2 differ, and why?
Two complementary comparisons: DDIFF (data differencing) and PDIFF (provenance differencing).
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
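To make the data-differencing (DDIFF) side concrete, here is a naive Python sketch that compares two variant lists keyed by (chromosome, position, ref, alt). It is an illustration only, not the method from the cited paper, and ignores multi-allelic sites and variant normalisation.

# Naive sketch of data differencing (DDIFF) between two variant lists.
def variant_key(line):
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    return (chrom, pos, ref, alt)

def load_vcf(path):
    with open(path) as f:
        return {variant_key(l) for l in f if l.strip() and not l.startswith("#")}

def ddiff(vcf1, vcf2):
    vl1, vl2 = load_vcf(vcf1), load_vcf(vcf2)
    return {
        "only_in_VL1": vl1 - vl2,   # variants lost when moving from V1 to V2
        "only_in_VL2": vl2 - vl1,   # variants gained
        "common":      vl1 & vl2,
    }

# Example: compare the outputs of pipeline versions V1 and V2
# report = ddiff("sample.v1.vcf", "sample.v2.vcf")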
PDIFF - overview
[Diagram: two workflow versions, WA and WB, whose executions are to be compared]
The corresponding provenance traces
Delta graph computed by PDIFF
PDIFF helps determine the impact of variations in the pipeline
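A rough Python sketch of the intuition behind PDIFF follows: walk the two provenance traces backwards from a diverging output and collect the points where the traces differ. The trace representation and the logic are illustrative assumptions, not the published algorithm.

# Rough sketch of provenance differencing: a trace maps each data item to
# (activity descriptor, [input data items]). Starting from a diverging
# output, walk both traces backwards and record where they disagree.
def pdiff(trace_a, trace_b, output, delta=None):
    delta = delta if delta is not None else []
    step_a, step_b = trace_a.get(output), trace_b.get(output)
    if step_a is None and step_b is None:
        return delta                      # reached raw inputs on both sides
    if step_a != step_b:
        # the activity (tool, version, configuration) or its inputs differ here
        delta.append((output, step_a, step_b))
    inputs_a = step_a[1] if step_a else []
    inputs_b = step_b[1] if step_b else []
    for data_item in set(inputs_a) | set(inputs_b):
        pdiff(trace_a, trace_b, data_item, delta)
    return delta

# Example traces (hypothetical):
# trace_a = {"vcf": (("GATK-UnifiedGenotyper", "2.7"), ["bam"]),
#            "bam": (("BWA", "0.6.2"), ["fastq"])}
# trace_b = {"vcf": (("GATK-HaplotypeCaller", "2.7"), ["bam"]),
#            "bam": (("BWA", "0.6.2"), ["fastq"])}
# pdiff(trace_a, trace_b, "vcf")  -> delta recorded at "vcf"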
HPC Cluster configuration
16 compute nodes, 48/96 GB RAM and 250 GB disk each
19 TB usable storage space
Gigabit Ethernet
Shared resource for Institute-wide research
Submission script specifies node / core requirements
Computation waits until resources are available
Current config:
• BWA alignment: 2 cores
• GATK: 8 cores
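For illustration, a submission wrapper along the following lines could request these per-step core counts; the SGE-style directives and the qsub command are assumptions, not the Institute's actual submission script.

# Illustrative sketch of requesting per-step core counts from the cluster
# scheduler. Scheduler flags (SGE-style "-pe smp N") and commands are
# assumptions for illustration only.
import subprocess, textwrap

def submit(job_name, cores, command):
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #$ -N {job_name}
        #$ -pe smp {cores}
        #$ -cwd
        {command}
    """)
    with open(f"{job_name}.sh", "w") as f:
        f.write(script)
    # the job waits in the queue until the requested cores become available
    subprocess.run(["qsub", f"{job_name}.sh"], check=True)

# Current configuration from the slide:
submit("bwa_align", 2, "bwa mem -t 2 hg19.fa sample_R1.fq sample_R2.fq > sample.sam")
submit("gatk_call", 8, "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I sample.bam -o sample.vcf")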
The case for cloud in genome informatics (*)
(*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome Biology 11, no. 5 (January 2010): 207.
• Storage and computing resources co-located in a cloud
• Privacy issues
  • Public, private, or hybrid
• Fluctuating demand benefits from elasticity
• Web-based access for clinicians simplifies adoption
Workflow on Azure Cloud - configuration
[Deployment diagram]
• <<Azure VM>>: Azure Blob store, e-SC db backend
• <<Azure VM>>: e-Science Central main server (JMS queue, REST API, Web UI)
• Clients: web browser, rich client app, submitting workflow invocations
• 3 x <<worker role>>: Workflow engine, with e-SC blob store
• Data flows: workflow invocations, e-SC control data, workflow data
• Top-level workflow and sub-workflows run on the workflow engines
Test configuration: 3 nodes, 24 cores
Workflow and sub-workflows execution
[Workflow diagram: executable blocks and sub-workflow blocks; sub-workflow invocations are sent to the e-SC queue. Same deployment diagram as on the previous slide.]
Workflow invocation executing on one engine (fragment)
Multi-sample processing
Sample list [S1 … Sk] → top-level workflow → variant files [VCF1 … VCFk].
Map semantics: push K new workflow invocations to the e-SC queue, e.g. BWA (S1), BWA (S2), … (sketched below).
[Same deployment diagram as before.]
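A minimal Python sketch of this map semantics, with a plain in-process queue standing in for the e-SC JMS queue; all names are hypothetical and the "engines" here run sequentially rather than as separate worker roles.

# Minimal sketch of "map" semantics: the top-level workflow pushes one
# sub-workflow invocation per input sample onto the queue, and the worker
# engines pick them up independently.
import queue

invocation_queue = queue.Queue()

def map_block(samples, sub_workflow):
    """Push K new sub-workflow invocations, one per sample."""
    for s in samples:
        invocation_queue.put((sub_workflow, s))

def workflow_engine(engine_id):
    """Each engine repeatedly takes the next invocation off the queue."""
    while not invocation_queue.empty():
        sub_workflow, sample = invocation_queue.get()
        print(f"engine {engine_id}: running {sub_workflow} on {sample}")
        # ... run the alignment/calling sub-workflow for this sample, produce a VCF ...

map_block(["S1", "S2", "S3"], "align-and-call")
for e in range(3):          # e.g. three worker-role engines
    workflow_engine(e)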
Sub-workflows enqueued recursively
• Exec block: specifies threading; the OS maps threads to available cores
• Sub-workflow: gets added to the queue
• Sub-workflow: one instance gets added to the queue for each input sample
Preliminary cost estimates
2 samples, 1 x 8-core node, 30 h @ £0.821/h = £12.3 / sample
6 samples, 3 x 8-core nodes, 47 h @ £2.5/h = £19 / sample
Cloud deployment makes cost easy to calculate; the cost model is based on node uptime.
Trade-off:
• Better flexibility and scalability
• But loss of performance
Some tuning required:
• Better node utilization: larger sample batches
• Remove unnecessary wait time: make sub-workflows asynchronous
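The per-sample figures follow directly from node uptime; a small sketch of that calculation, using the rates and times from this slide:

# Cost model based on node uptime:
# cost per sample = (hourly rate for the reserved nodes * wall-clock hours) / batch size
def cost_per_sample(samples, hourly_rate_gbp, hours):
    return hourly_rate_gbp * hours / samples

print(cost_per_sample(2, 0.821, 30))   # ~12.3 GBP/sample on 1 x 8-core node
print(cost_per_sample(6, 2.5, 47))     # ~19.6 GBP/sample on 3 x 8-core nodes (shown as £19 on the slide)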
Summary
• Whole-exome sequence processing on a cloud infrastructure (Windows Azure, project sponsor)
• Tracking provenance as evidence and for change analysis
• Porting HPC scripted pipeline to workflow model and technology
• Scalability, Flexibility, Evolvability