software pipelines: the good, the bad and the ugly

João André Carriço, Microbiology Institute and Instituto de Medicina Molecular, Faculty of Medicine, University of [email protected] twitter: @jacarrico

Whole genome sequencing for clinical microbiology: Translation into routine applications, 2 September 2017, Basel

A pipeline (in software engineering) consists of a chain of processing elements arranged so that the output of each element is the input of the next; the name is by analogy to a physical pipeline

https://en.wikipedia.org/wiki/Pipeline_(software)
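The definition above can be illustrated with a minimal sketch, where each processing element is a plain function and the output of one is passed as the input of the next. The three step functions are hypothetical stand-ins, not real tools.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(steps: Iterable[Step], data: str) -> str:
    """Feed data through each step in order; each output becomes the next input."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical stand-ins for real modules (read trimming, assembly, typing).
def trim_reads(sample: str) -> str:
    return sample + " -> trimmed reads"


def assemble(reads: str) -> str:
    return reads + " -> assembly"


def call_mlst(assembly: str) -> str:
    return assembly + " -> sequence type"


result = run_pipeline([trim_reads, assemble, call_mlst], "sample")
print(result)  # sample -> trimmed reads -> assembly -> sequence type
```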

Figure: a physical pipeline compared with a software pipeline, in which a microbiological sample passes through a chain of software/algorithm modules.

The Ideal Scenario

Magic Box of NGS Wonders for Clinical Microbiology

Completely characterized strain:

• Species Identification
• Serotype
• Multilocus Sequence Type (MLST)
• cgMLST / wgMLST / SNPs
• Antibiotic resistance profile
• Virulence factors
• Other SBTM information, e.g.:
    • spa (S. aureus)
    • emm (Group A Streptococcus)

Actionable information for:

• Diagnostics
• Surveillance
• Outbreak detection

Magic Box of NGS Wonders for Clinical Microbiology

Pipelines of HTS analysis software

Comparability: The same analysis workflow is applied to multiple samples.

Accountability: Keeping track of which software (and version) did the analysis.

Modularity: Adding new software to the pipeline without changing the existing modules.
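The accountability point can be sketched as a small provenance record that stores the version string reported by each tool next to the sample it processed. This is an illustration only: the function names are made up, and the Python interpreter stands in for a real bioinformatics tool so the sketch runs anywhere.

```python
import json
import subprocess
import sys


def tool_version(command: list[str]) -> str:
    """Run a version command and return the first line it prints."""
    done = subprocess.run(command, capture_output=True, text=True, check=True)
    # Some tools print their version to stderr, so fall back to that stream.
    output = (done.stdout or done.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


def provenance(sample_id: str, tools: dict[str, list[str]]) -> dict:
    """Build a record of which tool versions were used for a sample."""
    versions = {name: tool_version(cmd) for name, cmd in tools.items()}
    return {"sample": sample_id, "tools": versions}


# Assumption: the interpreter is used here in place of a real pipeline module.
record = provenance("strain_001", {"python": [sys.executable, "--version"]})
print(json.dumps(record, indent=2))
```

Saving such a record with every result makes it possible to tell later which software version produced it.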

Bioinformatics workflow software:

• Nextflow https://www.nextflow.io/
• Bionode Watermill https://github.com/bionode/bionode-watermill
• Snakemake https://snakemake.readthedocs.io/en/stable/

Re-run as needed: If a module doesn’t run, there is no need to re-run the whole analysis.

Compatible with High Performance Computing job schedulers (SLURM, etc.).
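The "re-run as needed" idea can be sketched as follows: a step is skipped when its output file already exists, so only the missing step runs again. This is a simplified illustration, not how any of the workflow managers above are implemented (they typically also compare timestamps or checksums); the file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_step(output: Path, action: Callable[[Path], None], log: list[str]) -> None:
    """Run action only if its output file does not exist yet."""
    if output.exists():
        log.append(f"skip {output.name}")
        return
    action(output)
    log.append(f"ran {output.name}")


def write_marker(path: Path) -> None:
    # Stand-in for a real tool that would write its result to this path.
    path.write_text("done\n")


log: list[str] = []
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    outputs = [workdir / "trimmed.fastq", workdir / "assembly.fasta"]
    for out in outputs:       # first run: every step executes
        run_step(out, write_marker, log)
    outputs[1].unlink()       # simulate a failed assembly step
    for out in outputs:       # second run: only the missing step executes
        run_step(out, write_marker, log)

print(log)
```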

Software validation: Most software contains bugs that can affect the results. Pipelines can make the problem harder to track down.

Reproducibility: Running the same strain “should” yield the same results, but some software has stochastic steps.

Opacity: Because a pipeline depends on multiple pieces of software, it can be difficult to determine how the final results were obtained.
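For the reproducibility point, one common mitigation (where a tool allows it) is to fix the random seed of its stochastic steps. The sketch below uses a made-up read subsampling function to show the effect; it is not taken from any specific tool.

```python
import random
from typing import Optional


def subsample_reads(read_ids: list[str], n: int, seed: Optional[int] = None) -> list[str]:
    """Pick n reads at random; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(read_ids, n))


reads = [f"read_{i}" for i in range(1000)]

first = subsample_reads(reads, 5, seed=42)
second = subsample_reads(reads, 5, seed=42)
print(first == second)  # True: same seed, same subsample
```

Without a seed, two runs on the same strain would usually pick different reads, which is one way identical inputs can end up with different results.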

Database dependency

Several bioinformatics tools depend on publicly available, curated databases. This makes it difficult to assess false positives and false negatives.

Virulence Factor Databases VFDB (http://www.mgc.ac.cn/VFs/main.htm)

Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)

Victors (http://www.phidias.us/victors/)

PHI-Base (http://www.phi-base.org/)

MvirDB (http://mvirdb.llnl.gov/)

To know more: presentation in the “Controversies in interpreting whole genome sequence data” session: http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases

Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)

ResFinder 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/); DB repository: https://bitbucket.org/genomicepidemiology/resfinder_db

Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)

INTEGRALL: The integron database (http://integrall.bio.ua.pt/)

(…)


Software dependencies: If a piece of software is updated and its output changes, the pipeline breaks and needs to be revised.

Database/URL format changes: When databases, or the URLs where data are stored in public repositories, change, several software modules can be affected (a.k.a. the NCBI effect).

Setting up the pipeline: Not as easy as it seems. The Bus effect.
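One way to notice an output format change early, rather than letting it silently corrupt downstream results, is to validate the columns a module produces before passing them on. The report layout and column names below are hypothetical, chosen only to illustrate the check.

```python
import csv
import io

# Assumption: the header the downstream module was written against.
EXPECTED_COLUMNS = ["sample", "scheme", "st"]


def parse_typing_report(text: str) -> list[dict[str, str]]:
    """Parse a tab-separated report, failing loudly if the header changed."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


old_format = "sample\tscheme\tst\nstrain_001\tecoli\t131\n"
new_format = "id\tscheme\tsequence_type\nstrain_001\tecoli\t131\n"

rows = parse_typing_report(old_format)

try:
    parse_typing_report(new_format)
    format_change_detected = False
except ValueError:
    format_change_detected = True

print(rows, format_change_detected)
```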

The output of one piece of software is used as the input of another:

Most bioinformatics tools are pipelines!

• INNUca: assembly pipeline
• Prokka: genome annotation pipeline
• Nullarbor: all-in-one pipeline

Web platforms

Innuendo platform

https://www.cdc.gov/pulsenet/pathogens/wgs.html

Contamination

Mislabelling

E. coli

E. fergusonii

Mixture

Barcode bleaching

Wrong file assignment

INNUca https://github.com/INNUENDOCON/INNUca

Dependencies:

• Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools http://www.htslib.org/
• FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• SPAdes http://cab.spbu.ru/software/spades/
• Pilon https://github.com/broadinstitute/pilon
• MLST 2 https://github.com/tseemann/mlst

Features:

• Species confirmation
• Contamination detection
• Assembly correction
• Multiple allele detection -> multiple strains

Output

Benchmark

• 20-40 min per strain (60x-100x coverage; 8 CPUs)
• High Performance Cluster: 6-7 nodes, 244 CPUs used: 3h57m for 124 E. coli, ~1.9 min per strain
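The per-strain figure for the cluster run follows directly from the quoted wall-clock time and strain count; a quick check of the arithmetic, using only the numbers on the slide:

```python
# Numbers quoted in the benchmark: 3h57m wall-clock time for 124 E. coli strains.
wall_clock_min = 3 * 60 + 57            # 237 minutes
strains = 124

per_strain_min = wall_clock_min / strains
strains_per_hour = strains / (wall_clock_min / 60)

print(f"{per_strain_min:.2f} min per strain")      # 1.91 min per strain
print(f"{strains_per_hour:.1f} strains per hour")  # 31.4 strains per hour
```

This is throughput across the whole cluster, not the time a single strain takes from start to finish.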

Contamination and multi-strain detection

Genome annotation made easy by Torsten Seemann (slides by Torsten)

Genome annotation: adding biological information to the sequence, by describing features

To know more: http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013

Available at: https://github.com/tseemann/prokka

Complete pipeline from reads to reports by Torsten Seemann

The objective is to automate analysis for everyday use in public health labs and research settings

Uses and distills the outputs of many software tools

Available at: https://github.com/tseemann/nullarbor

Slides by Torsten Seemann

Web Platforms:

Facilitate the use of pipelines by non-bioinformaticians (the old and boring Windows vs Linux software debate can end (?) …)

Facilitate data sharing and comparison: Creation of Federated Strain Databases

A novel cross-sectorial platform for the integration of genomics in surveillance of foodborne pathogens

http://www.innuendoweb.org/

Target species:

• Escherichia coli
• Salmonella enterica
• Yersinia enterocolitica
• Campylobacter sp.

http://www.irida.ca/

INNUENDO Platform (architecture diagram; components shown):

• Client Browser (Chrome)
• Web Application
• Frontend / DB Server
• REST API
• Metadata Storage
• NGSOnto
• LDAP
• Calculation Server
• Job Processing Application
• SLURM Job Scheduler
• Sequences Storage
• Computation Module
• INNUca, ReMatCh, chewBBACA, PHYLOViZ Online

Slide credit: Bruno Gonçalves

Target users: Reference laboratories. Small groups.

• Multi-user
• Create projects within a species for:
    • Outbreaks
    • Surveillance

Apply multiple pipelines to the same strains and queue them for processing using SLURM. Can use a High Performance Computer if available.
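Queueing several pipelines per strain through SLURM can be pictured as building one `sbatch` submission per strain and pipeline combination. The sketch below only constructs the command lines and does not submit anything. The script names `run_assembly.sh` and `run_typing.sh` are hypothetical, and this is not the INNUENDO platform's own code.

```python
import shlex


def sbatch_command(job_name: str, cpus: int, pipeline_cmd: list[str]) -> list[str]:
    """Build an sbatch command line that wraps a pipeline invocation."""
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--cpus-per-task={cpus}",
        f"--wrap={shlex.join(pipeline_cmd)}",
    ]


strains = ["strain_001", "strain_002"]
# Assumption: placeholder scripts standing in for real pipeline entry points.
pipelines = {"assembly": ["run_assembly.sh"], "typing": ["run_typing.sh"]}

jobs = [
    sbatch_command(f"{name}-{strain}", 8, script + [strain])
    for strain in strains
    for name, script in pipelines.items()
]

for job in jobs:
    print(shlex.join(job))
```

Each printed line could be passed to `subprocess.run` on a machine where SLURM is installed; the scheduler then decides when and where each job runs.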

Aggregate selected strains from multiple projects into reports:

• Reports can be saved and exported
• Gene-by-gene analyses can be visualized directly in PHYLOViZ Online and the resulting trees saved and shared
• The N closest strains in the database can be added to the tree automatically

Automatically adds the metadata filled in for the project; several tree analyses can be performed:

• NLV Graph
• Interactive distance matrix
• Dynamic exploration of wgMLST schemas

To know more: https://online.phyloviz.net/index
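The "N closest strains" feature can be pictured as a nearest-neighbour search over allelic profiles. The sketch below assumes a simple distance (the number of loci with different alleles, ignoring missing calls coded as 0) and made-up profiles. It is an illustration of the idea, not the method used by the platform or by PHYLOViZ Online.

```python
def allelic_distance(a: list[int], b: list[int]) -> int:
    """Count loci with different alleles; 0 marks a missing call and is ignored."""
    return sum(1 for x, y in zip(a, b) if x and y and x != y)


def n_closest(query: list[int], database: dict[str, list[int]], n: int) -> list[str]:
    """Return the names of the n profiles closest to the query (ties broken by name)."""
    ranked = sorted(
        database,
        key=lambda name: (allelic_distance(query, database[name]), name),
    )
    return ranked[:n]


# Hypothetical five-locus profiles; real cgMLST/wgMLST schemas have far more loci.
database = {
    "strain_A": [1, 2, 3, 4, 5],
    "strain_B": [1, 2, 3, 4, 9],
    "strain_C": [7, 8, 3, 4, 5],
    "strain_D": [7, 8, 9, 6, 0],
}
query = [1, 2, 3, 4, 5]

print(n_closest(query, database, 2))  # ['strain_A', 'strain_B']
```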

Input -> [box] -> Output

Black box                    See-through box
Commercial/Freeware          Freeware
You get what it gives you    You can “tailor”
Ready to use                 “Major” headache
Stealth change               Visible change
Standalone                   Dependencies

Slide credit: Mario Ramirez

Pipelines can provide actionable results for Clinical Microbiology out of HTS data.

One must be aware of the limitations of each pipeline. Setting up a maintainable pipeline requires bioinformaticians.

Most pipelines are Linux-based, but web platforms can give non-bioinformaticians an easy way to use them and are useful to stimulate data sharing.

Pipelines greatly benefit from High Performance Computing clusters. Nevertheless, these need specialized personnel to install and maintain them.

http://im.fm.ul.pt

INNUENDO project [GP/EFSA/AFSCO/2015/01/CT2]

BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]

ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by FEEI - “Fundos Europeus Estruturais e de Investimento” from “Programa Operacional Regional Lisboa 2020” and by national funds from FCT - “Fundação para a Ciência e Tecnologia”

Disclaimer

The conclusions, findings, and opinions expressed in this presentation reflect only the view of the INNUENDO consortium members and not the official position of the European Food Safety Authority nor of the Government of the Basque Country that are not responsible for any use that may be made of the information they contain.
