software pipelines: the good, the bad and the ugly
TRANSCRIPT
João André Carriço, Microbiology Institute and Instituto de Medicina Molecular, Faculty of Medicine, University of Lisbon
Email: [email protected] Twitter: @jacarrico
Whole genome sequencing for clinical microbiology: Translation into routine applications
2 September 2017, Basel
A pipeline (in software engineering) consists of a chain of processing elements arranged so that the output of each element is the input of the next; the name is by analogy to a physical pipeline
https://en.wikipedia.org/wiki/Pipeline_(software)
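The definition above can be sketched in a few lines of Python. This is only an illustration of the idea: the three step functions (`trim`, `assemble`, `report`) are hypothetical stand-ins, not real bioinformatics tools.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(steps: Iterable[Step], data: str) -> str:
    """Feed the output of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical stand-ins for real tools (read trimming, assembly, reporting).
def trim(reads: str) -> str:
    return reads.strip("N")


def assemble(reads: str) -> str:
    return reads.upper()


def report(contig: str) -> str:
    return f"length={len(contig)}"


result = run_pipeline([trim, assemble, report], "NNacgtacgtNN")
print(result)
```

The order of the list is the order of the chain, so swapping or inserting a step only touches the list.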
The Ideal Scenario
Microbiological Sample
Magic Box of NGS Wonders for Clinical Microbiology
Completely characterized strain:
• Species Identification
• Serotype
• Multilocus Sequence Type (MLST)
• cgMLST / wgMLST / SNPs
• Antibiotic resistance profile
• Virulence factors
• Other SBTM information, e.g.:
  • spa (S. aureus)
  • emm (Group A Streptococcus)
Actionable information for:
• Diagnostics
• Surveillance
• Outbreak detection
Comparability: The same analysis workflow is applied to multiple samples
Accountability: Keeping track of which software (and version) did the analysis
Modularity: Adding new software to the pipeline without changing the existing one
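A minimal sketch of how accountability and modularity could look in code, assuming each step is registered with a name and a version. The `Step` and `Pipeline` classes and the step names are hypothetical, not taken from any of the tools mentioned in this talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    name: str
    version: str
    func: Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class Pipeline:
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> "Pipeline":
        # Modularity: a new step is appended without touching existing ones.
        self.steps.append(step)
        return self

    def run(self, sample: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
        # Accountability: record which software (and version) ran, in order.
        log: List[str] = []
        for step in self.steps:
            sample = step.func(dict(sample))
            log.append(f"{step.name} {step.version}")
        return sample, log


# Hypothetical steps; real ones would call external tools.
pipeline = Pipeline()
pipeline.add(Step("qc", "0.1", lambda s: {**s, "qc": "pass"}))
pipeline.add(Step("typing", "2.0", lambda s: {**s, "st": "unknown"}))

result, log = pipeline.run({"id": "strain-1"})
print(log)
```

Running the same `pipeline` object on every sample is what gives comparability: all samples go through identical steps and versions.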
Bioinformatics Workflow software:
Nextflow https://www.nextflow.io/
Bionode Watermill https://github.com/bionode/bionode-watermill
Snakemake https://snakemake.readthedocs.io/en/stable/
Re-run as needed: If a module doesn’t run, there is no need to re-run the whole analysis
Compatible with High Performance Computing job schedulers (SLURM, etc.)
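The "re-run as needed" idea can be illustrated with a simple rule: skip a step if its output file already exists. This is a simplified sketch, not how any particular workflow manager implements it; `run_step` and the `assemble` function are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_step(output: Path, action: Callable[[], str]) -> bool:
    """Run `action` only if its output file is missing; return True if it ran."""
    if output.exists():
        return False
    output.write_text(action())
    return True


workdir = Path(tempfile.mkdtemp())
calls: List[str] = []


def assemble() -> str:
    # Hypothetical expensive step; here it only writes a tiny FASTA record.
    calls.append("assemble")
    return ">contig1\nACGT\n"


first = run_step(workdir / "assembly.fasta", assemble)   # runs the step
second = run_step(workdir / "assembly.fasta", assemble)  # skipped, output exists
print(first, second, calls)
```

Workflow managers such as Snakemake and Nextflow provide this kind of bookkeeping themselves, so in practice one would rely on them rather than hand-rolled checks.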
Software validation: Most software contains bugs that can affect the results. Pipelines can hamper tracking down the problem
Reproducibility: Running the same strain “should” yield the same results, but some software has stochastic steps
Opacity: Given the dependency on multiple software packages, it can be difficult to determine how the final results were achieved
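On the reproducibility point: one way to check whether a stochastic step is reproducible is to fix its random seed and compare checksums of the outputs from two runs. The sketch below assumes a hypothetical `subsample` step; real tools may or may not expose a seed option.

```python
import hashlib
import random
from typing import List, Optional


def subsample(reads: List[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Hypothetical stochastic step: pick k reads at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(reads, k))


def digest(items: List[str]) -> str:
    """Checksum of a step's output, used to compare two runs."""
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()


reads = [f"read{i}" for i in range(1000)]

# With a fixed seed, two runs give identical output and identical checksums.
seeded_a = digest(subsample(reads, 10, seed=42))
seeded_b = digest(subsample(reads, 10, seed=42))
reproducible = seeded_a == seeded_b
print(reproducible)
```

Without a seed (`seed=None`) the two checksums would usually differ, which is the situation the slide warns about.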
Database dependency
Many bioinformatics tools depend on publicly available, curated databases. It is difficult to assess false positives and false negatives.
Virulence Factor Databases
VFDB (http://www.mgc.ac.cn/VFs/main.htm)
Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
Victors (http://www.phidias.us/victors/)
PHI-Base (http://www.phi-base.org/)
MvirDB (http://mvirdb.llnl.gov/)
To know more: presentation in the “Controversies in interpreting whole genome sequence data” session: http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases
Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
ResFinder 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/); DB repository: https://bitbucket.org/genomicepidemiology/resfinder_db
Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
Integrall: The integron database (http://integrall.bio.ua.pt/)
(…)
Software dependencies: If a tool is updated and its output changes, the pipeline breaks and needs to be revised
Database / URL format changes: When databases, or the URLs where data is stored in public repositories, change, several software modules can be affected (a.k.a. the NCBI effect)
Setting up the pipeline: Not as easy as it seems. The Bus effect.
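One defensive habit against silent breakage when a tool's output format changes is to validate the format before parsing it, so the pipeline fails loudly rather than producing wrong results. A minimal sketch, assuming a tab-separated report; the column names in `EXPECTED_COLUMNS` are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Hypothetical column names of a tab-separated typing report.
EXPECTED_COLUMNS = ["sample", "scheme", "st"]


def parse_report(text: str) -> List[Dict[str, str]]:
    """Parse the report, refusing to continue if the header has changed."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


rows = parse_report("sample\tscheme\tst\ns1\tecoli\t131\n")

# Simulate an updated tool that dropped a column.
try:
    parse_report("sample\tst\ns1\t131\n")
    change_detected = False
except ValueError:
    change_detected = True

print(rows, change_detected)
```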
INNUca: Assembly Pipeline
Prokka: Genome Annotation Pipeline
Nullarbor: All-in-one Pipeline
Web platforms
Innuendo platform
https://www.cdc.gov/pulsenet/pathogens/wgs.html
Contamination
Mislabelling
E. coli
E. fergusonii
Mixture
Barcode bleeding
Wrong file assignment
INNUca https://github.com/INNUENDOCON/INNUca
Dependencies:
• Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools http://www.htslib.org/
• FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• SPAdes http://cab.spbu.ru/software/spades/
• Pilon https://github.com/broadinstitute/pilon
• MLST 2 https://github.com/tseemann/mlst
Features:
• Species confirmation
• Contamination detection
• Assembly correction
• Multiple allele detection -> multiple strains
Output
Benchmark
20-40 mins per strain (60x-100x coverage; 8 CPUs)
High Performance Cluster: 6-7 nodes, 244 CPUs used: 3h57m for 124 E. coli ~= 1.9 mins per strain
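The per-strain figure for the cluster run is the wall-clock time divided by the number of strains. A quick check of that arithmetic, using only the numbers quoted on the slide:

```python
from datetime import timedelta

wall_clock = timedelta(hours=3, minutes=57)  # total cluster run time
strains = 124                                # E. coli strains processed

total_minutes = wall_clock.total_seconds() / 60
minutes_per_strain = total_minutes / strains

print(f"{total_minutes:.0f} min / {strains} strains = {minutes_per_strain:.1f} min per strain")
```

Note that this is throughput across the whole cluster, not the time a single strain takes from start to finish.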
Contamination and multi-strain detection
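The idea behind "multiple allele detection -> multiple strains" can be sketched as follows: if more than one allele is called at a locus that should be single-copy, the sample may contain more than one strain. This is an illustration of the principle only, not INNUca's actual implementation; the allele numbers are made up.

```python
from typing import Dict, List, Set


def loci_with_multiple_alleles(calls: Dict[str, Set[str]]) -> List[str]:
    """Return the loci where more than one allele was called."""
    return sorted(locus for locus, alleles in calls.items() if len(alleles) > 1)


# Hypothetical allele calls per locus; allele numbers are invented.
clean = {"adk": {"6"}, "fumC": {"4"}, "gyrB": {"12"}}
mixed = {"adk": {"6"}, "fumC": {"4", "11"}, "gyrB": {"12", "5"}}

clean_flags = loci_with_multiple_alleles(clean)
mixed_flags = loci_with_multiple_alleles(mixed)

print("clean sample flags:", clean_flags)
print("mixed sample flags:", mixed_flags)
```

A sample with any flagged locus would then be sent for closer inspection rather than typed automatically.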
Genome annotation made easy by Torsten Seemann (slides by Torsten)
Genome annotation: adding biological information to the sequence, by describing features
To know more: http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013
Available at: https://github.com/tseemann/prokka
Complete pipeline from reads to reports by Torsten Seemann
The objective is to automate analysis for everyday use in public health labs / research settings
Uses and distills the outputs of many software tools
Available at: https://github.com/tseemann/nullarbor
Web Platforms:
Facilitate the use of pipelines by non-bioinformaticians (the old and boring Windows vs Linux software debate can end (?) …)
Facilitate data sharing and comparison: Creation of Federated Strain Databases
A novel cross-sectorial platform for the integration of genomics in surveillance of foodborne pathogens
http://www.innuendoweb.org/
Target species:
• Escherichia coli
• Salmonella enterica
• Yersinia enterocolitica
• Campylobacter sp.
http://www.irida.ca/
INNUENDO Platform
[Architecture diagram. Component labels: Client Browser (Chrome); Frontend / DB Server; Web Application; REST API; Metadata Storage; NGS Onto; LDAP; Calculation Server; REST API; Job Processing Application; SLURM Job Scheduler; Sequences Storage; Computation Module (INNUca, ReMatCh, chewBBACA, PHYLOViZ Online)]
Slide credit: Bruno Gonçalves
Target users: Reference laboratories. Small groups.
Apply multiple pipelines to the same strains and queue them for processing using SLURM. Can use a High Performance Computer if available
Aggregate selected strains from multiple projects into reports:
• Reports can be saved and exported
• Gene-by-gene analyses can be visualized directly in PHYLOViZ Online and the resulting trees saved and shared
• N closest strains in the database can be added to the tree automatically
Automatically adds the metadata filled in for the project, and several tree analyses can be performed:
• NLV Graph
• Interactive distance matrix
• Dynamic exploration of wgMLST schemas
To know more: https://online.phyloviz.net/index
[Diagram: Input -> box -> Output]
Black box | See-through box
Commercial/Freeware | Freeware
You get what it gives you | You can “tailor”
Ready to use | “Major” headache
Stealth change | Visible change
Standalone | Dependencies
Slide credit: Mario Ramirez
Pipelines can provide actionable results for Clinical Microbiology out of HTS data
One must be aware of the limitations of each pipeline. Setting up a maintainable pipeline requires bioinformaticians.
Most are Linux based. But web platforms can provide an easy-to-use option for non-bioinformaticians and are useful to stimulate data sharing.
Pipelines greatly benefit from High Performance Computing Clusters. Nevertheless, these need specialized personnel to install and maintain.
INNUENDO project [GP/EFSA/AFSCO/2015/01/CT2]
BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]
ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by FEEI - “Fundos Europeus Estruturais e de Investimento” from “Programa Operacional Regional Lisboa 2020” and by national funds from FCT - “Fundação para a Ciência e Tecnologia”
Disclaimer
The conclusions, findings, and opinions expressed in this presentation reflect only the
view of the INNUENDO consortium members and not the official position of the
European Food Safety Authority nor of the Government of the Basque Country that are
not responsible for any use that may be made of the information they contain.