Dag Harmsen
Univ. Münster, [email protected]
EU PathoNGenTrace Consortium
cgMLST Evolvement and Challenges for Harmonization
19th November, 2015
Commercial Disclosure
Dag Harmsen is co-founder and partial owner of a
bioinformatics company (Ridom GmbH, Münster, Germany) that
develops software for DNA sequence analysis. Ridom and
Ion Torrent/Thermo Fisher (Waltham, MA) partnered and
released SeqSphere+ software to speed and simplify whole
genome based bacterial typing.
cgMLST Introduction
for
Outbreak Investigation
&
Global Nomenclature
Multiple Genome Alignment
(e.g., progressive Mauve)
k-mer without alignment
ANI with alignment
(Average Nucleotide Identity)
Genome-wide Mapping & SNP Calling
Genome-wide Gene by
Gene Allele Calling (cgMLST)
+ Works on read, draft & complete genome level, quickly identifies closest matching genome.
- Whole genome reduced to a single number of similarity.
- Additively expandable [≈ O(n)], but only a poor mapping to nomenclature possible.
- Difficult to interpret with draft genomes.
- Computationally intensive (≥ O(n²), limit ≈ 30-50 genomes).
- Not additively expandable, no nomenclature possible.
+ Works well for monomorphic organisms and ‘ad hoc’ analysis & more discriminatory than cgMLST.
- Problematic with rearrangement / recombination events.
- Not additively expandable (at least if not always mapped to the same reference).
+ Scalable, working on single gene to whole genome levels.
+ Both recombination & point mutation accommodated as a single event.
+ Additively expandable [≈ O(1)] & nomenclature possible.
…A C
GGGATACATACCTATGCTATAGCT…
…ACGTGATACATACCTATGATATAGCT…
…ACGTGATACATACCTATGCTATAGCT…
Surveillance and Phylogeny from Draft Genomes
‘Molecular Typing Esperanto’ by Standardized Genome Comparison
SNP, single nucleotide polymorphism; cgMLST, core genome multi locus sequence typing; n, number of isolates in database.
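As an illustration of the alignment-free k-mer approach, which reduces a whole genome comparison to a single similarity number, here is a minimal sketch. The Jaccard index over k-mer sets, the function names, and k = 5 are assumptions chosen for illustration and are not taken from the slides; the two sequences are the fragments shown above.

```python
def kmers(sequence, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=5):
    """Alignment-free similarity: shared k-mers divided by all distinct k-mers."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0  # two sequences shorter than k are treated as identical
    return len(a & b) / len(a | b)


# The two fragments differ at a single position (A vs. C).
fragment_1 = "ACGTGATACATACCTATGATATAGCT"
fragment_2 = "ACGTGATACATACCTATGCTATAGCT"
similarity = jaccard_similarity(fragment_1, fragment_2)
print(f"k-mer similarity: {similarity:.2f}")
```

The result is one number per genome pair, which is quick to compute but, as noted above, carries no locus-level information that could feed a nomenclature.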
Alleles vs. Sequence/SNPs
ST1 = 1,1,1,1,1,1,1
ST2 = 1,2,1,1,1,1,1 ST3 = 1,1,1,2,1,1,1 ST4 = 1,1,1,1,1,1,2
A (clonal founder)
B (isolate) C D
A
B
C
D
Point mutation
Recombination
Allelic profiles
A
B
C
D
Sequence data
The use of allelic profiles, rather than (concatenated) sequences or SNPs,
results in a loss of information (reductionist), but patterns of descent are more
robust to the effects of horizontal genetic transfer. Restricting the analysis to genes
(bacteria have a high coding capacity) avoids the frequently repetitive intergenic
regions, which are in any case difficult to assemble from 2nd generation NGS data.
Modified from: Ed Feil; Univ. Bath, UK
each one genetic event
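The difference between counting alleles and counting SNPs can be sketched as follows. The gene names and sequences are invented for illustration (they are not data from the slide): one isolate carries a point mutation, the other a recombined stretch that changes several bases in one gene.

```python
def snp_distance(genes_a, genes_b):
    """Count differing nucleotide positions across all loci (equal-length genes assumed)."""
    return sum(
        sum(x != y for x, y in zip(genes_a[locus], genes_b[locus]))
        for locus in genes_a
    )


def allele_distance(genes_a, genes_b):
    """Count loci whose sequences differ at all; each locus counts once."""
    return sum(genes_a[locus] != genes_b[locus] for locus in genes_a)


# Hypothetical toy isolates with two genes each.
founder = {"gene1": "ATGGCTAAGCTT", "gene2": "ATGCCGTTAGGA"}
point_mutant = {"gene1": "ATGGCTAAGCTT", "gene2": "ATGCCGTTAGGC"}  # one base changed
recombinant = {"gene1": "ATGGTTCAGTTT", "gene2": "ATGCCGTTAGGA"}   # three bases changed in one gene

for name, isolate in [("point mutant", point_mutant), ("recombinant", recombinant)]:
    print(name, "SNPs:", snp_distance(founder, isolate),
          "alleles:", allele_distance(founder, isolate))
```

With allelic profiles both isolates are one step from the founder, whereas the SNP count places the recombinant three steps away although only a single genetic event occurred.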
SNP vs. …
• M. tuberculosis
outbreak
• Reference mapped
against MtbC
H37Rv and SNP
calling.
• Outgroup strains
same MIRU-VNTR
type with no epi-link.
Kohl et al. (2014). JCM 52: 2479 [PubMed].
… cgMLST based Typing
• Reference mapped
against MtbC
H37Rv.
• Core genome
schema consists of
3,257 coding genes
(76.8% of whole
genome).
• 3,041 genes shared
by all 26 isolates
analyzed with
SeqSphere+.
Kohl et al. (2014). JCM 52: 2479 [PubMed].
Enterococcus: de Been et al. (2015). JCM pii: JCM.01946-15 [PubMed].
Genome-wide Genes vs. Whole Genome Consensus
Sequencing and Assembling Mismatches
cgMLST Evolvement
Jolley & Maiden (November, 2010). BMC Bioinformatics.
11: 595 [PubMed].
ClonalFrame trees were generated from 43 streptococcal
genome sequences, i.e., from concatenated sequences,
using A seven MLSA gene fragment loci and B 77 complete
genes found to be present throughout the genus identified
by BIGSdb.
Mellmann et al. (July, 2011). PLoS One. 6: e22751 [PubMed].
Phylogenetic Analysis of EHEC O104:H4
Method
First real-time prospective outbreak genomics
analysis. Hybrid assembly from reference
mapping & de novo assembly with Ion Torrent PGM
WGS data and BIGSdb genome-wide gene-by-gene
allele calling against a fixed set of loci/targets
(n = 1,144 STEC core genome gene scheme defined
before the outbreak analysis) and SeqSphere minimum-
spanning tree (not yet termed so, but the first cgMLST
application; internally called at that time ‘super MLST’
and/or ‘MLST on steroids’)
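A minimum-spanning tree over allelic profiles can be sketched with Prim's algorithm as below. This is a generic textbook construction and not the SeqSphere implementation; the function names are assumptions, and the four profiles are the ST1 to ST4 examples from the "Alleles vs. Sequence/SNPs" slide.

```python
def allele_distance(profile_a, profile_b):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns a list of (from, to, distance) edges."""
    names = list(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge leaving the tree; ties are broken by name.
        dist, u, v = min(
            (allele_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 2, 1, 1, 1, 1, 1),
    "ST3": (1, 1, 1, 2, 1, 1, 1),
    "ST4": (1, 1, 1, 1, 1, 1, 2),
}
tree = minimum_spanning_tree(profiles)
for u, v, dist in tree:
    print(f"{u} -- {v}: {dist} allele difference(s)")
```

For these profiles the tree is a star around ST1, matching the clonal founder picture: every other ST is a single-locus variant of it.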
Grant agreement number:
278864-2
EC contribution:
5.995.267 €
Duration:
54 months (01/01/2012 - 30/06/2016)
Funding scheme:
SME-targeted Collaborative Project
URL:
http://www.patho-ngen-trace.eu/
Scientific Advisory Board
Marc J. Struelens, European Centre for Disease Prevention and Control (ECDC), Stockholm, Sweden
Rene S. Hendriksen, Technical University of Denmark - National Food Institute, Lyngby, Denmark
Stephen H. Gillespie, University of St Andrews, St Andrews, Scotland, UK
Gary Van Domselaar, National Microbiology Laboratory, Public Health Agency of Canada
Consortium
Dag Harmsen
Universität Münster
Stefan Niemann
Coordinator, FZ Borstel
Philip Supply
Genoscreen
Martin C.J. Maiden
University of Oxford
Bruno Pot
Applied Maths NV
Jörg Rothgänger
Ridom GmbH
Ronald Burggrave
Piext BV
Claudia Giehl
Eurice GmbH
Associated Partners
Alexander Mellmann, Univ. Münster, Germany
Roland Diel, Univ. Kiel, Germany
Joao Carrico, Univ. Lisbon, Portugal
Main Objectives
• Develop new, completely integrated bioinformatics
microbial genomics tools for fast and easy quality-
controlled data extraction and interpretation for general
diagnostics and public health applications
• Streamline and implement new quality control
procedures for the whole genomics process
• Test and validate the performance of NGS for early
diagnosis and for monitoring the spread of major microbial
pathogens
Work-Packages
• WP1. Development of easy to use software tools for whole genome comparison
(Leader: Applied Maths, Partner: Ridom, Oxford; User: Genoscreen, Münster, Borstel, Oxford)
• WP2. Next generation high throughput genome wide analysis – new technologies and optimization
(Leader: Genoscreen, Partners: Münster, [Ion Torrent]; User: Borstel, Münster, Oxford)
• WP3. Use of whole genome sequencing & ODM for genotyping of MtbC
(Leader: Borstel, Partner: Genoscreen, Oxford, PiEXT [OpGen], Münster)
• WP4. Use of whole genome sequencing & ODM for genotyping of MRSA
(Leader: Münster, Partner: Genoscreen, Oxford, PiEXT [+OpGen])
• WP5. Use of whole genome sequencing & ODM for genotyping of Campylobacter
(Leader: Oxford, Partner: Genoscreen, Münster, PiEXT [+OpGen])
• WP6. Innovation related activities (IP, Dissemination, and Exploitation)
(Leader: Eurice, Partner: all)
• WP7. Management
(Leader: Eurice, Partner: all)
Prospective Real-time Studies
• Campylobacter: prospective surveillance in Oxfordshire,
UK has been ongoing with WGS data since 2010 – moving
to more real-time starting from 2015 (700-900 isolates
per year).
• MtbC: prospective surveillance in Hamburg, DE has been
ongoing with WGS data since 2005 – moving to more real-
time starting from 2015 (110-130 isolates per year).
• MRSA: prospective real-time (TaT 4-5 days) surveillance
of all multi-drug resistant bacteria (MDR; including MRSA)
of University Hospital Münster, DE since October 2013
(1,200-1,500 isolates per year).
Jolley et al. (April, 2012). Microbiology 158: 1005 [PubMed].
In 2013/2014 also rMLST STs added.
Jünemann et al. (April, 2013). Nature Biotechnology 31: 294 [PubMed].
Evaluation of contiguity and consensus accuracy
of draft de novo assemblies from benchtop
sequencers. (a) Evolution of genome contiguity for
GSJ, MiSeq and PGM. The contiguity of the de
novo assembly consensus sequences generated
by MIRA was analyzed for 4,671 non-pseudo- or
non-paralogous chromosomal coding E. coli
Sakai NCBI reference sequence genes. This
genome-wide gene-by-gene allele analysis
was performed with the Ridom SeqSphere+
software. (b) Venn diagram of consensus
sequencing accuracy for PGM 300 bp, MiSeq 2
× 250-bp PE (MIS) and GSJ. Reported
consensus errors were analyzed for 4,632 coding
Sakai genes that could be retrieved using
SeqSphere+ for all three platforms. Numbers of
variants confirmed by bidirectional Sanger
sequencing are indicated in parentheses.
*Avoidance of the term core genome as core genome genes are here determined from DNA
with rather high similarity values!
*
Maiden et al. (October, 2013). Nature Rev. Microbiol. 11: 728 [PubMed].
PathoNGenTrace Yearly Meeting (May 13th - 14th, 2013). Cambridge, UK.
Bruno Pot and Hannes Pouseele, Applied Maths. Kmers are the ways how to
compare genomes (work done together with Ilya Chorny, Illumina).
IMMEM X (October 2nd - 5th, 2013). Paris, France.
Hannes Pouseele, Applied Maths. Seven ways (= one of
them wgMLST) how to leave your lover (= PFGE).
At that time, cgMLST was for the
authors NOT a fixed set of
loci but the ‘shared’ loci of
the selected isolates under
study.
Kohl et al. (April, 2014). JCM 52: 2479 [PubMed].
First original publication using the term cgMLST and using a fixed genome-wide set of genes.
Tools for Microbial
Genotypic Surveillance and Phylogeny
Wyres et al. (2014). WGS analysis and interpretation in clinical and public health microbiology laboratories:
what are the requirements and how do existing tools compare? Pathogens 3: 437 [doi:10.3390/pathogens3020437].
ENA Submission   Included Database   Nomenclature
Yes              Yes                 No
No               No                  No
No               No                  No

No               Yes                 Yes
No               No                  No
Yes              Yes                 Yes
No               No                  No
Standardized Hierarchical Microbial WGS Typing
Pan-bacterial-specific
Jolley et al. (2012). Microbiology 158: 1005 [PubMed]
Global Nomenclature / Surveillance
rMLST
Species-specific
STEC: Mellmann et al. (2011). PLoS One. 6: e22751 [PubMed]
S. aureus: Leopold et al. (2014). JCM 52: 2365 [PubMed]
MtbC: Kohl et al. (2014). JCM 52: 2479 [PubMed]
K. pneumo.: Bialek et al. (2014). EID 20: 1812 [PubMed]
Lp: Moran-Gilad et al. (2015). Euro Surveill. 20: pii: 21186 [PubMed]
Listeria: Ruppitsch et al. (2015). JCM 53: 2869 [PubMed]
E. faecium: de Been et al. (2015). JCM 53: [PubMed]
cgMLST
MLST
SNPs confirmatory/canonical
Standardized hierarchical microbial WGS typing approach. From bottom to
top with increasing discriminatory power. MLST, multi locus sequence typing;
rMLST, ribosomal MLST; cgMLST, core genome MLST; wgMLST, whole
genome MLST, and SNP, single nucleotide polymorphism.
Species-specific
e.g., Van Ert et al. (2007). JCM 45: 47 [PubMed]
Maiden et al. (1998). PNAS 95: 3140 [PubMed]
also needed for backwards compatibility
Discriminatory Power
Speciation by rMLST
Evolutionary Analysis
SNPs*
Alleles from accessory reference genome
genes or pan-genome
based wgMLST
Local Outbreak Investigation
Outbreak- / Lineage-specific
SNP
e.g., Köser et al. (2012). NEJM 366: 2267 [PubMed]
wgMLST/‘shared’ genome
N. meng.: Jolley et al. (2012). JCM. 50: 3046 [PubMed]
C. jejuni: Cody et al. (2013). JCM. 51: 2526 [PubMed]
*from de novo assembled [PubMed] and/or mapped genomes
cgMLST Challenges for Harmonization
cgMLST and API/Ontology Workshop
Organization: Martin Maiden and Dag Harmsen
Date: 2nd & 3rd March, 2015
Place: Oxford University, UK
Participants: Oxford Univ., Univ. Münster, FZ Borstel, Univ. Warwick,
Inst. Pasteur, Univ. Lisboa, PHE, CDC, Ridom, and Applied Maths
Informal agreement that cgMLST is a fixed set of genome-wide genes,
agreed upon in the community, that is going to be at least the
minimum denominator for analyzing whole genome shotgun (WGS)
sequence data for surveillance purposes!
ECDC (October, 2015).
http://ecdc.europa.eu/en/publications/Publications/food-and-waterborne-diseases-next-generation-typing-methods.pdf.
Describes a top-down
approach that also includes
several tiers of reporting (e.g.,
national and international).
However, in the past the most
successful bacterial genotyping
initiatives (e.g., MLST, spa-
typing, or MIRU-VNTR)
followed a bottom-up (grassroots,
basic democratic or even
anarchic) approach.
Only the PulseNet initiative
followed a top-down approach,
but it never resulted in a public
nomenclature and involved
‘heavy’ investment by CDC.
Nomenclature is in its essence a technique to reduce the
amount of available information by assigning a short, yet
still informative human [and machine] readable code to
isolates. Where two isolates share the same code, it implies
that they have the same properties as defined by the
nomenclature scheme that is assumed to be commonly
understood and adhered to.
An additional step in assigning allele identifiers to a
particular set of loci, which also further reduces the
information to a degree that it can be used effectively for
human communication, is to assign an additional unique
identifier to each combination of alleles observed within a
single genome.
Nomenclature Assignment
ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA.
http://ecdc.europa.eu/en/publications/Publications/food-and-waterborne-diseases-next-generation-typing-methods.pdf.
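The two-level assignment described above (an identifier per allele at each locus, then an identifier per observed combination of alleles) can be sketched as follows. The class name, method names, and the sequential numbering are assumptions made for illustration; the sketch does not represent any existing nomenclature server.

```python
class NomenclatureRegistry:
    """Toy registry: numbers alleles per locus and numbers allele combinations."""

    def __init__(self):
        self._alleles = {}   # locus -> {sequence: allele identifier}
        self._profiles = {}  # tuple of (locus, allele identifier) -> profile identifier

    def allele_id(self, locus, sequence):
        """Return the existing identifier for a sequence or assign the next one."""
        known = self._alleles.setdefault(locus, {})
        if sequence not in known:
            known[sequence] = len(known) + 1
        return known[sequence]

    def profile_id(self, genome):
        """genome maps locus name to sequence; loci are ordered by name."""
        profile = tuple(
            (locus, self.allele_id(locus, genome[locus])) for locus in sorted(genome)
        )
        if profile not in self._profiles:
            self._profiles[profile] = len(self._profiles) + 1
        return self._profiles[profile]


registry = NomenclatureRegistry()
isolate_1 = {"locusA": "ATGAAA", "locusB": "ATGCCC"}
isolate_2 = {"locusA": "ATGAAA", "locusB": "ATGCCG"}  # new allele at locusB
print(registry.profile_id(isolate_1), registry.profile_id(isolate_2))
```

Because each lookup is a dictionary access, adding one more isolate does not require revisiting earlier ones, and identifiers already assigned never change.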
wgMLST principle: assignment of unique allele identifiers.
ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA.
http://ecdc.europa.eu/en/publications/Publications/food-and-waterborne-diseases-next-generation-typing-methods.pdf.
infinitely growing*
*SeqSphere+ only uses the accessory genome of the ‘reference genome’.
BIGSdb and Bionumerics use the accessory genome of the pan genome. Furthermore,
for detecting loci/targets by similarity and overlap, BIGSdb scans new draft genomes
against all alleles of a locus and not only against the allele of the ‘reference genome’, as
done by SeqSphere+ and Bionumerics. Thereby, different results might be obtained
depending on when the search was conducted (‘triangulation problem’).
Cluster/outbreak threshold calibration only possible on cgMLST level!
wgMLST Nomenclature
• The MLST sequence type (ST) and clonal complex (CC) concept must and will remain (among many other
reasons, for backwards compatibility).
• For NGS genome-wide gene by gene allele typing with hundreds/thousands of genes/targets from a ‘WGS
typing scheme’ or with ‘core genome genes’ the allele nomenclature for every target/gene must be
controlled.
• For communication between humans (e.g. publication) and to make the results comparable on an
international scale the nomenclature of specific combinations of hundreds/thousands targets/genes
must also be controlled.
• For these specific combinations of hundreds/thousands targets/genes the term Cluster Type (CT) is
proposed.
• CT will be much more discriminatory than a ST; definition is mainly needed for outbreak
investigation/transmission chain analysis.
• CT concept must be able to cope with:
• some missing targets/genes (either not present or not sequenced by chance or not assembled well),
• a few target/gene allele differences due to NGS sequencing errors, intra-host variation and/or
micro-evolutionary changes during an outbreak, and
• different bacterial population structures (e.g., monomorphic vs. panmictic structure) and infection
dynamics (e.g., incubation period and/or transmission mode). Therefore, a CT will be species
specific.
• CT threshold is pragmatically defined as the highest observed number of allele differences in intra-
patient, consecutive and/or outbreak isolates plus 25% number of alleles (rounded) to rule out recent
transmission for sure.
• As the CT will be ‘just’ a number and there will be no biological meaningful relationship between the CT
numbers – otherwise a single expanding nomenclature would be impossible – it is proposed to associate
with every CT the date and location (city and country) of isolation (e.g. CT 399; March 2013, Münster
Germany).
• As a CT will be specific for a ‘WGS typing scheme’ (cgMLST), it is proposed to use e.g. the phrase
Ridom cgMLST CT.
WGS Cluster Type (CT)
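The threshold rule and the tolerance for missing targets can be sketched as below. Several points are assumptions, not statements from the slide: the 25% is taken relative to the highest observed difference, rounding is half-up, missing targets are encoded as None and skipped in pairwise comparison, and two isolates are treated as the same CT when their difference does not exceed the threshold.

```python
def allele_differences(profile_a, profile_b):
    """Count differing alleles, skipping targets missing (None) in either isolate."""
    return sum(
        a != b
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None
    )


def ct_threshold(highest_observed_difference):
    """Highest observed difference plus 25%, rounded half-up (integer arithmetic)."""
    return (5 * highest_observed_difference + 2) // 4


def within_cluster_type(profile_a, profile_b, threshold):
    """Assumed rule: same CT if the allele difference is at most the threshold."""
    return allele_differences(profile_a, profile_b) <= threshold


# Hypothetical calibration: at most 8 allele differences seen among outbreak isolates.
threshold = ct_threshold(8)
isolate_x = (1, 1, 4, None, 2, 7)
isolate_y = (1, 3, 4, 5, 2, None)
print(threshold, allele_differences(isolate_x, isolate_y),
      within_cluster_type(isolate_x, isolate_y, threshold))
```

Since the calibration input depends on population structure and infection dynamics, the resulting threshold has to be derived separately for each species.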
Problems due to:
• additive expansion
• missing data
• entry order
Taxonomical nomenclature principle based on SNP or wgMLST dendrogram.*
*Desirable BUT hardly possible for an additively expandable nomenclature system, as there will
always be changes in the tree (was not possible in the past with MLST or canonical SNPs of monomorphic
bacteria; would violate stability of nomenclature). Furthermore, if done with ‘SNP addresses’ and not
with alleles, very computationally intensive to calculate.
ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA.
http://ecdc.europa.eu/en/publications/Publications/food-and-waterborne-diseases-next-generation-typing-methods.pdf.
Taxonomical/Phylogenetic Nomenclature
Vaz et al. (October, 2014). J Biomed Semantics 5: 43 [PubMed]
cgMLST Nomenclature Harmonization
The TypON microbial typing ontology directly foresees a
REST application programming interface (API) for cgMLST allele
nomenclature services that allows software tools to communicate
bi-directionally with each other.
cgMLST Nomenclature Server(s)
SeqSphere+: Query and authentication API to be released publicly in early 2016.
Submission for SeqSphere+ users has been possible since 2013 without any manual curation
steps involved. Submission API for other tools foreseen for mid 2016.
BIGSdb: Query and authentication API available since mid 2015. Submission API
announced October 2015 (evaluation needed, and SeqSphere+ and Bionumerics must
‘emulate’ the BIGSdb mode of allele calling).
Other PathoNGenTrace Bioinformatics Activities
WGS
Genotyping
Standardization
Visualization of four dimensions
and
Early Warning
From WGS Geno- to
Phenotype Prediction
(resistome & virulome analysis)
From WGS to
Plain Language Report
http://patho-ngen-trace.eu/
SeqSphere+ Visualization of Four Dimensions
released with version 3.0 early October 2015 (also MLST+ term no longer used since then)
Place
Ruppitsch et al. J Clin Microbiol. 2015; 53: 2869 [PubMed].
Time
#Missing values  Sample ID   Good Targets  ST   Collection Date  Country of Isolation  City of Isolation  ZIP of Isolation  Cluster Type
4                12025647    99.8          398  unknown          Austria               ? (unknown)        ? (unknown)       45
4                2010-00770  99.8          398  Feb 2, 2010      Austria               Hartberg           8230              39
4                3230TP3     99.8          398  Jan 22, 2010     Austria               Hartberg           8230              39
3                3230TP5     99.8          403  Jan 22, 2010     Austria               Hartberg           8230              35
8                CIP105458   99.5          2    1959             USA                   ? (unknown)        ? (unknown)       49
0                EGD-e       100.0         35   1924             United Kingdom        Cambridge          ? (unknown)       1
4                L10-10      99.8          398  Jan 12, 2010     Austria               Zell am See        5700              39
4                L14-10      99.8          398  Jan 25, 2010     Austria               Rohrbach           4150              39
4                L16-10      99.8          398  Jan 29, 2010     Austria               Salzburg           5020              39
4                L17-10      99.8          398  Jan 30, 2010     Austria               Krems              3500              39
4                L30-10      99.8          398  Jan 25, 2010     Austria               Ried im Innkreis   4910              39
4                L33-10      99.8          398  Feb 22, 2010     Austria               St. Pölten         3100              39
5                L38-11      99.7          398  2010             Austria               Vienna             1010              41
4                L4-10       99.8          398  Dec 23, 2009     Austria               Gänserndorf        2230              39
2                L71-09      99.9          403  Dec 10, 2009     Austria               Mattersburg        7210              35
5                L75-09      99.7          398  Dec 16, 2009     Austria               Vienna             1100              39
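To show how the Cluster Type column links the ‘type’ dimension to ‘place’, the rows above can be grouped by CT. This is only an illustrative sketch with hypothetical variable names; it copies sample ID, city, and Cluster Type from the table and does not reflect how SeqSphere+ computes or displays the views.

```python
from collections import defaultdict

# (sample ID, city of isolation, Cluster Type) copied from the table above.
isolates = [
    ("12025647", "? (unknown)", 45),
    ("2010-00770", "Hartberg", 39),
    ("3230TP3", "Hartberg", 39),
    ("3230TP5", "Hartberg", 35),
    ("CIP105458", "? (unknown)", 49),
    ("EGD-e", "Cambridge", 1),
    ("L10-10", "Zell am See", 39),
    ("L14-10", "Rohrbach", 39),
    ("L16-10", "Salzburg", 39),
    ("L17-10", "Krems", 39),
    ("L30-10", "Ried im Innkreis", 39),
    ("L33-10", "St. Pölten", 39),
    ("L38-11", "Vienna", 41),
    ("L4-10", "Gänserndorf", 39),
    ("L71-09", "Mattersburg", 35),
    ("L75-09", "Vienna", 39),
]

samples_by_ct = defaultdict(list)
cities_by_ct = defaultdict(set)
for sample_id, city, cluster_type in isolates:
    samples_by_ct[cluster_type].append(sample_id)
    cities_by_ct[cluster_type].add(city)

# Largest clusters first.
for cluster_type in sorted(samples_by_ct, key=lambda ct: -len(samples_by_ct[ct])):
    print(f"CT {cluster_type}: {len(samples_by_ct[cluster_type])} isolate(s), "
          f"{len(cities_by_ct[cluster_type])} location(s)")
```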
‘Person‘ by color
Type
All four dimension
views are interactively
inter-linked and exportable
in publication-quality
scalable vector
graphics (SVG) format.
allele calling (<5min)
Pure bacterial culture / single cell
DNA (≈3.5h)
Rapid NGS (≈28-43h)
De novo or reference
assisted assembly (<1h)
Phenotypic and
epidemiologic
information
LIMS (e.g., via
Excel file or
HL7)
One Disruptive Technology Fits it All –
Genomic Surveillance and More
MLST/rMLST
Evolutionary
analysis
Resistome /
Virulome
Surveillance &
outbreak
investigation
cgMLST
SNP / accessory targets
Antibiotic resistance targets
Toxins & pathogenicity targets
Standardized
hierarchical microbial
typing and more
EBI ENA (Backup raw
reads)
cgMLST
Nomenclature
Server
Dissemination Activities
2nd Conference
Rapid Microbial NGS and Bioinformatics: Translation Into Practice
The event will gather experts from all over the world active in applying Next Generation Sequencing (NGS) techniques to discover the epidemiology, anti-microbial resistance, ecology and evolution of microorganisms. The program will be designed to build a bridge between software developers and end-users.
At a Glance
Date: June 9-11, 2016
Place: Hamburg, Germany
Complete program to be announced online soon.
Registration: will open mid December 2015 at: www.RaMi-NGS.org
Contact: For questions or further information please send an email to [email protected]
The research from the PathoNGen-Trace project has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement N° 278864.