
Dag Harmsen

Univ. Münster, [email protected]

EU PathoNGenTrace Consortium

cgMLST Evolvement and Challenges for Harmonization

19th November, 2015

Commercial Disclosure

Dag Harmsen is co-founder and partial owner of a bioinformatics company (Ridom GmbH, Münster, Germany) that develops software for DNA sequence analysis. Ridom and Ion Torrent/Thermo Fisher (Waltham, MA) partnered and released SeqSphere+ software to speed and simplify whole genome based bacterial typing.

cgMLST Introduction for Outbreak Investigation & Global Nomenclature

Surveillance and Phylogeny from Draft Genomes: ‘Molecular Typing Esperanto’ by Standardized Genome Comparison

k-mer (without alignment) / ANI (Average Nucleotide Identity; with alignment)
+ Works on read, draft & complete genome level; quickly identifies the closest matching genome.
- Whole genome reduced to a single similarity number.
- Additively expandable [≈ O(n)], but only poor mapping to nomenclature possible.

Multiple Genome Alignment (e.g., progressive Mauve)
- Difficult to interpret with draft genomes.
- Computationally intensive (≥ O(n²), limit ≈ 30-50 genomes).
- Not additively expandable, no nomenclature possible.

Genome-wide Mapping & SNP Calling
+ Works well for monomorphic organisms and ‘ad hoc’ analysis & more discriminatory than cgMLST.
- Problematic with rearrangement / recombination events.
- Not additively expandable (at least if not always mapped to the same reference).

Genome-wide Gene by Gene Allele Calling (cgMLST)
+ Scalable, working on single gene to whole genome levels.
+ Both recombination & point mutation accommodated as a single event.
+ Additively expandable [≈ O(1)] & nomenclature possible.

[Figure: aligned example sequence fragments, e.g. …ACGTGATACATACCTATGATATAGCT… and …ACGTGATACATACCTATGCTATAGCT…]

SNP, single nucleotide polymorphism; cgMLST, core genome multi locus sequence typing; n, number of isolates in database.
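To make the alignment-free k-mer comparison more concrete, here is a minimal sketch that reduces two sequences to a single similarity number, using the Jaccard index over their k-mer sets. The function names, the choice of k = 5 and the use of the two example fragments from the slide are assumptions for illustration only; this is not the method of any tool named in this talk.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping k-mers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard index of the two k-mer sets (1.0 means identical sets)."""
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    if not kmers_a and not kmers_b:
        return 1.0  # two sequences shorter than k: treat as identical
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


# The two example fragments from the slide differ at one position.
seq_1 = "ACGTGATACATACCTATGATATAGCT"
seq_2 = "ACGTGATACATACCTATGCTATAGCT"

similarity = jaccard_similarity(seq_1, seq_2)
print(f"k-mer similarity: {similarity:.2f}")
```

The output is one number per genome pair, which is why the slide notes that this approach finds the closest match quickly but maps poorly to a nomenclature.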

Alleles vs. Sequence/SNPs

ST1 = 1,1,1,1,1,1,1
ST2 = 1,2,1,1,1,1,1
ST3 = 1,1,1,2,1,1,1
ST4 = 1,1,1,1,1,1,2

[Figure: clonal founder A and isolates B, C and D, shown once as allelic profiles and once as sequence data; a point mutation and a recombination are each one genetic event.]

The use of allelic profiles, rather than (concatenated) sequences or SNPs, results in the loss of information (reductionist), but patterns of descent are more robust to the effects of horizontal genetic transfer. Using just genes for the analysis (bacteria have a high coding capacity) avoids the frequently repetitive intergenic regions, which are in any case difficult to assemble from 2nd generation NGS data.

Modified from: Ed Feil; Univ. Bath, UK
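The difference between counting alleles and counting SNPs can be sketched in a few lines. The ST1 to ST4 profiles are taken from the slide; the toy locus sequences (`founder`, `point_mutant`, `recombinant`) and all function names are hypothetical and only serve to show that a recombination that imports several SNPs still counts as one allele change.

```python
def allelic_distance(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different allele numbers."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def snp_distance(seq_a, seq_b):
    """Number of mismatching positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def allele_numbers(sequences):
    """Number alleles by first appearance; identical sequences share a number."""
    seen = {}
    return [seen.setdefault(seq, len(seen) + 1) for seq in sequences]


# Allelic profiles from the slide.
profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 2, 1, 1, 1, 1, 1),
    "ST3": (1, 1, 1, 2, 1, 1, 1),
    "ST4": (1, 1, 1, 1, 1, 1, 2),
}

# Hypothetical toy locus (not from the slide).
founder = "ACGTGATACATACCTATGCT"
point_mutant = "ACGTGATACATACCTATGAT"  # one base changed
recombinant = "ACGTCTTGCATACCTATGCT"   # one imported segment, several bases changed

print(snp_distance(founder, point_mutant), snp_distance(founder, recombinant))
print(allele_numbers([founder, point_mutant, recombinant, founder]))
```

On the sequence level the recombinant looks more distant than the point mutant; on the allele level both are simply a new allele number, that is, one event each.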

SNP vs. …

• M. tuberculosis outbreak
• Reference mapped against MtbC H37Rv and SNP calling.
• Outgroup strains: same MIRU-VNTR type with no epi-link.

Kohl et al. (2014). JCM 52: 2479 [PubMed].

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=24789177

… cgMLST based Typing

• Reference mapped against MtbC H37Rv.
• Core genome schema consists of 3,257 coding genes (76.8% of whole genome).
• 3,041 genes shared by all 26 isolates analyzed with SeqSphere+.

Kohl et al. (2014). JCM 52: 2479 [PubMed].

Enterococcus: de Been et al. (2015). JCM pii: JCM.01946-15 [PubMed].


http://www.ncbi.nlm.nih.gov/pubmed/26400782

Genome-wide Genes vs. Whole Genome Consensus

Sequencing and Assembling Mismatches

cgMLST Evolvement

Jolley & Maiden (November, 2010). BMC Bioinformatics 11: 595 [PubMed].

ClonalFrame trees were generated from 43 streptococcal genome sequences, i.e., from concatenated sequences, using (A) seven MLSA gene fragment loci and (B) 77 complete genes found to be present throughout the genus, identified by BIGSdb.


Mellmann et al. (July, 2011). PLoS One 6: e22751 [PubMed].

Phylogenetic Analysis of EHEC O104:H4

Method

First real-time prospective outbreak genomics analysis. Hybrid assembly from reference mapping & de novo assembly with Ion Torrent PGM WGS data, BIGSdb genome-wide gene-by-gene allele calling against a fixed set of loci/targets (n = 1,144; STEC core genome gene scheme defined before the outbreak analysis), and a SeqSphere minimum-spanning tree. Not yet termed so, but the first cgMLST application; internally called at that time ‘super MLST’ and/or ‘MLST on steroids’.
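A minimum-spanning tree over allelic profiles can be sketched with Prim's algorithm. This is a generic textbook construction under stated assumptions, not the SeqSphere implementation; the four toy profiles and all names are hypothetical.

```python
def allelic_distance(profile_a, profile_b):
    """Number of loci with different allele numbers."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def minimum_spanning_tree(profiles):
    """Prim's algorithm on the complete graph of isolates.

    profiles: dict mapping isolate name -> tuple of allele numbers.
    Returns a list of (from_isolate, to_isolate, distance) edges.
    """
    names = list(profiles)
    if not names:
        return []
    in_tree = [names[0]]  # start from the first isolate
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge that connects the tree to an isolate outside it.
        best = min(
            (
                (u, v, allelic_distance(profiles[u], profiles[v]))
                for u in in_tree
                for v in names
                if v not in in_tree
            ),
            key=lambda edge: edge[2],
        )
        edges.append(best)
        in_tree.append(best[1])
    return edges


# Hypothetical toy profiles over five loci.
toy_profiles = {
    "A": (1, 1, 1, 1, 1),
    "B": (1, 2, 1, 1, 1),
    "C": (1, 2, 3, 1, 1),
    "D": (4, 4, 4, 4, 1),
}

tree = minimum_spanning_tree(toy_profiles)
for source, target, distance in tree:
    print(f"{source} - {target}: {distance} allele difference(s)")
```

When several edges tie, different but equally short trees are possible; this sketch simply takes the first one found.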


Grant agreement number: 278864-2
EC contribution: 5.995.267 €
Duration: 54 months (01/01/2012 - 30/06/2016)
Funding scheme: SME-targeted Collaborative Project
URL: http://www.patho-ngen-trace.eu/

Scientific Advisory Board

Marc J. Struelens, European Centre for Disease Prevention and Control (ECDC), Stockholm, Sweden
Rene S. Hendriksen, Technical University of Denmark - National Food Institute, Lyngby, Denmark
Stephen H. Gillespie, University of St Andrews, St Andrews, Scotland, UK
Gary Van Domselaar, National Microbiology Laboratory, Public Health Agency of Canada


Consortium

Dag Harmsen, Universität Münster
Stefan Niemann, Coordinator, FZ Borstel
Philip Supply, Genoscreen
Martin C.J. Maiden, University of Oxford
Bruno Pot, Applied Maths NV
Jörg Rothgänger, Ridom GmbH
Ronald Burggrave, Piext BV
Claudia Giehl, Eurice GmbH

Associated Partners

Alexander Mellmann, Univ. Münster, Germany
Roland Diel, Univ. Kiel, Germany
Joao Carrico, Univ. Lisbon, Portugal

Main Objectives

• Develop new, completely integrated bioinformatics microbial genomics tools for fast and easy quality-controlled data extraction and interpretation for general diagnostics and public health applications
• Streamline and implement new quality control procedures for the whole genomics process
• Test and validate the performance of NGS for early diagnosis and for monitoring the spread of major microbial pathogens

Work-Packages

• WP1. Development of easy to use software tools for whole genome comparison

(Leader: Applied Maths, Partner: Ridom, Oxford; User: Genoscreen, Münster, Borstel, Oxford)

• WP2. Next generation high throughput genome wide analysis – new technologies and optimization

(Leader: Genoscreen, Partners: Münster, [Ion Torrent]; User: Borstel, Münster, Oxford)

• WP3. Use of whole genome sequencing & ODM for genotyping of MtbC

(Leader: Borstel, Partner: Genoscreen, Oxford, PiEXT [OpGen], Münster)

• WP4. Use of whole genome sequencing & ODM for genotyping of MRSA

(Leader: Münster, Partner: Genoscreen, Oxford, PiEXT [+OpGen])

• WP5. Use of whole genome sequencing & ODM for genotyping of Campylobacter

(Leader: Oxford, Partner: Genoscreen, Münster, PiEXT [+OpGen])

• WP6. Innovation related activities (IP, Dissemination, and Exploitation)

(Leader: Eurice, Partner: all)

• WP7. Management

(Leader: Eurice, Partner: all)

Prospective Real-time Studies

• Campylobacter: prospective surveillance in Oxfordshire, UK has been ongoing with WGS data since 2010; moving to more real-time starting from 2015 (700-900 isolates per year).
• MtbC: prospective surveillance in Hamburg, DE has been ongoing with WGS data since 2005; moving to more real-time starting from 2015 (110-130 isolates per year).
• MRSA: prospective real-time (TaT 4-5 days) surveillance of all multi-drug resistant bacteria (MDR; including MRSA) of University Hospital Münster, DE since October 2013 (1,200-1,500 isolates per year).

Jolley et al. (April, 2012). Microbiology 158: 1005 [PubMed].

In 2013/2014 also rMLST STs added.


Jünemann et al. (April, 2013). Nature Biotechnology 31: 294 [PubMed].

Evaluation of contiguity and consensus accuracy of draft de novo assemblies from benchtop sequencers. (a) Evolution of genome contiguity for GSJ, MiSeq and PGM. The contiguity of the de novo assembly consensus sequences generated by MIRA was analyzed for 4,671 non-pseudo- or non-paralogous chromosomal coding E. coli Sakai NCBI reference sequence genes. This genome-wide gene-by-gene allele analysis was performed with the Ridom SeqSphere+ software. (b) Venn diagram of consensus sequencing accuracy for PGM 300 bp, MiSeq 2 × 250-bp PE (MIS) and GSJ. Reported consensus errors were analyzed for 4,632 coding Sakai genes that could be retrieved using SeqSphere+ for all three platforms. Numbers of variants confirmed by bidirectional Sanger sequencing are indicated in parentheses.

*Avoidance of the term core genome, as core genome genes are here determined from DNA with rather high similarity values!


Maiden et al. (October, 2013). Nature Rev. Microbiol. 11: 728 [PubMed].

PathoNGenTrace Yearly Meeting (May 13th - 14th, 2013). Cambridge, UK.
Bruno Pot and Hannes Pouseele, Applied Maths. Kmers are the ways how to compare genomes (work done together with Ilya Chorny, Illumina).

IMMEM X (October 2nd - 5th, 2013). Paris, France.
Hannes Pouseele, Applied Maths. Seven ways (= one of them wgMLST) how to leave your lover (= PFGE).

cgMLST was at that time for the authors NOT a fixed set of loci but the ‘shared’ loci of the selected isolates under study.


Kohl et al. (April, 2014). JCM 52: 2479 [PubMed].

First original publication using the term cgMLST and using a fixed genome-wide set of genes.


Tools for Microbial Genotypic Surveillance and Phylogeny

Wyres et al. (2014). WGS analysis and interpretation in clinical and public health microbiology laboratories: what are the requirements and how do existing tools compare? Pathogens 3: 437 [doi:10.3390/pathogens3020437].

[Table from slide; row labels (tool names) not preserved]

ENA Submission | Included Database | Nomenclature
--------------------------------------------------
Yes            | Yes               | No
No             | No                | No
No             | No                | No
--------------------------------------------------
No             | Yes               | Yes
No             | No                | No
Yes            | Yes               | Yes
No             | No                | No

http://www.mdpi.com/2076-0817/3/2/437

Standardized Hierarchical Microbial WGS Typing

Standardized hierarchical microbial WGS typing approach. From bottom to top with increasing discriminatory power. MLST, multi locus sequence typing; rMLST, ribosomal MLST; cgMLST, core genome MLST; wgMLST, whole genome MLST; SNP, single nucleotide polymorphism.

[Figure: pyramid of typing levels; axis: Discriminatory Power. Purpose labels: Speciation by rMLST; Evolutionary Analysis; Global Nomenclature / Surveillance; Local Outbreak Investigation.]

rMLST (pan-bacterial-specific)
Jolley et al. (2012). Microbiology 158: 1005 [PubMed]

MLST (species-specific; also needed for backwards compatibility)
Maiden et al. (1998). PNAS 95: 3140 [PubMed]

SNPs, confirmatory/canonical (species-specific)
e.g., Van Ert et al. (2007). JCM 45: 47 [PubMed]

cgMLST (species-specific)
STEC: Mellmann et al. (2011). PLoS One 6: e22751 [PubMed]
S. aureus: Leopold et al. (2014). JCM 52: 2365 [PubMed]
MtbC: Kohl et al. (2014). JCM 52: 2479 [PubMed]
K. pneumo.: Bialek et al. (2014). EID 20: 1812 [PubMed]
Lp: Moran-Gilad et al. (2015). Euro Surveill. 20: pii: 21186 [PubMed]
Listeria: Ruppitsch et al. (2015). JCM 53: 2869 [PubMed]
E. faecium: de Been et al. (2015). JCM 53: [PubMed]

Outbreak- / lineage-specific: SNPs* and alleles from accessory reference genome genes or pan-genome based wgMLST
SNP: e.g., Köser et al. (2012). NEJM 366: 2267 [PubMed]
wgMLST/‘shared’ genome:
N. meng.: Jolley et al. (2012). JCM 50: 3046 [PubMed]
C. jejuni: Cody et al. (2013). JCM 51: 2526 [PubMed]

*from de novo assembled [PubMed] and/or mapped genomes















cgMLST Challenges for Harmonization

cgMLST and API/Ontology Workshop

Organization: Martin Maiden and Dag Harmsen

Date: 2nd & 3rd March, 2015

Place: Oxford University, UK

Participants: Oxford Univ., Univ. Münster, FZ Borstel, Univ. Warwick, Inst. Pasteur, Univ. Lisboa, PHE, CDC, Ridom, and Applied Maths

Informal agreement that cgMLST is a fixed, community-agreed set of genome-wide genes that will be at least the minimum common denominator for analyzing whole genome shotgun (WGS) sequence data for surveillance purposes!

ECDC (October, 2015).
http://ecdc.europa.eu/en/publications/Publications/food-and-waterborne-diseases-next-generation-typing-methods.pdf

Describes a top-down approach that also includes several tiers of reporting (e.g., national and international). However, in the past the most successful bacterial genotyping initiatives (e.g., MLST, spa-typing, or MIRU-VNTR) followed a bottom-up approach (grass-roots, basic democratic or even anarchic). Only the PulseNet initiative followed a top-down approach, but it never resulted in a public nomenclature and involved ‘heavy’ investment by CDC.

Nomenclature is in its essence a technique to reduce the amount of available information by assigning a short, yet still informative human [and machine] readable code to isolates. Where two isolates share the same code, it implies that they have the same properties as defined by the nomenclature scheme that is assumed to be commonly understood and adhered to.

An additional step in assigning allele identifiers to a particular set of loci, which also further reduces the information to a degree that it can be used effectively for human communication, is to assign an additional unique identifier to each combination of alleles observed within a single genome.

Nomenclature Assignment

ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA.
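The two assignment steps described above (an identifier per allele, then an identifier per combination of alleles) can be sketched with plain dictionaries. The class `Nomenclature`, its method names and the toy loci are hypothetical; a real nomenclature server additionally needs curation, persistence and access control, none of which is shown here.

```python
class Nomenclature:
    """In-memory sketch of additive allele and profile numbering."""

    def __init__(self):
        self._alleles = {}   # locus name -> {allele sequence: allele id}
        self._profiles = {}  # tuple of allele ids -> profile id

    def allele_id(self, locus, sequence):
        """Return the id of a known allele or assign the next free id."""
        known = self._alleles.setdefault(locus, {})
        return known.setdefault(sequence, len(known) + 1)

    def profile_id(self, genome):
        """genome: dict mapping locus name -> allele sequence.

        Returns (tuple of allele ids in sorted locus order, profile id).
        """
        allele_ids = tuple(
            self.allele_id(locus, genome[locus]) for locus in sorted(genome)
        )
        profile = self._profiles.setdefault(allele_ids, len(self._profiles) + 1)
        return allele_ids, profile


# Hypothetical toy genomes with two loci each.
nomenclature = Nomenclature()
genome_1 = {"locus_a": "ATGAAA", "locus_b": "ATGCCC"}
genome_2 = {"locus_a": "ATGAAA", "locus_b": "ATGCCT"}

print(nomenclature.profile_id(genome_1))
print(nomenclature.profile_id(genome_2))
print(nomenclature.profile_id(genome_1))  # same code as the first call
```

Because each lookup is a dictionary access, adding a new isolate does not require revisiting the isolates already stored, which is the sense in which such a scheme is additively expandable.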



wgMLST principle: assignment of unique allele identifiers.



infinitely growing*

*SeqSphere+ only uses the accessory genome of the ‘reference genome’. BIGSdb and Bionumerics use the accessory genome of the pan genome. Furthermore, for detecting loci/targets by similarity and overlap, BIGSdb scans new draft genomes against all alleles of a locus and not only against the allele of the ‘reference genome’, as done by SeqSphere+ and Bionumerics. Different results might therefore be obtained depending on when the search was conducted (‘triangulation problem’).

Cluster/outbreak threshold calibration is only possible on cgMLST level!

wgMLST Nomenclature


• The MLST sequence type (ST) and clonal complex (CC) concept must and will remain (among many other reasons, for backwards compatibility).
• For NGS genome-wide gene by gene allele typing with hundreds/thousands of genes/targets from a ‘WGS typing scheme’ or with ‘core genome genes’, the allele nomenclature for every target/gene must be controlled.
• For communication between humans (e.g. publication) and to make the results comparable on an international scale, the nomenclature of specific combinations of hundreds/thousands of targets/genes must also be controlled.
• For these specific combinations of hundreds/thousands of targets/genes the term Cluster Type (CT) is proposed.
• A CT will be much more discriminatory than an ST; the definition is mainly needed for outbreak investigation/transmission chain analysis.
• The CT concept must be able to cope with:
    • some missing targets/genes (either not present, not sequenced by chance, or not assembled well),
    • a few target/gene allele differences due to NGS sequencing errors, intra-host variation and/or micro-evolutionary changes during an outbreak, and
    • different bacterial population structures (e.g., monomorphic vs. panmictic structure) and infection dynamics (e.g., incubation period and/or transmission mode). Therefore, a CT will be species specific.
• The CT threshold is pragmatically defined as the highest observed number of allele differences in intra-patient, consecutive and/or outbreak isolates plus 25% of that number of alleles (rounded), to rule out recent transmission for sure.
• As the CT will be ‘just’ a number and there will be no biologically meaningful relationship between the CT numbers (otherwise a single expanding nomenclature would be impossible), it is proposed to associate with every CT the date and location (city and country) of isolation (e.g. CT 399; March 2013, Münster, Germany).
• As a CT will be specific for a ‘WGS typing scheme’ (cgMLST), it is proposed to use e.g. the phrase Ridom cgMLST CT.

WGS Cluster Type (CT)
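The pragmatic CT threshold rule and the tolerance for missing targets can be sketched as follows. The half-up rounding, the pairwise skipping of missing targets (marked `None`), the inclusive comparison and all names and numbers are assumptions for illustration; the slides do not specify these details.

```python
def allele_differences(profile_a, profile_b):
    """Count differing targets, skipping targets missing (None) in either profile."""
    return sum(
        a != b
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None
    )


def ct_threshold(max_observed_differences):
    """Highest observed number of allele differences plus 25%, rounded half up."""
    return (max_observed_differences * 125 + 50) // 100


def within_threshold(profile_a, profile_b, threshold):
    """True if two isolates differ by no more than the threshold (assumed inclusive)."""
    return allele_differences(profile_a, profile_b) <= threshold


# Hypothetical calibration: at most 8 allele differences were seen among
# intra-patient, consecutive and/or outbreak isolates of a species.
threshold = ct_threshold(8)
print(threshold)

# Hypothetical profiles over six targets; None marks a missing target.
isolate_1 = (1, 1, 1, 1, None, 1)
isolate_2 = (1, 2, 1, 1, 7, 1)
print(allele_differences(isolate_1, isolate_2))
print(within_threshold(isolate_1, isolate_2, threshold))
```

Because the highest observed number of differences depends on population structure and infection dynamics, the calibration value has to be determined separately for each species.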

Problems due to:

• additive expansion

• missing data

• entry order

Taxonomical nomenclature principle based on SNP or wgMLST dendrogram.*

*Desirable BUT hardly possible for an additively expandable nomenclature system, as there will always be changes in the tree (this was not possible in the past with MLST or canonical SNPs of monomorphic bacteria; it would violate stability of nomenclature). Furthermore, if done with ‘SNP addresses’ and not with alleles, it is very compute intensive to calculate.



Taxonomical/Phylogenetic Nomenclature


Vaz et al. (October, 2014). J Biomed Semantics 5: 43 [PubMed]

cgMLST Nomenclature Harmonization

The TypON microbial typing ontology directly foresees a REST application programming interface (API) for cgMLST allele nomenclature services that allows software tools to communicate bi-directionally with each other.


cgMLST Nomenclature Server(s)

SeqSphere+: Query and authentication API to be released to the public in early 2016. Submission has been possible for SeqSphere+ users since 2013 without any manual curation steps involved. Submission API for other tools foreseen for mid 2016.

BIGSdb: Query and authentication API available since mid 2015. Submission API announced October 2015 (evaluation needed, and SeqSphere+ and Bionumerics must ‘emulate’ the BIGSdb mode of allele calling).

Other PathoNGenTrace Bioinformatics Activities

• WGS Genotyping Standardization
• Visualization of four dimensions and Early Warning
• From WGS Geno- to Phenotype Prediction (resistome & virulome analysis)
• From WGS to Plain Language Report

http://patho-ngen-trace.eu/

https://patho-ngen-trace.eu/

SeqSphere+ Visualization of Four Dimensions
released with version 3.0 early October 2015 (also MLST+ term no longer used since then)

[Figure: SeqSphere+ linked views of the four dimensions: Place, Time, ‘Person’ (by color) and Type.]

Ruppitsch et al. J Clin Microbiol. 2015; 53: 2869 [PubMed].

#Missing values | Sample ID | Good Targets | ST | Collection Date | Country of Isolation | City of Isolation | ZIP of Isolation | Cluster Type
4 | 12025647 | 99.8 | 398 | unknown | Austria | ? (unknown) | ? (unknown) | 45
4 | 2010-00770 | 99.8 | 398 | Feb 2, 2010 | Austria | Hartberg | 8230 | 39
4 | 3230TP3 | 99.8 | 398 | Jan 22, 2010 | Austria | Hartberg | 8230 | 39
3 | 3230TP5 | 99.8 | 403 | Jan 22, 2010 | Austria | Hartberg | 8230 | 35
8 | CIP105458 | 99.5 | 2 | 1959 | USA | ? (unknown) | ? (unknown) | 49
0 | EGD-e | 100.0 | 35 | 1924 | United Kingdom | Cambridge | ? (unknown) | 1
4 | L10-10 | 99.8 | 398 | Jan 12, 2010 | Austria | Zell am See | 5700 | 39
4 | L14-10 | 99.8 | 398 | Jan 25, 2010 | Austria | Rohrbach | 4150 | 39
4 | L16-10 | 99.8 | 398 | Jan 29, 2010 | Austria | Salzburg | 5020 | 39
4 | L17-10 | 99.8 | 398 | Jan 30, 2010 | Austria | Krems | 3500 | 39
4 | L30-10 | 99.8 | 398 | Jan 25, 2010 | Austria | Ried im Innkreis | 4910 | 39
4 | L33-10 | 99.8 | 398 | Feb 22, 2010 | Austria | St. Pölten | 3100 | 39
5 | L38-11 | 99.7 | 398 | 2010 | Austria | Vienna | 1010 | 41
4 | L4-10 | 99.8 | 398 | Dec 23, 2009 | Austria | Gänserndorf | 2230 | 39
2 | L71-09 | 99.9 | 403 | Dec 10, 2009 | Austria | Mattersburg | 7210 | 35
5 | L75-09 | 99.7 | 398 | Dec 16, 2009 | Austria | Vienna | 1100 | 39

All four dimension views are inter-linked interactively and exportable in publication quality scalable vector graphics (SVG) format.


One Disruptive Technology Fits it All – Genomic Surveillance and More

[Figure: workflow from sample to standardized hierarchical microbial typing and more]

Workflow steps:
• Pure bacterial culture / single cell
• DNA (≈3.5h)
• Rapid NGS (≈28-43h)
• De novo or reference assisted assembly (<1h)
• Allele calling (<5min)

Inputs and connected systems:
• Phenotypic and epidemiologic information
• LIMS (e.g., via Excel file or HL7)
• EBI ENA (backup raw reads)
• cgMLST Nomenclature Server

Targets: MLST/rMLST; cgMLST; SNP / accessory targets; antibiotic resistance targets; toxins & pathogenicity targets

Uses: evolutionary analysis; surveillance & outbreak investigation; resistome / virulome

Dissemination Activities

2nd Conference
Rapid Microbial NGS and Bioinformatics: Translation Into Practice

The event will gather experts from all over the world active in applying Next Generation Sequencing (NGS) techniques to discover the epidemiology, anti-microbial resistance, ecology and evolution of microorganisms. The program will be designed to build a bridge between software developers and end-users.

At a Glance
Date: June 9-11, 2016
Place: Hamburg, Germany
Complete program to be announced online soon.
Registration: will open mid December 2015 at www.RaMi-NGS.org
Contact: For questions or further information please send an email to [email protected]

The research from the PathoNGen-Trace project has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement N° 278864.

http://www.rami-ngs.org/

