IMMEM XI
Navigating Microbial Genomes: Insights from the Next Generation
9 – 12 March 2016, Estoril, Portugal

The EpiQuant framework for assessing genetic and epidemiologic concordance: towards improved use of genomic data in epidemiological applications.

Ben Hetman 1,2; Steven Mutschall 1; Vic Gannon 1; James Thomas 2; and Eduardo Taboada 1

1 National Microbiology Laboratory at Lethbridge, Public Health Agency of Canada, Lethbridge AB, Canada.
2 Department of Biological Sciences, University of Lethbridge, Lethbridge AB, Canada.

[Figure: bar chart of enteric diseases; bar values shown with post-correction estimates (*) in parentheses]

Campylobacteriosis: 29.3 (*447)
Salmonellosis: 19.67 (*269)
Giardiasis: 11.12 (*24)
Shigellosis: 3.08 (*4)
Verotoxigenic E. coli (VTEC): 1.94 (*39)
Cryptosporidiosis: 1.56 (*7)
Listeriosis: 0.36 (*.55)
Cyclosporiasis: 0.32 (*7.5)

* Post-correction estimate

Sources: Thomas et al. (2013), doi:10.1089/fpd.2012.1389; FoodNet Canada Short Report 2013

Campylobacter is a public health challenge

#1 bacterial gastrointestinal disease in Canada and a leading foodborne pathogen worldwide (300-500 million cases)

Self-limiting illness, highly under-reported, largely sporadic

The epidemiology of campylobacteriosis is daunting

Source: Julie Arsenault (PhD Thesis), papyrus.bib.umontreal.ca/jspui/handle/1866/4625

Widespread in “farm-to-fork” and “source-to-tap”
• high prevalence in most major livestock species
• found in many wild animal species, insects, surface waters

Difficult to establish sources of exposure and routes of transmission

Crisis = Opportunity

WGS to the rescue!!!

Can rapidly generate different clusters of isolates at an almost unlimited number of thresholds

E.g.: Do groups formed by genomic relationships agree with those formed by epidemiologic relationships?

- OR -

What is the optimal threshold for forming clusters that agree with epidemiologic relationships?

Genomic data…so many options for thresholding

WGS based analyses still require knowledge of the epidemiology to guide clustering of genomic data into “epidemiologically relevant clusters”

Those who make many species are the 'splitters' and those who make few are the 'lumpers'… – C. Darwin (1857)

Clustering thresholds have been with us forever…

Need to calibrate our analyses to ensure our results exploit the high resolution of WGS data while remaining epidemiologically relevant

Building a model for quantifying epidemiological similarity

“Essentially, all models are wrong, but some are useful.”

George E.P. Box (1919-2013)

How to relate epidemiologic and genomic clustering?

1. Adjusted Wallace Coefficient (AWC): Carriço et al. (Comparing Partitions)

The directional likelihood that two isolates clustered together using one method will also be grouped together by the second method

[Figure: AWC illustrated for Strain 1 and Strain 2 in WGS clusters vs. Epi clusters]

2. Intra-cluster cohesion (ICC)

A measure of the genomic and/or epidemiologic homogeneity of the isolates within a cluster

[Figure: examples of High ICC and Low ICC]
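The directional concordance described above can be sketched in Python. This is a minimal illustration, assuming the standard definition of the adjusted Wallace coefficient, AW = (W - Wi) / (1 - Wi), where W is the Wallace coefficient from partition A to partition B and Wi is its expected value if the two partitions were independent. The function names and the toy cluster labels are hypothetical and are not taken from the EpiQuant code.

```python
from collections import Counter
from itertools import combinations


def wallace(labels_a, labels_b):
    """W(A->B): of the isolate pairs that share a cluster in A,
    the fraction that also share a cluster in B."""
    together_a = 0
    together_both = 0
    for i, j in combinations(range(len(labels_a)), 2):
        if labels_a[i] == labels_a[j]:
            together_a += 1
            if labels_b[i] == labels_b[j]:
                together_both += 1
    if together_a == 0:
        raise ValueError("partition A has no multi-isolate clusters")
    return together_both / together_a


def expected_wallace(labels_b):
    """Expected W(A->B) under independence: the chance that two isolates
    drawn without replacement share a cluster in B."""
    n = len(labels_b)
    sizes = Counter(labels_b).values()
    return sum(s * (s - 1) for s in sizes) / (n * (n - 1))


def adjusted_wallace(labels_a, labels_b):
    """Adjusted Wallace coefficient AW(A->B) = (W - Wi) / (1 - Wi)."""
    w = wallace(labels_a, labels_b)
    wi = expected_wallace(labels_b)
    if wi == 1.0:
        raise ValueError("partition B places every isolate in one cluster")
    return (w - wi) / (1 - wi)


# Toy labels for six isolates (illustrative only).
wgs_clusters = ["g1", "g1", "g1", "g2", "g2", "g3"]
epi_clusters = ["e1", "e1", "e2", "e2", "e2", "e3"]
awc_wgs_to_epi = adjusted_wallace(wgs_clusters, epi_clusters)
```

Because the coefficient is directional, `adjusted_wallace(wgs_clusters, epi_clusters)` and `adjusted_wallace(epi_clusters, wgs_clusters)` need not agree in general, which is why both directions are usually reported.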

Comparing epidemiology vs. genomics

Need a model to assess strain to strain relationships based on isolate epidemiology so we can directly compare them against the WGS data

[Figure: workflow diagram]
Isolate Selection
Genomics Workflow: Sequencing, Assembly, Annotation, In-Silico Typing (Core Analysis, MIST)
Epidemiology Workflow: Metadata Curation (Source, Location, Date), Quantify Epi-Similarities
Cluster Analysis & Analysis of concordance

The challenge with epidemiological data

[Figure labels: Source, Spatial, Temporal]

Surveillance data is inherently less comprehensive than outbreak data

Metadata is generally qualitative/categorical, not quantitative

Establish a metric that summarizes the relationships between isolates based on basic epidemiologic metadata

Clustering of isolates based on epidemiological metadata

Our proposed approach: A model for quantifying epidemiological similarity between strains based on three primary factors: source, space, time

EpiSym = σ(source) + γ(geospatial) + τ(temporal)

σ = coefficient for Source
γ = coefficient for Geospatial
τ = coefficient for Temporal
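The weighted sum above can be written as a small Python function. This is a sketch under stated assumptions: each of the three component similarities is taken to lie in [0, 1], and the coefficients are taken to be non-negative and to sum to 1 so that EpiSym also stays in [0, 1]. The slides do not state these constraints, and the default coefficient values below are placeholders rather than the authors' choices.

```python
import math


def epi_sym(source, geospatial, temporal, sigma=0.5, gamma=0.3, tau=0.2):
    """EpiSym = sigma*source + gamma*geospatial + tau*temporal.

    Assumption (not from the slides): component similarities are in [0, 1]
    and the coefficients are non-negative and sum to 1.
    """
    weights = (sigma, gamma, tau)
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("coefficients must be non-negative and sum to 1")
    for name, value in (("source", source),
                        ("geospatial", geospatial),
                        ("temporal", temporal)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} similarity must be in [0, 1]")
    return sigma * source + gamma * geospatial + tau * temporal


# Hypothetical component similarities for one pair of isolates.
example_score = epi_sym(source=0.658, geospatial=0.9, temporal=0.8)
```

Changing the coefficients shifts how much each factor drives the epidemiological clusters; a larger `sigma` would make source dominate, for example.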

Building a model for epidemiological similarity

Spatial = [equation not recovered]

Temporal = [equation not recovered]

Where
• dist_ab is given by the Haversine formula
• x, y = sampling dates

Quantifying epi-similarities: Spatial and Temporal

‘Spatial’ and ‘Temporal’ factors required for the EpiSym coefficient are relatively simple to build into the equation
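As a rough illustration of the two inputs, the sketch below computes the Haversine great-circle distance between two sampling locations and the number of days between two sampling dates. How the slides convert these distances into similarity scores is not recoverable from the text, so the linear rescaling to [0, 1] used here (with the hypothetical caps `max_dist_km` and `max_days`) is an assumption made for illustration only.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def spatial_similarity(dist_km, max_dist_km):
    """Assumed rescaling: 1 at zero distance, 0 at or beyond max_dist_km."""
    return 1.0 - min(dist_km, max_dist_km) / max_dist_km


def temporal_similarity(x, y, max_days):
    """Assumed rescaling of |x - y| in days: 1 for same day, 0 at max_days."""
    gap = abs((x - y).days)
    return 1.0 - min(gap, max_days) / max_days


# Hypothetical sampling dates three days apart, capped at 30 days.
t_sim = temporal_similarity(date(2016, 3, 9), date(2016, 3, 12), max_days=30)
```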

• Identify all available sources
• Identify core epidemiological attributes
• Assess each source independently and completely for each attribute
• Score the pairwise similarity between any two sources based on their shared epidemiological attributes

Quantifying Source-Source Similarities

Source(i, j) = Σ *(i + j) / n

Where
• i, j = two sources being compared
• *(i + j) = number of matching attributes
• n = maximum possible score

An example: ‘faecal cow’ vs. ‘retail chicken’

Similarity = Σ Pairwise Matches / Maximum Possible Score = 12.5 / 19 = 0.658
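The scoring above can be sketched as follows. The slides do not list the attributes or the per-attribute scoring rule, so everything specific here is an assumption: each source is described by hypothetical attributes scored 0, 0.5 or 1, and the match for one attribute is taken as 1 minus the absolute difference, which allows the half-point totals seen in the 12.5 / 19 example.

```python
def source_similarity(profile_i, profile_j):
    """Sum of per-attribute match scores divided by the maximum possible score.

    Assumption: both profiles score the same attributes on a 0 / 0.5 / 1
    scale, and each attribute contributes at most 1 to the total.
    """
    if profile_i.keys() != profile_j.keys():
        raise ValueError("both sources must be scored on the same attributes")
    matches = sum(1.0 - abs(profile_i[k] - profile_j[k]) for k in profile_i)
    return matches / len(profile_i)


# Hypothetical attribute profiles (placeholder names, not real data).
source_x = {"attr_1": 1, "attr_2": 0, "attr_3": 0.5, "attr_4": 1}
source_y = {"attr_1": 1, "attr_2": 1, "attr_3": 0, "attr_4": 1}
xy_similarity = source_similarity(source_x, source_y)

# The arithmetic from the slide: 12.5 matching points out of a maximum of 19.
slide_example = round(12.5 / 19, 3)
```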

Once source similarity is quantified, we can compute overall EpiSym

We can systematically compute EpiSym across large datasets → epi clusters

Comparison to genomic clusters using cluster concordance metrics
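One way to go from pairwise EpiSym values to epi clusters is sketched below. The slides do not say which clustering algorithm is used, so single-linkage grouping at a fixed similarity threshold is swapped in here purely as an illustration. The function name, the matrix and the thresholds are hypothetical.

```python
def cluster_at_threshold(sim, threshold):
    """Single-linkage clustering on a symmetric similarity matrix.

    Two isolates end up in the same cluster when a chain of pairs with
    similarity >= threshold connects them. Returns one label per isolate,
    numbered 0, 1, 2, ... in order of first appearance.
    """
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Hypothetical pairwise EpiSym matrix for four isolates.
epi_matrix = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
epi_labels = cluster_at_threshold(epi_matrix, threshold=0.7)
```

The resulting label list has the same shape as a list of genomic cluster assignments, so the two can be passed directly to a concordance metric.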

[Figure: heatmap of epidemiological similarity; source categories Clinical, Animal, Environmental; clusters labelled A to P]

Major clusters based on source factor

Subclusters further refined by spatial and temporal factors

Results: epidemiological clustering of C. jejuni isolates

[Figure: the same heatmap (Clinical, Animal, Environmental; clusters A to P) with regions 1 to 5 highlighted]

Clusters of secondary heat correspond to isolates with similar geography and temporal data, but different sources

Results: epidemiological clustering of C. jejuni isolates

Calibrating WGS typing for epidemiologic investigations

We can identify the clusters obtained at varying thresholds and compare them to epidemiological clusters to look for ‘best-fit’

An advantage of WGS is the flexibility in thresholding that is possible

Calibrating WGS typing for epidemiologic investigations

[Figure: genomic cluster homogeneity vs. epidemiologic cluster homogeneity]

Calculate point of highest genomic-cohesion while maintaining
• Multi-isolate clusters
• High epidemiologic validity
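The calibration step can be sketched as a sweep over candidate genomic thresholds. The slides do not give a formula for cohesion, so this example assumes it is the mean pairwise similarity within each multi-isolate cluster, averaged across clusters. The cut-offs `min_epi` and `min_clusters`, the function names and the toy matrices are all hypothetical.

```python
from itertools import combinations


def mean_cohesion(sim, labels):
    """Assumed cohesion measure: mean within-cluster pairwise similarity,
    averaged over clusters with at least two isolates.
    Returns (cohesion, number_of_multi_isolate_clusters)."""
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)
    per_cluster = []
    for idxs in members.values():
        pairs = list(combinations(idxs, 2))
        if pairs:  # singletons have no pairs and are skipped
            per_cluster.append(sum(sim[i][j] for i, j in pairs) / len(pairs))
    if not per_cluster:
        return 0.0, 0
    return sum(per_cluster) / len(per_cluster), len(per_cluster)


def best_threshold(partitions, genomic_sim, epi_sim, min_epi=0.7, min_clusters=1):
    """Pick the threshold with the highest genomic cohesion among those that
    keep multi-isolate clusters and an epi cohesion of at least min_epi."""
    best = None
    for threshold, labels in partitions.items():
        gen, n_multi = mean_cohesion(genomic_sim, labels)
        epi, _ = mean_cohesion(epi_sim, labels)
        if n_multi < min_clusters or epi < min_epi:
            continue
        if best is None or gen > best[1]:
            best = (threshold, gen, epi, n_multi)
    return best


# Toy data: genomic cluster labels for four isolates at three thresholds.
genomic_sim = [[1, .95, .6, .5], [.95, 1, .6, .5], [.6, .6, 1, .9], [.5, .5, .9, 1]]
epi_sim = [[1, .8, .3, .2], [.8, 1, .3, .2], [.3, .3, 1, .9], [.2, .2, .9, 1]]
partitions = {0.99: [0, 1, 2, 3], 0.85: [0, 0, 1, 1], 0.5: [0, 0, 0, 0]}
choice = best_threshold(partitions, genomic_sim, epi_sim)
```

In this toy case the strictest threshold yields only singletons and the loosest one merges epidemiologically unrelated isolates, so the middle threshold is selected.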

Epi vs. Genomic clustering: examining the outliers

Strains with similar epidemiology aren’t necessarily similar genomically (and vice-versa!)

By overlaying the two methods, we can identify clusters that group together significantly more strongly via either genomic or epidemiologic relationships

[Figure panels: “Epi-Clustering”, “Genomic-Clustering”]

White = high congruence

Green = stronger similarity via epi

Blue = stronger similarity via genotype
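The colour coding above can be mimicked with a simple pairwise comparison. This is a sketch only: it assumes the epidemiologic and genomic similarities are both scaled to [0, 1] so that their difference is meaningful, and the `tolerance` that separates "high congruence" from an outlier is a hypothetical parameter, not a value from the slides.

```python
def classify_pairs(epi_sim, genomic_sim, tolerance=0.2):
    """Label each isolate pair by which similarity is stronger.

    'white' = high congruence (difference within tolerance)
    'green' = stronger similarity via epi
    'blue'  = stronger similarity via genotype
    """
    n = len(epi_sim)
    labels = {}
    for i in range(n):
        for j in range(i + 1, n):
            diff = epi_sim[i][j] - genomic_sim[i][j]
            if diff > tolerance:
                labels[(i, j)] = "green"
            elif diff < -tolerance:
                labels[(i, j)] = "blue"
            else:
                labels[(i, j)] = "white"
    return labels


# Toy matrices for three isolates (illustrative values only).
epi_toy = [[1, .9, .2], [.9, 1, .8], [.2, .8, 1]]
gen_toy = [[1, .85, .7], [.85, 1, .3], [.7, .3, 1]]
pair_colours = classify_pairs(epi_toy, gen_toy)
```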

“Generalist genotype”

“Generalist source”

‘Generalist’ genotypes persist across many combinations of source, temporal and spatial parameters

‘Generalist’ reservoirs support the persistence of a broad range of genotypes

Epi vs. Genomic clustering: examining the outliers

Summary

We have developed a model to help guide our analysis of Campylobacter WGS data for practical public health purposes

Systematic examination of the relationship between the genomic and epidemiological similarity of sets of isolates → optimization of clustering for epidemiologic relevance

Calculate point of highest genomic-cohesion while maintaining
• High epidemiologic cohesion
• Multi-isolate clusters

Interactive web application under development (Check it out!) https://hetmanb.shinyapps.io/EpiQuant/

Acknowledgements

People
• Supervisors: Ed Taboada + Jim Thomas
• Lab: Steven Mutschall (PHAC), Peter Kruczkiewicz (PHAC), Dillon Barker (PHAC/ULeth)

Funding
• University of Lethbridge
• Public Health Agency of Canada A-base
• Gov’t of Canada: Genomics Research and Development Initiative