hetman immem xi final march 2016
TRANSCRIPT
IMMEM XI
Navigating Microbial Genomes: Insights from the Next Generation
9 – 12 March 2016, Estoril, Portugal
The EpiQuant framework for assessing genetic and epidemiologic concordance: towards improved use of genomic data in epidemiological applications.
Ben Hetman 1,2; Steven Mutschall 1; Vic Gannon 1; James Thomas 2; and Eduardo Taboada 1
1 National Microbiology Laboratory at Lethbridge, Public Health Agency of Canada, Lethbridge AB, Canada.
2 Department of Biological Sciences, University of Lethbridge, Lethbridge AB, Canada.
[Figure: values by enteric disease; starred values in parentheses are post-correction estimates]
Campylobacteriosis: 29.3 (*447)
Salmonellosis: 19.67 (*269)
Giardiasis: 11.12 (*24)
Shigellosis: 3.08 (*4)
Verotoxigenic E. coli (VTEC): 1.94 (*39)
Cryptosporidiosis: 1.56 (*7)
Listeriosis: 0.36 (*.55)
Cyclosporiasis: 0.32 (*7.5)
* Post-correction estimate
Sources: Thomas et al. (2013), doi:10.1089/fpd.2012.1389; FoodNet Canada Short Report 2013
Campylobacter is a public health challenge
#1 bacterial gastrointestinal disease in Canada and a leading foodborne pathogen worldwide (300-500 million cases)
Self-limiting illness, highly under-reported, largely sporadic
The epidemiology of campylobacteriosis is daunting
Source: Julie Arsenault (PhD Thesis), papyrus.bib.umontreal.ca/jspui/handle/1866/4625
Widespread in “farm-to-fork” and “source-to-tap”
• high prevalence in most major livestock species
• found in many wild animal species, insects, surface waters
Difficult to establish sources of exposure and routes of transmission
Crisis = Opportunity
WGS to the rescue!!!
Can rapidly generate different clusters of isolates at an almost unlimited number of thresholds
E.g.: Do groups formed by genomic relationships agree with those formed by epidemiologic relationships?
- OR -
What is the optimal threshold for forming clusters that agree with epidemiologic relationships?
Genomic data…so many options for thresholding
WGS based analyses still require knowledge of the epidemiology to guide clustering of genomic data into “epidemiologically relevant clusters”
“Those who make many species are the ‘splitters’ and those who make few are the ‘lumpers’…” – Charles Darwin (1857)
Clustering thresholds have been with us forever…
Need to calibrate our analyses to ensure our results exploit the high resolution of WGS data while remaining epidemiologically relevant
Building a model for quantifying epidemiological similarity
“Essentially, all models are wrong, but some are useful.”
George E.P. Box (1919-2013)
How to relate epidemiologic and genomic clustering?
1. Adjusted Wallace Coefficient (AWC): Carriço et al. (Comparing Partitions)
The directional likelihood that two isolates clustered together using one method will also be grouped together by the second method
[Figure: AWC illustrated with Strain 1 and Strain 2 placed in WGS clusters and in Epi clusters]
2. Intra-cluster cohesion (ICC):
A measure of the genomic and/or epidemiologic homogeneity of the isolates within a cluster
[Figure: examples of a High ICC cluster and a Low ICC cluster]
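A minimal sketch of the first metric, assuming two cluster assignments for the same ordered list of isolates. The function names and toy labels are illustrative and not taken from the slides; the chance correction follows the commonly published form of the adjusted Wallace coefficient (observed Wallace minus the agreement expected under independence, rescaled), which is assumed here rather than confirmed by the talk.

```python
from collections import Counter


def n_pairs(k):
    """Number of unordered pairs that can be drawn from k items."""
    return k * (k - 1) // 2


def wallace(labels_a, labels_b):
    """Wallace coefficient A -> B: of the isolate pairs sharing a cluster
    in partition A, the fraction that also share a cluster in partition B."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same isolates")
    pairs_in_a = sum(n_pairs(c) for c in Counter(labels_a).values())
    pairs_in_both = sum(
        n_pairs(c) for c in Counter(zip(labels_a, labels_b)).values()
    )
    if pairs_in_a == 0:
        raise ValueError("partition A has no multi-isolate clusters")
    return pairs_in_both / pairs_in_a


def adjusted_wallace(labels_a, labels_b):
    """Wallace A -> B corrected for the agreement expected by chance.
    The expected value is the probability that a random pair of isolates
    shares a cluster in partition B."""
    observed = wallace(labels_a, labels_b)
    expected = (
        sum(n_pairs(c) for c in Counter(labels_b).values())
        / n_pairs(len(labels_b))
    )
    if expected == 1:
        raise ValueError("partition B is a single cluster; AWC is undefined")
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six isolates (not data from the study).
wgs = ["g1", "g1", "g1", "g1", "g2", "g2"]
epi = ["e1", "e1", "e2", "e2", "e3", "e3"]

print("AWC WGS -> Epi:", adjusted_wallace(wgs, epi))
print("AWC Epi -> WGS:", adjusted_wallace(epi, wgs))
```

The two directions differ because the measure is directional: in the toy data every epi cluster sits inside one WGS cluster, but the larger WGS cluster spans two epi clusters.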
Comparing epidemiology vs. genomics
Need a model to assess strain to strain relationships based on isolate epidemiology so we can directly compare them against the WGS data
[Figure: analysis workflow]
Isolate Selection
Genomics Workflow: Sequencing, Assembly, Annotation, In-Silico Typing (MIST), Core Analysis
Epidemiology Workflow: Metadata Curation (Source, Location, Date), Quantify Epi-Similarities
Cluster Analysis & Analysis of concordance
The challenge with epidemiological data
[Figure labels: Source, Spatial, Temporal]
Surveillance data is inherently less comprehensive than outbreak data
Metadata is generally qualitative/categorical, not quantitative
[Figure labels: Source, Spatial, Temporal]
Establish a metric that summarizes the relationships between isolates based on basic epidemiologic metadata
Clustering of isolates based on epidemiological metadata
Our proposed approach: A model for quantifying epidemiological similarity between strains based on three primary factors: source, space, time
EpiSym = σ(source) + γ(geospatial) + τ(temporal)
σ = coefficient for Source
γ = coefficient for Geospatial
τ = coefficient for Temporal
Building a model for epidemiological similarity
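The weighted sum can be sketched as below. This is an illustration only: it assumes each of the three component similarities has already been scaled to the range 0 to 1 and that the coefficients sum to 1, neither of which is stated on the slide. The function name and the example values are hypothetical.

```python
def epi_sym(source, geospatial, temporal, sigma, gamma, tau):
    """EpiSym = sigma * source + gamma * geospatial + tau * temporal.

    Assumptions (not from the slides): every component similarity lies
    in [0, 1] and the three coefficients sum to 1, so the result also
    lies in [0, 1].
    """
    for name, value in (("source", source),
                        ("geospatial", geospatial),
                        ("temporal", temporal)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} similarity must be in [0, 1]")
    if abs(sigma + gamma + tau - 1.0) > 1e-9:
        raise ValueError("coefficients are assumed to sum to 1")
    return sigma * source + gamma * geospatial + tau * temporal


# Hypothetical pair of isolates: moderately similar sources, same
# location, sampled some time apart. The weights are made up.
score = epi_sym(source=0.658, geospatial=1.0, temporal=0.5,
                sigma=0.5, gamma=0.25, tau=0.25)
print(round(score, 3))
```

Keeping the coefficients as explicit arguments makes it easy to see how the clustering shifts when source, space or time is given more weight.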
Spatial = [equation not preserved]
Temporal = [equation not preserved]
Where
• dist_ab is given by the Haversine formula
• x, y = sampling dates
Quantifying epi-similarities: Spatial and Temporal
‘Spatial’ and ‘Temporal’ factors required for the EpiSym coefficient are relatively simple to build into the equation
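The raw inputs to these two factors can be sketched as follows: a great-circle distance from the Haversine formula and the number of days between two sampling dates. How the talk converts these into similarity scores is not recoverable from the slides, so no scaling is attempted here. The function names, the Earth radius constant and the example coordinates are assumptions made for illustration.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between points a and b, given in
    decimal degrees, using the Haversine formula."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lon_b - lon_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def days_between(x, y):
    """Absolute number of days between sampling dates x and y."""
    return abs((x - y).days)


# Hypothetical sampling sites and dates (not data from the study).
dist_ab = haversine_km(49.69, -112.84, 51.05, -114.07)
gap = days_between(date(2015, 6, 1), date(2015, 8, 15))
print(f"{dist_ab:.1f} km apart, {gap} days apart")
```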
• Identify all available sources
• Identify core epidemiological attributes
• Assess each source independently and completely for each attribute
• Score the pairwise similarity between any two sources based on their shared epidemiological attributes
Source
Quantifying Source-Source Similarities
Source similarity(i, j) = Σ *(i + j) / n
Where
• i, j = two sources being compared
• *(i + j) = number of matching attributes
• n = maximum possible score
EpiSym
An example: ‘faecal cow’ vs. ‘retail chicken’
Similarity = Σ Pairwise Matches / Maximum Possible Score = 12.5 / 19 = 0.658
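A rough sketch of this scoring, under assumptions that go beyond the slide: each source is scored between 0 and 1 on every attribute, and the match for an attribute is taken as 1 minus the absolute difference of the two scores, which allows half-point matches such as the 12.5 in the example. The attribute names and values below are hypothetical and are not the rubric used in the talk.

```python
def source_similarity(attrs_i, attrs_j):
    """Sum of per-attribute matches divided by the maximum possible score.

    Both sources must be scored on the same attributes. The match rule
    (1 - |a - b|) is an assumed stand-in for the talk's scoring scheme.
    """
    if attrs_i.keys() != attrs_j.keys():
        raise ValueError("sources must be assessed on the same attributes")
    matches = sum(1 - abs(attrs_i[k] - attrs_j[k]) for k in attrs_i)
    return matches / len(attrs_i)  # maximum possible score = one per attribute


# Hypothetical attribute scores for two sources.
faecal_cow = {"animal_origin": 1, "food_product": 0, "ruminant": 1, "farm_setting": 1}
retail_chicken = {"animal_origin": 1, "food_product": 1, "ruminant": 0, "farm_setting": 0.5}

print(source_similarity(faecal_cow, retail_chicken))

# The slide's own arithmetic: 12.5 matches out of a maximum of 19.
print(round(12.5 / 19, 3))
```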
Once source similarity is quantified, we can compute overall EpiSym
We can systematically compute EpiSym across large datasets → epi clusters → comparison to genomic clusters using cluster concordance metrics
[Figure: heatmap of epidemiological clustering, with isolates labelled Clinical, Animal or Environmental and clusters labelled A to P]
Major clusters based on source factor
Subclusters further refined by spatial and temporal factors
Results: epidemiological clustering of C. jejuni isolates
[Figure: the same clustering heatmap (Clinical, Animal, Environmental; clusters A to P) with regions numbered 1 to 5]
Clusters of secondary heat correspond to isolates with similar geography and temporal data, but different sources
Results: epidemiological clustering of C. jejuni isolates
Calibrating WGS typing for epidemiologic investigations
We can identify the clusters obtained at varying thresholds and compare them to epidemiological clusters to look for ‘best-fit’
An advantage of WGS is the flexibility in thresholding that is possible
Calibrating WGS typing for epidemiologic investigations
Genomic cluster homogeneity vs. Epidemiologic cluster homogeneity
Calculate point of highest genomic-cohesion while maintaining
• Multi-isolate clusters
• High epidemiologic validity
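One way to picture the calibration is a threshold sweep: cluster the isolates at each genomic distance threshold, then record how many multi-isolate clusters remain and how well the clusters agree with the epidemiological ones. The sketch below is not the procedure from the talk. It swaps in single-linkage clustering and the plain (unadjusted) Wallace coefficient as stand-ins, and the distance matrix, labels and thresholds are invented.

```python
from collections import Counter
from itertools import combinations


def single_linkage_clusters(dist, threshold):
    """Label isolates so that any two connected by a chain of pairwise
    distances <= threshold share a label (union-find)."""
    parent = list(range(len(dist)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(dist)), 2):
        if dist[i][j] <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(dist))]


def n_pairs(k):
    return k * (k - 1) // 2


def wallace(labels_a, labels_b):
    """Fraction of pairs co-clustered in A that are also co-clustered in B
    (returns 0.0 when A has no multi-isolate clusters)."""
    pairs_a = sum(n_pairs(c) for c in Counter(labels_a).values())
    both = sum(n_pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    return both / pairs_a if pairs_a else 0.0


def sweep(dist, epi_labels, thresholds):
    """Summarise genomic clustering at each threshold against epi clusters."""
    rows = []
    for t in thresholds:
        wgs = single_linkage_clusters(dist, t)
        rows.append({
            "threshold": t,
            "multi_isolate_clusters": sum(c > 1 for c in Counter(wgs).values()),
            "wgs_to_epi": wallace(wgs, epi_labels),
            "epi_to_wgs": wallace(epi_labels, wgs),
        })
    return rows


# Hypothetical pairwise genomic distances for four isolates.
dist = [[0, 2, 10, 12],
        [2, 0, 11, 13],
        [10, 11, 0, 3],
        [12, 13, 3, 0]]
epi = ["farm", "farm", "clinic", "clinic"]

for row in sweep(dist, epi, thresholds=[0, 5, 20]):
    print(row)
```

In the toy data the strictest threshold leaves only singletons, while the loosest one merges everything and loses agreement in the WGS-to-epi direction; the middle threshold keeps multi-isolate clusters that still match the epidemiology.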
Epi vs. Genomic clustering: examining the outliers
Strains with similar epidemiology aren’t necessarily similar genomically (and vice-versa!)
By overlaying the two methods, we can identify clusters that group together significantly more strongly via genomic or epidemiologic relationships
[Figure panels: “Epi-Clustering” and “Genomic-Clustering”]
White = high congruence
Green = stronger similarity via epi
Blue = stronger similarity via genotype
“Generalist genotype”
“Generalist source”
‘Generalist’ genotypes persist across many combinations of source, temporal and spatial parameters
‘Generalist’ reservoirs support the persistence of a broad range of genotypes
Epi vs. Genomic clustering: examining the outliers
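The overlay can be sketched as a cell-by-cell difference between an epidemiological similarity matrix and a genomic similarity matrix, coloured with the legend above. This is a simplified illustration: the fixed tolerance is a hypothetical stand-in for whatever significance criterion the authors applied, and the matrices are invented.

```python
def overlay(epi_sim, geno_sim, tolerance=0.1):
    """Classify each isolate pair by which similarity is stronger.

    white = high congruence (difference within the tolerance)
    green = stronger similarity via epi
    blue  = stronger similarity via genotype
    The tolerance value is an assumption, not taken from the talk.
    """
    n = len(epi_sim)
    colours = []
    for i in range(n):
        row = []
        for j in range(n):
            diff = epi_sim[i][j] - geno_sim[i][j]
            if abs(diff) <= tolerance:
                row.append("white")
            elif diff > 0:
                row.append("green")
            else:
                row.append("blue")
        colours.append(row)
    return colours


# Hypothetical similarity matrices for three isolates (values in 0..1).
epi_sim = [[1.0, 0.9, 0.2],
           [0.9, 1.0, 0.1],
           [0.2, 0.1, 1.0]]
geno_sim = [[1.0, 0.3, 0.25],
            [0.3, 1.0, 0.9],
            [0.25, 0.9, 1.0]]

for row in overlay(epi_sim, geno_sim):
    print(row)
```

Rows or columns dominated by blue would point to a genotype found across many epidemiological settings, and those dominated by green to a setting that harbours many genotypes.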
Summary
We have developed a model to help guide our analysis of Campylobacter WGS data for practical public health purposes
Systematic examination of the relationship between the genomic and epidemiological similarity of sets of isolates → optimization of clustering for epidemiologic relevance
Calculate point of highest genomic-cohesion while maintaining
• High epidemiologic cohesion
• Multi-isolate clusters
Interactive web application under development (Check it out!) https://hetmanb.shinyapps.io/EpiQuant/
Acknowledgements
People
• Supervisors: Ed Taboada + Jim Thomas
• Lab: Steven Mutschall (PHAC), Peter Kruczkiewicz (PHAC), Dillon Barker (PHAC/ULeth)
Funding
• University of Lethbridge
• Public Health Agency of Canada A-base
• Gov’t of Canada: Genomics Research and Development Initiative