dr marion wooldridge 26th july 2002
Post on 07-Jan-2016
24 Views
Preview:
DESCRIPTION
TRANSCRIPT
Dr Marion WooldridgeDr Marion Wooldridge
26th July 200226th July 2002
Data quality, combining Data quality, combining multiple data sources, and multiple data sources, and
distribution fittingdistribution fitting
Selected points and Selected points and illustrations!illustrations!
Outline of talkOutline of talk
Data qualityData qualitywhat do we need?what do we need?
Combining multiple data sourcesCombining multiple data sourceswhen can we do it?when can we do it?an example explainedan example explained
Fitting distributionsFitting distributionshow do we do it?how do we do it?an example explained an example explained
What do we mean by data?What do we mean by data?
Experimental resultsExperimental resultsnumericalnumerical
Field dataField datanumericalnumerical
Information on pathways/processesInformation on pathways/processeswords and numberswords and numbers
Expert opinionExpert opinionwords to convert to numbers?!words to convert to numbers?!
Why do we need data?Why do we need data?
Model constructionModel constructionrisk pathwaysrisk pathways
what actually happens?what actually happens? do batches get mixed?do batches get mixed? is the product heated? etc….is the product heated? etc….
Model inputsModel inputsestimate probabilitiesestimate probabilities
estimate uncertaintiesestimate uncertainties estimate variabilityestimate variability
Which data should we use?Which data should we use?
The best available!The best available!However…….However…….
a risk assessment is pragmatica risk assessment is pragmatica risk assessment is ‘iterative’a risk assessment is ‘iterative’there are always data deficienciesthere are always data deficiencies
these are often ‘crucial’these are often ‘crucial’
So what does this tell us?…..So what does this tell us?…..
What do we mean by pragmatic?What do we mean by pragmatic?
Purpose of risk assessment…..Purpose of risk assessment…..
To give useful info to risk manager To give useful info to risk manager to aid decisions, usually in short termto aid decisions, usually in short term
This leads to time constraintsThis leads to time constraintsThis generally means using This generally means using currentlycurrently available available
data for decisions data for decisions ‘now’‘now’ however incomplete or uncertain that datahowever incomplete or uncertain that data
What do we mean by ‘iterative’?What do we mean by ‘iterative’?
Often several stages, with different purposesOften several stages, with different purposes
‘‘Is there really a problem?’ Is there really a problem?’ rapid preliminary assessmentrapid preliminary assessment
‘‘OK, so give me the answer’OK, so give me the answer’ refine assessmentrefine assessment
‘‘Did we get it right?’Did we get it right?’ revisit assessment if/when new datarevisit assessment if/when new data
The The minimumminimum data quality required….. data quality required…..
will be different at each stagewill be different at each stageso - why is that?….so - why is that?….
‘‘Is there really a problem?’ Is there really a problem?’ Risk manager needs rapid answerRisk manager needs rapid answer
‘‘preliminary’ risk assessmentpreliminary’ risk assessmentidentify data rapidly availableidentify data rapidly available
may be incomplete, anecdotal, old, from a different country, about a may be incomplete, anecdotal, old, from a different country, about a different strain, from a different species, etc. etc……. different strain, from a different species, etc. etc…….
Still allows decision to be madeStill allows decision to be madea problem….. a problem…..
is highly likely - do more now!is highly likely - do more now! may develop - keep watching briefmay develop - keep watching brief is highly unlikely - but never zero risk!is highly unlikely - but never zero risk!
Stage 1: Conclusion...Stage 1: Conclusion...
Sometimes, poor quality data may be Sometimes, poor quality data may be useful datauseful data
is it ‘fit-for-purpose’is it ‘fit-for-purpose’
Don’t just throw it out!Don’t just throw it out!
‘‘OK, so give me the answer’OK, so give me the answer’
refine risk assessmentrefine risk assessmentset up modelset up model
data on risk pathwaysdata on risk pathways model input data model input data
utilise ‘best’ data found, identify ‘real’ gaps & uncertaintiesutilise ‘best’ data found, identify ‘real’ gaps & uncertainties elicit expert opinionselicit expert opinions
data should be ‘best’ currently availabledata should be ‘best’ currently availablewith time allowed to search all reasonable sourceswith time allowed to search all reasonable sources
still often incomplete and uncertainstill often incomplete and uncertain checked by peer reviewchecked by peer review
‘‘So the risk is…….’So the risk is…….’
The best currently available data gives….The best currently available data gives….the best currently available estimatethe best currently available estimate
Which is….Which is…. the best the risk manager can ever have ‘now’!the best the risk manager can ever have ‘now’!
Even if…Even if…it is still based on data gaps and uncertaintiesit is still based on data gaps and uncertainties
which it will be!which it will be! the ‘best’ data may still be poor datathe ‘best’ data may still be poor data allows targeted future data collection - may be lengthyallows targeted future data collection - may be lengthy
Stage 2: Conclusion….Stage 2: Conclusion….
If a choiceIf a choiceNeed to identify the best data availableNeed to identify the best data available
What makes it the best data?What makes it the best data?
What do we do if we can’t decide?What do we do if we can’t decide? Multiple data sources for model!Multiple data sources for model!
What do we do with crucial data gaps?What do we do with crucial data gaps?Often need expert opinionOften need expert opinion
How do we turn that into a number?How do we turn that into a number?
‘‘Did we get it right?’Did we get it right?’
revisit assessment if/when new data revisit assessment if/when new data availableavailable
targetted data collectiontargetted data collection may be years away….may be years away…. but allows quality to be specified but allows quality to be specified
not just ‘best’ but ‘good’not just ‘best’ but ‘good’ will minimise uncertaintywill minimise uncertainty should describe variabilityshould describe variability
What makes data ‘good’?What makes data ‘good’?
Having said all that - some data is ‘good’!Having said all that - some data is ‘good’!Two aspectsTwo aspects
Intrinsic features of the dataIntrinsic features of the data universally essential for high qualityuniversally essential for high quality
Applicability to current situationApplicability to current situation data selection affects quality of the assessment and outputdata selection affects quality of the assessment and output
General principles…... General principles…...
High quality data should be….High quality data should be….
CompleteComplete for assessment in handfor assessment in hand required level of detailrequired level of detail
RelevantRelevant e.g. right bug, country, management system, date etc. etc.e.g. right bug, country, management system, date etc. etc. nothing irrelevantnothing irrelevant
Succinct and transparentSuccinct and transparentFully referencedFully referencedPresented in a logical sequence Presented in a logical sequence
High quality data should have….High quality data should have….
Full information on data source or Full information on data source or provenanceprovenance
Full reference if paper/similarFull reference if paper/similarName if Name if pers. commpers. comm. or unpublished. or unpublishedCopy of web-page or other ephemeral sourceCopy of web-page or other ephemeral sourceDate of data collection/elicitationDate of data collection/elicitationAffiliation and/or funding source of providerAffiliation and/or funding source of provider
High quality data should have….High quality data should have….
A description of the level of uncertainty and the A description of the level of uncertainty and the amount of variabilityamount of variability
uncertainty = lack of knowledgeuncertainty = lack of knowledge variability = real life situationvariability = real life situation
UnitsUnits where appropriatewhere appropriate
Raw numerical dataRaw numerical data where appropriatewhere appropriate
High quality study data should High quality study data should have….have….Detailed information on study design: e.g.Detailed information on study design: e.g.
Experimental or field basedExperimental or field basedDetails of sample and sampling frame including:Details of sample and sampling frame including:
livestock species (inc. scientific name)/product definition/population sub-group livestock species (inc. scientific name)/product definition/population sub-group definitiondefinition
source (country, region, producer, retailer, etc.)source (country, region, producer, retailer, etc.) sample selection method (esp for infection: clinical cases or random selection) sample selection method (esp for infection: clinical cases or random selection) Population sizePopulation size
Season of collectionSeason of collectionPortion description or sizePortion description or sizeMethod of sample collectionMethod of sample collection
High quality microbiological data High quality microbiological data should have….should have…. Information on microbiological methods Information on microbiological methods
including:including:Pathogen species, subspecies, strainPathogen species, subspecies, strain
may include antibiotic resistance patternmay include antibiotic resistance pattern
tests used, including any variation from published tests used, including any variation from published methodsmethods
sensitivity and specificity of testssensitivity and specificity of testsunits usedunits usedprecision of measurementprecision of measurement
High quality numerical data High quality numerical data should have….should have….
Results as raw dataResults as raw data Including:Including:
Number testedNumber tested results given for all samples testedresults given for all samples tested for pathogens, number of micro-organisms for pathogens, number of micro-organisms
not just positive/negative. not just positive/negative.
A note on comparability….A note on comparability….
High quality data sets can easily be checked for High quality data sets can easily be checked for comparabilitycomparability
as they contain sufficient detailas they contain sufficient detail are they describing or measuring the same thing?are they describing or measuring the same thing? are the levels of uncertainty similar?are the levels of uncertainty similar?
Poor quality data sets are difficult to comparePoor quality data sets are difficult to compare lack detaillack detail
may never know if describing or measuring same thing!may never know if describing or measuring same thing! or may be exactly same data!!or may be exactly same data!! difficult/unwise to combine!difficult/unwise to combine!
and on homogeneity….. and on homogeneity…..
homogeneity of input data aids homogeneity of input data aids comparability of model outputcomparability of model output
homogeneity achieved by, e.g. homogeneity achieved by, e.g. standardised methodsstandardised methods
for sampling, testing etc.for sampling, testing etc.
standard unitsstandard unitsstandard definitionsstandard definitions
for pathogen, host, product, portion size, etc.for pathogen, host, product, portion size, etc.
Multiple data sources - why?Multiple data sources - why?
No data specific to the problem - but several studies No data specific to the problem - but several studies with some relevancewith some relevance
e.g. one old, one a different country etce.g. one old, one a different country etc
Several studies directly relevant to the problem - all Several studies directly relevant to the problem - all information needs inclusioninformation needs inclusion
e.g. regional prevalence studies available; need nationale.g. regional prevalence studies available; need national
Expert opinion - but a number of expertsExpert opinion - but a number of expertsand may need to combine with other dataand may need to combine with other data
Multiple data sources….how?Multiple data sources….how?
model all separatelymodel all separately ‘‘what-if’ scenarioswhat-if’ scenarios several outputsseveral outputs for data inputs and ‘model uncertainty’for data inputs and ‘model uncertainty’
fit single distributionfit single distribution point valuespoint values
weight alternativesweight alternatives equally or differentially? equally or differentially? expert opinion on weights?expert opinion on weights? sample by weight distribution - single outputsample by weight distribution - single output for data inputs and ‘model uncertainty’for data inputs and ‘model uncertainty’
Weighting exampleWeighting exampleOverall aim: Overall aim:
To estimate probability that a random bird from GB broiler poultry flock will To estimate probability that a random bird from GB broiler poultry flock will be campylobacter positive at point of slaughterbe campylobacter positive at point of slaughter
Current need: Current need: To estimate probability that a random flock is positiveTo estimate probability that a random flock is positive
Data available:Data available: Flock positive prevalence data from 4 separate studiesFlock positive prevalence data from 4 separate studies
Ref: Hartnett E, Kelly L, Newell D, Wooldridge M, Gettinby G (2001).Ref: Hartnett E, Kelly L, Newell D, Wooldridge M, Gettinby G (2001). The occurrence of campylobacter in broilers at the point of slaughter, a quantitative risk assessment. The occurrence of campylobacter in broilers at the point of slaughter, a quantitative risk assessment.
Epidemiology and Infection 127; 195-206.Epidemiology and Infection 127; 195-206.
Probability flock is positive, Probability flock is positive, PPfpfp, per , per
data setdata set Raw data usedRaw data used PPfpfp per data setper data set
‘‘standard’ methodstandard’ method Beta distributionBeta distribution
r = positive flocksr = positive flocks s = flocks sampleds = flocks sampled
PPfp fp = Beta (r+1,s-r+1)= Beta (r+1,s-r+1)
for P1, P2, P4, P4for P1, P2, P4, P4 describes uncertainty describes uncertainty
per data setper data set 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.2 0.4 0.6 0.8 1
Probability a flock is positive
f(x
)
p1fp
p2fp
p3fp
p4fp
Probability random GB flock is Probability random GB flock is positive - weighting method usedpositive - weighting method used
Data sets:Data sets: 2 x poultry companies2 x poultry companies 1 x colleagues (published) 1 x colleagues (published)
epidemiological studyepidemiological study 1 x another published study1 x another published study
Other data available:Other data available: Market share by total bird Market share by total bird
numbers for studies 1,2,3numbers for studies 1,2,3
Assumption made: Assumption made: study 4 was ‘rest of study 4 was ‘rest of
market’market’ approximationapproximation
Weighting method:Weighting method: uses all available uses all available
infoinfo better than ‘equal better than ‘equal
weighting’weighting’
So - weights assigned….So - weights assigned….
based on market share per studybased on market share per studyP1+P2 = 35%; w1+w2 = 0.35P1+P2 = 35%; w1+w2 = 0.35
combined here as confidential datacombined here as confidential data
P3=50%; w3 = 0.50P3=50%; w3 = 0.50P4=15%; w4 = 0.15P4=15%; w4 = 0.15
probability random flock positive, probability random flock positive, PPfpfp
)44()33()22()11( wPwPwPwPP fpfpfpfpfp
So - when to combine?So - when to combine?Depends on info neededDepends on info needed
in example, for random GB flockin example, for random GB flock must combinemust combine
but - loses infobut - loses info in example, no probability by sourcein example, no probability by source
Depends on info availableDepends on info available and assumptions on relevance/comparabilityand assumptions on relevance/comparability e.g: two studies 10 years old and 5 years old:e.g: two studies 10 years old and 5 years old:
if prevalence studies - leave separate? Trend info? ‘What-if’ scenarios better?if prevalence studies - leave separate? Trend info? ‘What-if’ scenarios better? if effects of heat on organism - will this have altered? Check method, detection if effects of heat on organism - will this have altered? Check method, detection
efficacy, strain etc - but age of study efficacy, strain etc - but age of study per seper se not relevant not relevant
Assessors judgementAssessors judgement
Fitting distributionsFitting distributions
Could fill a book (has done!)Could fill a book (has done!)Discrete or continuous?Discrete or continuous?
Limited values ‘v’ any value (within the range); e.g.Limited values ‘v’ any value (within the range); e.g. number of chickens in flock - discrete (1,2,3... etc)number of chickens in flock - discrete (1,2,3... etc) chicken bodyweight - continuous (within range min-max)chicken bodyweight - continuous (within range min-max)
Parametric or non-parametric?Parametric or non-parametric? Theoretically described ‘v’ directly from data; e.g. Theoretically described ‘v’ directly from data; e.g.
incubation period - theoretically lognormal - parametricincubation period - theoretically lognormal - parametric percentage free-range flocks by region - direct from datapercentage free-range flocks by region - direct from data
Truncated or not?Truncated or not? For unbounded distributions - e.g. incubation period > lifespan?For unbounded distributions - e.g. incubation period > lifespan?
Where do we start?Where do we start?Discrete or continuous?Discrete or continuous?
Inspect the dataInspect the data whole numbers = limited values = discretewhole numbers = limited values = discrete
Parametric or non-parametric?Parametric or non-parametric? Use parametric with care! - usually if:Use parametric with care! - usually if:
theoretical basis known and appropriatetheoretical basis known and appropriate has proved to be accurate even if theory undemonstratedhas proved to be accurate even if theory undemonstrated parameters define distributionparameters define distribution
non-parametric non-parametric generally more useful/appropriategenerally more useful/appropriate
For biological process dataFor biological process data for selected distribution - biological plausibility (at least!)for selected distribution - biological plausibility (at least!)
gross error check…..gross error check…..
Distribution fitting example...Distribution fitting example...
Aim: Aim: To estimate, for a random GB broiler flock at slaughter, its To estimate, for a random GB broiler flock at slaughter, its
probable age probable age
Data available:Data available: Age at slaughter, in weeks, for a large number of GB broiler flocksAge at slaughter, in weeks, for a large number of GB broiler flocks
Ref: Hartnett E (2001).Ref: Hartnett E (2001). Human infection with Campylobacter spp. from chicken consumption: A quantitative risk assessment. Human infection with Campylobacter spp. from chicken consumption: A quantitative risk assessment.
PhD Thesis, Department of Statistics and Modelling Science, University of Strathclyde. 322 pages. PhD Thesis, Department of Statistics and Modelling Science, University of Strathclyde. 322 pages. Funded by, and work undertaken in, The Risk Research Department, The Veterinary Laboratories Funded by, and work undertaken in, The Risk Research Department, The Veterinary Laboratories
Agency, UK. Agency, UK.
Discrete or continuous?Discrete or continuous?Inspect data…age given in whole weeks - but:Inspect data…age given in whole weeks - but:
no reason why slaughter should be at specific weekly intervalsno reason why slaughter should be at specific weekly intervals time is a time is a continuouscontinuous variable variable
Parametric or non-parametric?Parametric or non-parametric?No theoretical basisNo theoretical basis
non-parametric more appropriatenon-parametric more appropriate
Other considerationsOther considerationsslaughter for meatslaughter for meat
shouldn’t be too young (too small) or too old (industry economics)shouldn’t be too young (too small) or too old (industry economics) suggests bounded distributionsuggests bounded distribution
Sequence of steps: 1. Decisions….Sequence of steps: 1. Decisions….
2. What does the distribution look like? 2. What does the distribution look like? Scatter plot drawn…..Scatter plot drawn…..
Suggests triangular ‘appearance’Suggests triangular ‘appearance’
0
2
4
6
8
10
12
25 35 45 55 65 75
Age at slaughter
Fre
qu
en
cy
3. Is this appropriate?3. Is this appropriate?
Triangular distribution is..Triangular distribution is..continuouscontinuousnon-parametricnon-parametricboundedbounded
Sounds good so far!Sounds good so far!‘‘checked’ in BestFit (Palisade)checked’ in BestFit (Palisade)
only AFTER logic consideredonly AFTER logic considered gave Triang as best fit gave Triang as best fit
Comparison of Input Distribution and Triangulardistribution
0.00
0.04
0.08
28.0 35.2 42.4 49.6 56.8 64.0
Input
Triang
4. Check: Rank 1 using Kolmogorov-Smirnov test in 4. Check: Rank 1 using Kolmogorov-Smirnov test in BestFitBestFit
Conclusion…..Conclusion…..
Use Triang in model!Use Triang in model!And it was…. And it was….
Note: This was also an example of combining (point value) data from multiple sources by fitting a single distribution Note: This was also an example of combining (point value) data from multiple sources by fitting a single distribution
Summary….Summary….
Acceptable data quality Acceptable data quality requires judgementrequires judgement
Multiple data sets - to combine or not?Multiple data sets - to combine or not? requires judgementrequires judgement
Distribution - the most appropriate?Distribution - the most appropriate? requires judgement requires judgement
Risk assessment is art plus scienceRisk assessment is art plus science assessor frequently uses judgement assessor frequently uses judgement and makes assumptionsand makes assumptions
Only safeguard: transparency!Only safeguard: transparency!
top related