Text Mining in Animal Health Surveillance
John Berezowski, Clarissa Snyder, Lindsay McLarty
Food Safety Division, Alberta Agriculture, Food and Rural Development
Text Mining In Public Health
• Knowledge management
– Classification of journal articles to manage and search databases
– Classification of hospital records to allow data mining of hospital databases to discover knowledge
• Classification of medical records for real-time surveillance
– Free-text emergency room chief complaints classified into syndromes, e.g. GI or influenza-like
Purpose
• Canada-Alberta BSE Surveillance Program (CABSESP)
– Alberta veterinarians participate in BSE surveillance
– Submit cattle samples for BSE testing (dead or euthanized)
– Examine cattle prior to sampling
– Provide data about farmers and animals tested
• Purpose: maximize information about cattle tested
– Especially why cattle were sick/dead/sampled
– Assist CFIA to identify “Clinical Suspects”
Purpose
• Large sample (July 04 - July 06)
– 35,720 Alberta cattle tested by AAFRD
• Another 25,000 (+/-) tested by the CFIA
– 9,117 farms
– 141 veterinary clinics (293 veterinarians)
• Purpose: evaluate utility of BSE submission form data for other surveillance purposes
Submission Form Data
• Farmer ID, date, location, number on farm
• Purebred (y/n), breed, age, sex, BCS, PM (y/n)
• Diseased, Distressed, Down, Dead, Neuro
• Clinical signs in free text format
• Presumptive diagnosis in free text format
Example Submission
• Clinical Signs: Cow was in dry lot. Went off feed, coughing and labored breathing
• Presumptive Diagnosis: PM findings- traumatic pericarditis and abscess from hardware between reticulum and diaphragm
• Need tools (Text Mining) to extract information from free text fields
Text Mining: Definition
• Based on data mining definitions
• Knowledge discovery in text
• Semi-automated or automated discovery of trends and patterns across large volumes of text
• Computer applications that aim to aid in making sense of large volumes of text
Text Mining: Our Context
• Classify cattle with respect to certain concepts:
– Etiologies: Johne’s, AIP, hepatic lipidosis, LDA, IBR, unknown, etc.
– Descriptors: acute, chronic, emaciated, lame, autolyzed, blind, ataxic, etc.
– Clinical presentation (syndromes): respiratory, GI, repro, etc.
• Use classifications to better describe the cattle sampled and look for associations or trends within the samples
Named Entity Recognition
1. Identify terms in text
– Term = textual representation of a concept
2. Classify terms
– Noun vs verb vs adjective, preposition, etc.
– Etiology vs descriptors: animal (pregnant) vs clinical sign (chronic)
3. Map terms to concepts in an ontology
– Associate each term with one or more concepts
– Example: the terms “bleeding”, “hemorrhage” and “bled” all map to the concept of hemorrhage
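A minimal sketch of step 3 (mapping terms to a concept), assuming a plain lookup table. Only the three hemorrhage terms come from the slide; the names TERM_TO_CONCEPT and find_concepts are hypothetical and are not part of the software used in this work.

```python
import re

# Hypothetical term-to-concept table; the three terms are the ones on the slide.
TERM_TO_CONCEPT = {
    "bleeding": "hemorrhage",
    "hemorrhage": "hemorrhage",
    "bled": "hemorrhage",
}


def find_concepts(text):
    """Return the set of concepts whose terms occur in a free-text field."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT}
```

For example, find_concepts("Cow bled after calving") returns a set holding the single concept "hemorrhage", whichever of the three terms appears in the text.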
Problems With Our Data
• No suitable ontology
• What’s an ontology?
– A model that links concept labels to their textual representations and defines or describes the relationships between concepts
– Machine readable descriptions of concepts and their relationships
– Examples: Dictionaries, SNOMED-SNOVET
Problems With Our Data
• Terms are a mix of formal (vet/med) and unusual
– “Nephritis”, “peritonitis”, “cancer eye”, “lump jaw”, “corkscrew claw”, “downer”, “fatty liver”, “hardware”, “found dead”
• Specific to food animal practitioners.
Problems With Our Data
• Term Variation
– A single concept is expressed in a number of different ways (synonyms)
– Probability of two experts using the same term to refer to the same concept is less than 20% [1]
– Arthritis: arthritis, arthritic, osteoarthritis, polyarthritis, septic-arthritis
• [1] Grefenstette G. 1994
Problems With Our Data
• Term Ambiguity
– The same term is used to refer to multiple concepts
– Multiple meanings for the same term
– Bloated = nutritional (feedlot, pasture), or bloated abdomen (perforated ulcer)
– Prolapse = vagina, uterus, rectum, vaginal fat, intestinal
Problems With Our Data
• No sentence structure
– “Old age, arthritis, no teeth”
– “Stifle, bilateral, degenerative, arthritis”
– “Pelvic injury, post calving, crippled”
– “Down, tumor on R shoulder, losing condition”
Build Our Ontology
• From the text fields on the submission forms
• Designed to meet our classification needs
– Identify potential “Clinical Suspects”
– Classify BSE submissions into clinical syndromes
Clinical Suspect
[Flowchart: Alive, Refractory to Treatment, Progressive Neuro Signs OR Progressive Behavior Change, Rule Outs (No) and Over 30 Months (Yes) lead to Clinical Suspect]
Clinical Suspect = [Alive] AND [(Refractory to tx) AND (Progressive Behavior Change OR Progressive Neuro Change) AND (No Rule Outs) AND (Over 30 months of age)]
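The Clinical Suspect rule on this slide can be sketched as a boolean function. This is an illustration only: the Submission record and its field names are hypothetical, and it assumes each criterion has already been decided for a submission.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical field names, one per criterion in the rule on the slide.
    alive: bool
    refractory_to_tx: bool
    progressive_behavior_change: bool
    progressive_neuro_change: bool
    has_rule_out: bool
    age_months: int


def is_clinical_suspect(s: Submission) -> bool:
    """Apply the slide's rule: every AND term must hold, with one OR inside."""
    return (
        s.alive
        and s.refractory_to_tx
        and (s.progressive_behavior_change or s.progressive_neuro_change)
        and not s.has_rule_out
        and s.age_months > 30
    )
```

A single rule-out, or an age of 30 months or less, is enough to exclude a submission regardless of the other criteria.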
Ontology
• Chronic (refractory to Tx)
• Neurologic
• Behavioral
• Rule outs
– Lame
– Cardiovascular
– GI
– Repro
– Respiratory
– Urologic
– Skin/Ocular/Mammary
– Sudden Death
– Infectious Dz
– Edema/Swelling/Neoplasia
– Trauma
– Anorexia/Wt loss
Method
• Text Mining Software
– “WordStat” and “SimStat” (Provalis Research, Quebec City, PQ)
• Spell checked text fields
• Identified all words in the text fields
– 292,537 words in total, 7,266 unique
• Manually sorted words into ontology categories
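The word inventory step (total and unique words) can be approximated in a few lines. This is a sketch under assumptions: the tokenizing rule (runs of letters, lower-cased) is a guess at what counts as a "word", and the two sample fields are taken from examples elsewhere in these slides rather than from the full data set.

```python
import re
from collections import Counter


def word_counts(fields):
    """Count every word across a list of free-text fields."""
    counts = Counter()
    for text in fields:
        # Assumption: a word is a run of letters, compared case-insensitively.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


fields = [
    "Cow was in dry lot. Went off feed, coughing and labored breathing",
    "Old age, arthritis, no teeth",
]
counts = word_counts(fields)
total_words = sum(counts.values())   # all word occurrences
unique_words = len(counts)           # distinct words to sort into categories
```

The unique word list is what would then be sorted by hand into ontology categories.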
Chronic
• ADVANCED, CHONIC, CHRINIC, CHRONCI, CRONIC, D*BILIT*, DAYS_AGO
• DOWNHIL*, DURATION, AWHILE, POOR_DOER, DECLIN*, EMACIAT*
Neurological
• Ataxia
• Neurological
• Paresis/Paralysis
• Hyperesthesia
• Hypermetria
• Locomotor deficits
Neurological
• Ataxia
– *ATAX*, AT*XIA, AT*XIC, ATACHIA, ATAXIA, TAXIA, etc.
• CNS
– CN*, MENINGITIS, MENINGOMA, etc.
• Neurological
– CONVULS*, HEAD_PRESS*, HEPATOENCEPHALOPATHY, N*URO*, NEUR*, etc.
• Paresis/Paralysis
– PARLAYSIS, PARLYSIS, PARYALYZED, PARAPARESIS, PAREISIS, PARES*, PARETIC, etc.
Behavioral
• Behavioral
– *EHAV*, APPREHENS*, AVOID*, BALKING, BAWLING, BELIGER*, BELLIGER, BELLOW*, BIZARRE, COMPULSIVELY, CRAZY, DELIROUS, etc.
• Hyperexcitable
– ANXIETY, ANXIOUS, CHARG*, CHASE*, EXCITEABLE, HYPERALERT, HYPEREXC*, HYPEREXCITABLE, HYPERSENSITIV*, IRRITA*, etc.
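The dictionary entries above mix exact spellings, misspellings and wildcard patterns. A minimal sketch of matching one word against such entries follows, assuming "*" stands for any run of characters (the behaviour of Python's fnmatch, used here as a stand-in for the dictionary matching in WordStat). The DICTIONARY excerpt uses patterns from the slides; the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Excerpt of the category patterns listed on the slides.
DICTIONARY = {
    "Ataxia": ["*ATAX*", "AT*XIA", "AT*XIC", "ATACHIA", "TAXIA"],
    "Paresis/Paralysis": ["PARLAYSIS", "PARLYSIS", "PARES*", "PARETIC"],
    "Hyperexcitable": ["ANXIOUS", "CHARG*", "HYPEREXC*", "IRRITA*"],
}


def categories_for(word):
    """Return every category with at least one pattern matching the word."""
    token = word.upper()
    return [
        category
        for category, patterns in DICTIONARY.items()
        if any(fnmatchcase(token, pattern) for pattern in patterns)
    ]
```

Listing misspellings such as PARLYSIS as their own entries lets the dictionary catch errors that spell checking missed, while wildcards such as PARES* cover inflected forms.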
Classifying Submissions
• Cow was in dry lot. Went off feed, coughing and labored breathing
– Classified as: Respiratory, Anorexia
Classifying Submissions
• PM findings- traumatic pericarditis and abscess from hardware between reticulum and diaphragm
– Classified as: GI, Trauma, Cardiovascular
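A sketch of how a whole free-text field might be assigned to several syndromes at once. The patterns in SYNDROME_PATTERNS are hypothetical, chosen only so that the two example texts classify as shown on these slides; the actual dictionary entries for these categories are not given in the presentation.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical patterns; not the real dictionary entries.
SYNDROME_PATTERNS = {
    "Respiratory": ["COUGH*", "BREATH*"],
    "Anorexia": ["OFF_FEED", "ANOREX*"],
    "GI": ["RETICUL*", "HARDWARE"],
    "Trauma": ["TRAUMA*"],
    "Cardiovascular": ["PERICARD*"],
}


def classify(text):
    """Return the set of syndromes with a pattern matching any word or phrase."""
    words = re.findall(r"[A-Z]+", text.upper())
    # Two-word phrases joined by "_" mimic entries such as POOR_DOER.
    tokens = words + [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return {
        syndrome
        for syndrome, patterns in SYNDROME_PATTERNS.items()
        if any(fnmatchcase(t, p) for t in tokens for p in patterns)
    }
```

Because matching is per word or phrase, the lack of sentence structure in the fields does not matter, and one submission can fall into several categories.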
Classified Submissions (N = 35,721)
Category                    No. Submissions
Chronic                     20,698
Edema/Swelling/Neoplasia    4,583
Cardiovascular              1,537
GI                          8,335
Respiratory                 2,377
Trauma                      3,708
Clinical Suspects
• Total submissions: 35,721
• Neuro + Behavior: 4,583
• Rule outs: 4,010
• Clinical suspects: 573
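The counts on this slide fit together as a simple subtraction, sketched below. It assumes the rule-out count refers to submissions within the Neuro + Behavior group, which the slide implies but does not state.

```python
total_submissions = 35_721
neuro_or_behavior = 4_583   # submissions with neuro or behavior terms
rule_outs = 4_010           # assumed to be a subset of the group above

# Suspects are what remains after removing submissions with a rule-out.
clinical_suspects = neuro_or_behavior - rule_outs
suspect_share = clinical_suspects / total_submissions

print(clinical_suspects, f"{suspect_share:.1%}")
```

Under that assumption the remainder equals the 573 clinical suspects reported, roughly 1.6% of all submissions.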
Clinical Suspect Examples
Presumptive Diagnosis | Clinical Signs
Unknown | Wt loss, diffuse tremors, extended neck
Non-responsive hypomagnesaemia? | Depression, ataxia, recumbency
NVL - hypocalcaemia? | Wobbly and dead
Undifferentiated, old age, possible liver failure | Apparent blind, incoordination, anorexia
PM - no significant findings | Does not react to sound, losing weight, mild intermittent ataxia hind and front limbs, head shy progressive
Unknown - neurological disorder? | Agitation, head shyness
Neurological disorder? Untypical behaviour | Nervousness, uncooperative, charging people, going through fences
Veterinary Practice Surveillance
• Veterinary Practice Surveillance (VPS)
– Cattle practitioners submit data about cattle to AAFRD daily via a restricted-access website
– Practitioners classify sick cattle by commodity (cow-calf, dairy, etc.), age and syndrome (12)
• Large sample
– 26,016 submissions (Aug 05 – Dec 06)
– 5,081 farms
– 31 veterinary clinics
Submissions per day
[Figure: number of submissions per day by date, VPS and BSE series, Sept 2005 to July 2006]
Respiratory Syndrome
VPS = Cattle greater than 30 months of age
[Figure: proportion of submissions by month (1 to 12), VPS and BSE series]
Clostridium hemolyticum
VPS = 75 cases, BSE = 157 cases
[Figure: number of cases by month (1 to 12), VPS and BSE series]
Utility?
• Classifying/identifying “High Risk”
• Generalize with caution (no prevalence)
– Sampling bias
– Misclassification
• For each classification estimate:
– Se and Sp of veterinarians
– Se and Sp of text classifier
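Sensitivity (Se) and specificity (Sp) for one classification can be estimated from a 2x2 comparison against a reference standard. The sketch below uses the standard definitions; the counts passed in are invented for illustration and are not results from this work.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (Se, Sp) from true/false positive and negative counts."""
    se = tp / (tp + fn)   # share of truly positive cases that were classified positive
    sp = tn / (tn + fp)   # share of truly negative cases that were classified negative
    return se, sp


# Invented counts: e.g. 200 submissions reviewed by hand for one syndrome.
se, sp = sensitivity_specificity(tp=45, fp=10, fn=5, tn=140)
```

The same calculation would be repeated per category, once for the veterinarians' own classification and once for the text classifier.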
Utility?
• But:
– Large sample
• Disease importance or trends over time and space
• Clostridium hemolyticum
– Events: syndromic, unknown, emerging
• Establish normal patterns to identify unusual events
• Respond/investigate
• Access for targeted surveillance
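One simple way to "establish normal patterns to identify unusual events" is to compare each period's count with a baseline from the preceding periods. The sketch below flags counts above the baseline mean plus two standard deviations. This threshold rule, the function name and the monthly counts are all assumptions for illustration; the presentation does not specify a detection method.

```python
from statistics import mean, stdev


def unusual_periods(counts, baseline_len=6, z=2.0):
    """Return indexes whose count exceeds mean + z * SD of the prior periods."""
    flagged = []
    for i in range(baseline_len, len(counts)):
        baseline = counts[i - baseline_len:i]
        threshold = mean(baseline) + z * stdev(baseline)
        if counts[i] > threshold:
            flagged.append(i)
    return flagged


# Invented monthly case counts for one syndrome.
monthly_cases = [4, 6, 5, 7, 5, 6, 5, 18, 6]
flagged = unusual_periods(monthly_cases)
```

A flagged period would be a prompt to respond and investigate, not a confirmed event, since sampling bias and misclassification also move the counts.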
Questions?
• Our Team:
– Clarissa Snyder
– Lindsay McLarty
– John Berezowski
• Contact us: [email protected]