Data Privacy in Biomedicine
Lecture 7: More Scrubbing
Bradley Malin, PhD ([email protected])
Professor of Biomedical Informatics, Biostatistics, & Computer Science
Vanderbilt University
February 5, 2020
© 2020 Bradley Malin 2Data Privacy in Biomedicine: Lecture 7 – Scrubbing
Today’s Lecture
◼ Dictionaries and Rules
Concept Match [B2W]
MedLEE [B2W]
Lexicon [W2B]
◼ Machine Learning and Trained Systems
◼ Resynthesis
Concept Match (Berman ’03)
◼ Clinical Concept Dictionary-based approach
◼ If word in dictionary, then it remains in
document, otherwise removed
◼ Each retained concept is swapped for “synonym”
◼ Retains high frequency stop words
Berman JJ. Concept-match medical data scrubbing: how pathology text can be used in research. Archives of
Pathology and Laboratory Medicine. 2003; 127(6): 680-686.
Matching Process (Berman ‘03)
1. Parse all input into sentences
2. Parse each sentence into words
3. Each stop word (high-frequency) is preserved in
original place
E.g., “the”, “a”, “of”
4. Map remaining words / phrases to standard
nomenclature (e.g., UMLS)
Large terms subsume smaller substrings
5. Replace by alternate term mapping to same concept
code
e.g. “renal cell carcinoma” → C0007134 → “rcc” or
“hypernephroma”
6. Non-mapped words are blocked out
Berman JJ. Concept-match medical data scrubbing: how pathology text can be used in research. Archives of
Pathology and Laboratory Medicine. 2003; 127(6): 680-686.
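The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not Berman's implementation: the stop-word list, the tiny phrase dictionary, the "TOY0001" code, and the "*" block character are all assumptions made for the example (only C0007134 and its alternates come from the slide), and a real system would load a full nomenclature such as UMLS.

```python
import re

# Hypothetical toy resources (assumptions for illustration only).
STOP_WORDS = {"the", "a", "of", "with", "was", "in"}
PHRASE_TO_CODE = {
    "renal cell carcinoma": "C0007134",  # code taken from the slide
    "carcinoma": "TOY0001",              # made-up placeholder, not a real UMLS code
}
CODE_TO_ALTERNATE = {
    "C0007134": "hypernephroma",
    "TOY0001": "malignant epithelial tumor",
}
BLOCK = "*"
MAX_PHRASE_LEN = max(len(p.split()) for p in PHRASE_TO_CODE)


def scrub_sentence(sentence: str) -> str:
    """Steps 2-6: keep stop words, swap mapped phrases, block everything else."""
    words = re.findall(r"[A-Za-z0-9']+", sentence.lower())
    out = []
    i = 0
    while i < len(words):
        if words[i] in STOP_WORDS:
            out.append(words[i])
            i += 1
            continue
        # Try the longest phrase first so larger terms subsume smaller substrings.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            code = PHRASE_TO_CODE.get(" ".join(words[i:i + n]))
            if code is not None:
                out.append(CODE_TO_ALTERNATE[code])
                i += n
                break
        else:
            # No dictionary entry: the word is blocked out.
            out.append(BLOCK)
            i += 1
    return " ".join(out)


def scrub(text: str) -> list:
    """Step 1: split into sentences (naively, on end punctuation), then scrub each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [scrub_sentence(s) for s in sentences]
```

Note that the patient name is removed only because it is absent from the dictionary; nothing in the procedure recognizes names as such.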
UMLS
◼ Unified Medical Language System metathesaurus
◼ Very large / multi-purpose / multi-lingual vocabulary of
biomedical and health related concepts
◼ Over 100 different sources
International Classification of Diseases (ICD)
Current Procedural Terminology (CPT)
◼ Over 2 million medical terms
◼ Over 900,000 medical concepts
https://www.nlm.nih.gov/research/umls/
Concept Matching Sample (Berman ‘03)
[Two sample slides; the figures are not reproduced in this transcript]
Match Limits (Berman ‘03)
◼ There are some limitations
Misspelled terms are automatically dropped
Synonym replacement may obscure semantic meaning
Does not handle ambiguous terms
Terms in dictionaries may be sensitive (e.g.,
“homicide”, “abuse”, ...)
Today’s Lecture
◼ Dictionaries and Rules
Concept Match [B2W]
MedLEE [B2W]
Lexicon [W2B]
◼ Machine Learning and Trained Systems
◼ Resynthesis
Another Variation on Concepts(Morrison ‘09)
◼ MedLEE - Medical Language Extraction & Encoding System
http://www.medlingmap.org/taxonomy/term/80
◼ 100 Clinical follow-up notes (PHI annotated by human)
F. Morrison et al. Repurposing the clinical record: can an existing natural language processing system de-
identify clinical notes. Journal of the American Medical Informatics Association. 2009; 16: 37-39.
PHI Type     Instances of PHI   Instances in Output   % Leaked
Age > 89     7                  5                     71.4%
Clinician    157                6                     3.8%
Date         300                0                     0%
Hospital     100                7                     7%
Location     45                 3                     6.7%
Patient      126                4                     3.2%
Telephone    33                 1                     3.0%
ID’s         41                 0                     0%
Total        809                26                    3.2%
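The "% Leaked" column is simply instances left in the output divided by annotated instances. A short sketch that recomputes it from the two count columns is a useful consistency check on each row; the dictionary layout and function name below are illustrative choices, while the counts are copied from the table.

```python
# (instances of PHI, instances remaining in the output), copied from the table.
counts = {
    "Age > 89": (7, 5),
    "Clinician": (157, 6),
    "Date": (300, 0),
    "Hospital": (100, 7),
    "Location": (45, 3),
    "Patient": (126, 4),
    "Telephone": (33, 1),
    "ID's": (41, 0),
}


def leak_rate(total: int, leaked: int) -> float:
    """Percentage of annotated PHI instances that survive in the output."""
    return 100.0 * leaked / total


rates = {phi: round(leak_rate(*pair), 1) for phi, pair in counts.items()}
total_instances = sum(total for total, _ in counts.values())
total_leaked = sum(leaked for _, leaked in counts.values())
overall = round(leak_rate(total_instances, total_leaked), 1)

for phi, rate in rates.items():
    print(f"{phi:<10} {rate:5.1f}%")
print(f"{'Total':<10} {overall:5.1f}%")
```

The column sums reproduce the Total row (809 instances, 26 leaked), so the per-type counts are internally consistent.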
Another Variation on Concepts(Morrison ‘09)
◼ Leaked location:
“st” (meant Street or Saint hospital name) was
interpreted as part of EKG
◼ Examples of leaked names
Colors: “Green” and “Brown”
Common English: “Rose”
Disease Names: “Dias” vs. “Dias Disease”
F. Morrison et al. Repurposing the clinical record: can an existing natural language processing system de-
identify clinical notes. Journal of the American Medical Informatics Association. 2009; 16: 37-39.
Today’s Lecture
◼ Dictionaries and Rules
Concept Match [B2W]
MedLEE [B2W]
Lexicon [W2B]
◼ Machine Learning and Trained Systems
◼ Resynthesis
Semantic Lexicon (Ruch et al ‘00)
◼ Identifier discovery modeled as knowledge extraction problem
◼ Exhaustively list patterns of names and numbers
“IDentity Marker” (IDM) [Sir, Mr., …] followed by terms
IDM [maybe MD]
◼ think regular expression
Expressions specified for dates, phone numbers (nnn-nnnn), …
◼ Replaces names with “X” and terms with “x”
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
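To make the "think regular expression" remark concrete, here is a minimal sketch of the pattern idea: an identity marker followed by capitalized terms, plus the nnn-nnnn phone pattern. The marker list is a small assumed subset, the "[PHONE]" placeholder is an invented convention for this example, and none of this is the authors' actual rule set.

```python
import re

# Hypothetical, deliberately small marker list; the real lexicon is far larger.
IDENTITY_MARKERS = ["Sir", "Mr", "Mrs", "Ms", "Dr", "Prof"]

# Group 1: identity marker, optional period, whitespace.
# Group 2: one or more capitalized terms that follow the marker.
IDM_NAME = re.compile(
    r"\b((?:" + "|".join(IDENTITY_MARKERS) + r")\.?\s+)"
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)

# Seven-digit phone numbers written as nnn-nnnn.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def scrub(text: str) -> str:
    """Keep the identity marker, replace the name after it with 'X', mask phones."""
    text = IDM_NAME.sub(r"\1X", text)
    return PHONE.sub("[PHONE]", text)


print(scrub("Dr. Smith saw Mr John Doe; call 555-1234."))
```

A pattern list like this has to anticipate every surface form in advance, which is why the approach requires exhaustive specification.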
Semantic Lexicon(Ruch et al ‘00)
◼ Performs word-sense disambiguation via morpho-
syntactic tagger (MS) then rule-based word sense (WS)
Rules ranked by “reliability”
◼ Added terms to MEDTAG Lexicon* (based on UMLS –
contains 5131 entries), for identifier detection
such as those focused on medical institutions, list of drugs,
medical device names
◼ Detects if potential identifier term is followed by actual
identifier
e.g., “Doctors observed” as opposed to “Doctors Smith and
Johnson observed”
*P. Ruch et al. MEDTAG: tag-like semantics for medical document indexing. Proc AMIA Symp. 1999.
Semantic Disambiguation (Ruch et al ‘00)
◼ Morpho-syntactic tagger (MS) makes the part-of-speech explicit
Rule (Tok level → MS level): v/cn {TOK:miss} ; np cn;pn
Given a word that is ambiguous between a v (verb) and cn (common noun)
Example: Little Miss Tuffet vs. I miss the diagnosis
If the word “miss” is followed by a pn (proper noun), then tag as cn; otherwise tag as v
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
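The "miss" rule can be read as a context-sensitive tag choice. The sketch below assumes a toy lexicon of candidate tags and assumes that the fallback is v (verb), since the word's ambiguity class is v/cn; the tag names other than v, cn, and pn, and the function name, are hypothetical.

```python
# Toy lexicon: token -> candidate part-of-speech tags (hypothetical entries).
LEXICON = {
    "little": {"adj"},
    "miss": {"v", "cn"},
    "tuffet": {"pn"},
    "i": {"pron"},
    "the": {"det"},
    "diagnosis": {"cn"},
}


def tag_tokens(tokens):
    """Resolve the v/cn ambiguity of 'miss' from the tag of the next token."""
    tags = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token.lower(), {"unk"})
        if token.lower() == "miss" and candidates == {"v", "cn"}:
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
            next_is_pn = nxt is not None and "pn" in LEXICON.get(nxt, set())
            # Followed by a proper noun -> common noun (a title); else verb (assumed).
            tags.append("cn" if next_is_pn else "v")
        else:
            # Unambiguous in this toy lexicon: take the single candidate.
            tags.append(sorted(candidates)[0])
    return tags


print(tag_tokens(["Little", "Miss", "Tuffet"]))
print(tag_tokens(["I", "miss", "the", "diagnosis"]))
```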
Semantic Disambiguation (Ruch et al ‘00)
◼ Rule-based word sense (WS) leverages previous round of disambiguation to derive entity-specific predictions
Again, rules ranked by “reliability”
Rule (MS level → WS level): idm/pers rel {MS:sp} pers; rel
Given a word that is ambiguous between an idm and pers.
If the word “doctors” is followed by a rel (relationship), and then by sp (“preposition” - according to MS), then tag as pers.
Example: doctors said vs. Doctors Smith and Wesson
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
Extraction(Ruch et al ‘00)
◼ Extraction module processes the 3 level stream
(token → MS level → WS level)
◼ Switches on extraction mode when it reads a token tagged as id from the WS level
◼ Switches off when it hits a barrier (i.e., a token not tagged as id)
◼ Specialized rules to handle multi-part last names
(e.g., “van Winkle”)
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
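The on/off behavior of the extraction module amounts to a two-state scan over the tagged stream. The sketch below assumes the input is a list of (token, WS tag) pairs and that multi-part names such as "van Winkle" have already been tagged id token by token upstream; the function name and the non-id tag values in the usage are hypothetical.

```python
def extract_identifiers(tagged_tokens):
    """Collect maximal runs of tokens tagged 'id'.

    tagged_tokens: list of (token, ws_tag) pairs from the WS level.
    """
    spans = []
    current = []
    for token, tag in tagged_tokens:
        if tag == "id":
            # Extraction mode switches on (or stays on).
            current.append(token)
        elif current:
            # Barrier: a token not tagged as id closes the current span.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


stream = [("Doctors", "idm"), ("Smith", "id"), ("and", "conj"),
          ("Wesson", "id"), ("observed", "acto")]
print(extract_identifiers(stream))
```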
Evaluation(Ruch et al ‘00)
◼ 1000 medical documents from University Hospital Geneva
80,784 tokens
600 Post-operative reports
200 Laboratory and test results
200 discharge summaries
◼ Set A: 20% of documents for training
◼ Set B: 80% for testing
◼ If word is not observed in training – throw it out
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
Semantic Tagset (Ruch et al ‘00)
Rank   Tag     Frequency   Definition               Example
1      qual    0.101       Qualifier                fat
2      acto    0.095       General act              leave
3      loc     0.093       Organ / body location    liver
4      spat    0.087       Spatial concept          high
5      temp    0.053       Temporal concept         late
6      mod     0.051       Modal                    maybe
7      quant   0.047       Quantitative concept     five
8      papr    0.045       Pathological process     infection
9      find    0.042       Signs or symptoms        fever
10     cpt     0.041       Other concept            idea
…      …       …           …                        …
31     idm     0.006       Identity Marker          Dr.
??     id      <<0.001     Identifier Proper Noun   Louise
Semantic Tagset (Ruch et al ‘00)
[Figure: tag frequency (y-axis, 0 to 0.12) versus term rank (x-axis, 0 to 40), with the idm and id tags marked]
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
Evaluation (Ruch et al ‘00)
◼ Over 40 rules written based on Set A
Took ~3 weeks
Observed 124 identifiers in 16,456 tokens
◼ Six types of results
Result                                Count   Percent
Identifiers in corpus                 467     100%
Identifiers correctly removed (ICR)   452     96.8%
ICR + additional terms                8       1.7%
  Subtotal (removed)                          98.5%
Identifiers incompletely removed      3       0.6%
Identifiers left in text              4       0.9%
  Subtotal (not fully removed)                1.5%
Non-identifier tokens removed         0       0%
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
Semantic Lexicon(Ruch et al ‘00)
◼ Limitations
Requires exhaustive specification
Hand curation of rules
Claim of generalizability, but no proof
Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic
lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
Today’s Lecture
◼ Dictionaries & Rules
◼ Machine Learning and Trained Systems
◼ Resynthesis
Trained Semantic Templates(TBK ‘02)
◼ Tagged references for name and local context
e.g., “Johnny” tagged as name
e.g., “underwent” tagged as context
e.g., type of surgery, etc.
◼ Made logical relation of predicate and ordered list of 1
argument
◼ Predicates defined by word order, not spacing
◼ Calculated frequency of relations in training set
Taira R, Bui A, Kangarloo H. Identification of patient name references within medical documents. In
Proceedings of the 2002 AMIA Annual Fall Symposium. 2002; 757-761.
Example of Predicates
Predicate              Relative Frequency   Example
Patient-healthStatus   0.189                John was doing well
Patient-age            0.181                John is 3 years old
Patient-condition      0.140                John developed a fever
Patient-procedure      0.109                John received therapy
Patient-gender         0.108                John is a 5 year-old male
Patient-anaphora       0.102                John is a patient with …
Patient-ADT            0.061                John was discharged
Patient-relative       0.035                John’s mother
Patient-ethnicity      0.028                John is an Asian male
Patient-heightWeight   0.022                John is a chubby male
Trained Semantic Templates(TBK ‘02)
◼ Algorithm
For each token
◼ If token not excluded***
Locate all possible logical relational constructs relating to
an identifier that are associated with the token
For each construct
▪ Determine the probability that the token satisfies the
construct
▪ If the probability > threshold, then predict identifier
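The control flow of this algorithm can be sketched as a loop with three pluggable pieces. Everything named below is a hypothetical stand-in: the exclusion test, the construct finder, and the fixed probabilities replace the trained components, and the default threshold of 0.55 is borrowed from the operating point reported later in the lecture.

```python
def predict_identifiers(tokens, find_constructs, score, is_excluded, threshold=0.55):
    """Return indices of tokens predicted to be identifiers."""
    predicted = []
    for i in range(len(tokens)):
        if is_excluded(tokens, i):
            continue
        for construct in find_constructs(tokens, i):
            # One construct above threshold is enough to predict an identifier.
            if score(tokens, i, construct) > threshold:
                predicted.append(i)
                break
    return predicted


# --- Toy stand-ins for the trained components (assumptions, not the real model) ---
QUALIFIERS = {"syndrome", "disease", "procedure"}
TOY_PROBABILITY = {"Patient-age": 0.8, "Patient-gender": 0.6}


def toy_is_excluded(tokens, i):
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return not tokens[i][0].isupper() or nxt in QUALIFIERS


def toy_find_constructs(tokens, i):
    rest = [t.lower() for t in tokens[i + 1:]]
    constructs = []
    if "old" in rest:
        constructs.append("Patient-age")
    if "male" in rest or "female" in rest:
        constructs.append("Patient-gender")
    return constructs


def toy_score(tokens, i, construct):
    return TOY_PROBABILITY[construct]


tokens = "John is a 5 year old male".split()
print(predict_identifiers(tokens, toy_find_constructs, toy_score, toy_is_excluded))
```

Raising the threshold trades recall for precision, which is the knob behind the operating point discussed with the evaluation results.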
Exclusions
◼ Drug name list (6,200 entries)
◼ Part of physician name based on tokens
e.g., “Dr.”, “M.D.”
◼ Followed by diagnostic qualifier
e.g., “Syndrome”, “Disease”, “Procedure”
◼ Part of department or institution
e.g., “Medical Center”
◼ Associated with article / determiner attachment
Classification Model(TBK ‘02)
◼ 2-class problem w / maximum entropy model and log-linear basis
◼ Constructs “learned” from training set
Example: John is a 5 year old male with disease X…
Construct: isofAge(John, 5 year old)
Construct: isofGender(John, male)
b = vector of terms associated with construct
f_i = “indicator” functions associated with terms, such as word ordering
λ_i = weight associated with feature f_i (from training)
Z = normalization constant (mass over all classes of predictions)
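The model equation itself did not survive in this transcript. Assuming the standard conditional maximum entropy (log-linear) form, with a class variable c (name vs. not a name) introduced here for illustration and λ_i denoting the weight of feature f_i, the definitions above fit together as:

```latex
P(c \mid b) = \frac{1}{Z(b)} \exp\Big( \sum_{i} \lambda_i \, f_i(b, c) \Big),
\qquad
Z(b) = \sum_{c'} \exp\Big( \sum_{i} \lambda_i \, f_i(b, c') \Big)
```

With two classes, a token is predicted to be a name when P(name | b) exceeds the chosen threshold.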
Alternative Models
◼ Any trained term-level classifier can be applied
Support vector machines
Naïve Bayes
Boosted Decision Trees
Conditional Random Fields
Recurrent Neural Networks (Deep Learning)
Uzuner, et al. A de-identifier for medical discharge summaries. Artif Intell Med. 2008;42: 13-35.
Wellner, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc. 2007;14:564-73.
Conditional Random Fields (CRFs)
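◼ Define the conditional probability of a tag (i.e., label) sequence a = a_1, …, a_n
given an observed sequence of tokens b = b_1, …, b_n as
$$P(a \mid b) = \frac{1}{Z(b)} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(a_t, a_{t-1}, b, t) \right)$$
where each f_k is a feature function, λ_k is its weight, and Z(b) normalizes over all label sequences
Wellner B, et al. Rapidly retargetable approaches to de-identification in medical records. Journal of the
American Medical Informatics Association. 2007: 564-573.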
Conditional Random Fields (CRFs)
◼ Each feature function is basically a predicate over
a particular configuration of the observation
relative to the current position t, for a particular
label pair a_t, a_{t-1}
◼ Feature weights indicate how strongly the
predicate (over the observations) correlates with a
particular label pair
Wellner B, et al. Rapidly retargetable approaches to de-identification in medical records. Journal of the
American Medical Informatics Association. 2007: 564-573.
CRF Example
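◼ Feature for a contextual cue for “Dr.”
An indication we’re about to begin a DOCTOR phrase
$$f(a_t, a_{t-1}, b, t) = \begin{cases} 1 & \text{if } a_t = B_D \text{ and } a_{t-1} = O \text{ and } \mathrm{WORD}(b_{t-1}) = \text{“Dr”} \\ 0 & \text{otherwise} \end{cases}$$
Token:   Copy   to   Dr.   Stone   ,   U.   BATESSE   HOSPITAL
Label:   O      O    O     BD      O   BH   IH        IH
O = outside of a phrase
BD = beginning of a “doctor” phrase
BH = beginning of a “hospital” phrase
IH = inside or end of a “hospital” phrase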
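A feature function of this kind is just an indicator over the current label, the previous label, and the observation. The sketch below mirrors the "Dr." cue; the function name is hypothetical, and stripping the trailing period before comparing with "Dr" is an assumption about tokenization.

```python
def f_dr_begins_doctor(a_t, a_prev, b, t):
    """Return 1 if a_t is BD, a_{t-1} is O, and the previous word is 'Dr'; else 0."""
    if t == 0:
        return 0  # no previous token at the start of the sequence
    prev_word = b[t - 1].rstrip(".")  # assumed normalization: 'Dr.' -> 'Dr'
    return int(a_t == "BD" and a_prev == "O" and prev_word == "Dr")


tokens = ["Copy", "to", "Dr.", "Stone", ",", "U.", "BATESSE", "HOSPITAL"]
labels = ["O", "O", "O", "BD", "O", "BH", "IH", "IH"]

fired = [
    f_dr_begins_doctor(labels[t], labels[t - 1] if t > 0 else None, tokens, t)
    for t in range(len(tokens))
]
print(fired)
```

The feature fires only at "Stone", the one position where a doctor phrase begins right after the "Dr." cue.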
Conditional Random Fields (CRFs)
◼ Feature weights (lambdas) come from
maximizing the conditional log likelihoods of the
training data D
◼ Latter term is a penalty to prevent overfitting
◼ Maximization achieved through iterative gradient
descent on the function
◼ Most likely label sequence from Viterbi (dynamic
programming) algorithm
Wellner B, et al. Rapidly retargetable approaches to de-identification in medical records. Journal of the
American Medical Informatics Association. 2007: 564-573.
$$\mathrm{likelihood}_{\Lambda}(D) = \sum_{(a, b) \in D} \log P(a \mid b) + R_{\sigma}(\Lambda)$$
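Once the weights are trained, decoding picks the label sequence with the highest total score. The following is a minimal Viterbi sketch for a linear-chain model, assuming a caller-supplied `score(a_t, a_prev, tokens, t)` that returns the weighted feature sum at position t (with `a_prev` set to None at t = 0). The toy scorer and its weights are invented for illustration and are not trained values.

```python
def viterbi(tokens, labels, score):
    """Return the highest-scoring label sequence under a linear-chain model."""
    if not tokens:
        return []
    # best[t][a]: best total score of any label path ending in label a at position t.
    best = [{a: score(a, None, tokens, 0) for a in labels}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for a in labels:
            prev, value = max(
                ((p, best[t - 1][p] + score(a, p, tokens, t)) for p in labels),
                key=lambda pair: pair[1],
            )
            best[t][a] = value
            back[t][a] = prev
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda a: best[-1][a])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def toy_score(a, a_prev, tokens, t):
    """Hypothetical weights: a mild preference for O plus the 'Dr.' cue."""
    total = 0.0
    if a == "O":
        total += 0.5
    if t > 0 and tokens[t - 1] == "Dr." and a == "BD" and a_prev == "O":
        total += 2.0
    return total


print(viterbi(["Copy", "to", "Dr.", "Stone"], ["O", "BD"], toy_score))
```

The dynamic program costs O(n · |labels|²) score evaluations, instead of enumerating all |labels|^n sequences.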
Biasing the CRF
◼ Works on the “outside of phrase” scenario
$$f_O(a_t, a_{t-1}, b, t) = \begin{cases} 1 & \text{if } a_t = O \\ 0 & \text{otherwise} \end{cases}$$
◼ λ_O: weight for the corresponding feature
◼ Large negative values → label tokens as identifiers
◼ Large positive values → label tokens as non-identifiers
◼ Tune λ_O using a Gauss-Newton line search [see Machine Learning Course]
◼ Terminate when evaluation results (e.g., recall) differ by a
small amount (e.g., 0.01%)
Minkov E, et al. NER Systems that suit user’s preferences: adjusting the recall-precision tradeoff for entity
extraction. Proceedings of the Human Language Technology Conference of the NAACL. 2006: 93-96.
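The effect of the O-feature weight can be seen in a stripped-down setting. The sketch below ignores label transitions and treats each token independently, with made-up base scores, so it only illustrates the direction of the bias; it does not implement the Gauss-Newton line search, and the names `decode`, `recall`, and `lambda_o` are hypothetical.

```python
def decode(base_scores, lambda_o):
    """Label each token 'O' or 'PHI'; f_O adds lambda_o only to the O score."""
    labels = []
    for s in base_scores:
        o_score = s["O"] + lambda_o
        labels.append("O" if o_score >= s["PHI"] else "PHI")
    return labels


def recall(predicted, gold):
    """Fraction of gold PHI tokens that were predicted as PHI."""
    hits = sum(p == g == "PHI" for p, g in zip(predicted, gold))
    return hits / sum(g == "PHI" for g in gold)


# Made-up per-token scores (excluding the f_O term) and gold labels.
base_scores = [
    {"O": 1.0, "PHI": 0.2},
    {"O": 0.6, "PHI": 0.5},
    {"O": 0.3, "PHI": 0.9},
    {"O": 0.8, "PHI": 0.7},
]
gold = ["O", "PHI", "PHI", "PHI"]

for lambda_o in (1.0, 0.0, -0.5):
    print(lambda_o, recall(decode(base_scores, lambda_o), gold))
```

Lowering λ_O moves borderline tokens into the identifier class, raising recall at the cost of precision, which suits de-identification when leaked identifiers are costlier than over-scrubbing.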
Bigger Picture of the Process
L. Deleger, et al. Large-scale
evaluation of automated clinical note
de-identification and its impact on
information extraction. J Am Med
Inform Assoc. 2013 Jan 1;20(1):84-94.
Trained Semantic Templates(TBK ‘02)
◼ Trained system on 1350 pediatric reports
Tagged references for name and local context
◼ e.g., “Johnny” tagged as name
◼ e.g., “underwent” tagged as context
◼ e.g., type of surgery
Taira R, Bui A, Kangarloo H. Identification of patient name references within medical documents. In
Proceedings of the 2002 AMIA Annual Fall Symposium. 2002; 757-761.
Trained Semantic Templates(TBK ‘02)
◼ Out of 1350 pediatric reports
36% of documents contained name
907 name instances found in non-header
information
◼ Tested with 900 records
Area under the ROC curve of 0.9735
Operating point of 0.55 threshold
◼ 99.2% Precision and 93.9% Recall
Is this good enough?
◼ False Positives
Valid name syntax, but semantically
incorrect
Trained Semantic Templates (TBK ‘02)
◼ False Positives
Valid name syntax, but semantically incorrect
◼ “Dear Mark, Robert was in our office today”
Identification of a patient’s relative rather than the patient
◼ “Johnny’s sister Mary is 7 years old”
Patient and physician have same name
Rare use of gender description not describing Patient name
◼ “Tanner 4 female”
Drug names that could not be ruled out
Medical conditions that could not be ruled out
Trained Semantic Templates (TBK ‘02)
◼ Limitations
False Negatives?
Logical relation not modeled
Grammatically difficult expressions
Only name references, not all HIPAA identifiers
May require retraining for each type of dataset
(pediatric versus cardiology)
May identify false semantic templates in training
(Back to) The AMIA “Bakeoff”
◼ 2006 Natural Language Processing Challenge at
the American Medical Informatics Association
Annual Symposium (AMIA)
◼ 889 records from Partners Healthcare (Boston)
669 for training, 220 for testing
◼ Classes for challenge:
Patients
Doctors
Hospitals
IDs
Dates
Locations
Phone #’s
Ages over 90
O. Uzuner, et al. J Am Med Inform Assoc. 2007 Sep-Oct; 14(5): 550–563
Performance Measure Levels
◼ Token–level scores: measure performance for each token
◼ Instance–level scores*: Extends the model to account for…
Type: PHI type in the instance (e.g., patient vs. doctor)
Content: terms included in the instance (e.g., “Dr. John Smith”)
Extent: beginning and ending of the instance
*Model used by NIST in their named entity recognition (NER) tasks
Instance-Level Performance
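◼ Instead of true and false
we have correct, incorrect, missing, and spurious
◼ Total number of correct entities
$$C = \sum_{e=1}^{\#\text{entities}} c_e, \quad \text{where } c_e = \begin{cases} 1 & \text{if type, content, \& extent are all correct} \\ 0 & \text{otherwise} \end{cases}$$
◼ Substitution Error
$$S = \sum_{e=1}^{\#\text{entities}} s_e, \quad \text{where } s_e = \begin{cases} 1 & \text{if one of type, content, \& extent is incorrect} \\ 0 & \text{otherwise} \end{cases}$$
◼ Insertion Error
$$I = \sum_{e=1}^{\#\text{entities}} i_e, \quad \text{where } i_e = \begin{cases} 1 & \text{if type, content, \& extent are all spurious} \\ 0 & \text{otherwise} \end{cases}$$
◼ Deletion Error
$$D = \sum_{e=1}^{\#\text{entities}} d_e, \quad \text{where } d_e = \begin{cases} 1 & \text{if type, content, \& extent are all missing} \\ 0 & \text{otherwise} \end{cases}$$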
Examples
◼ 12/07
Ground truth = 12 → date; 07 → non-PHI
Predict 12/07 is ID → Incorrect type
Predict 12/07 is date
◼ Incorrect type, and
◼ Substitution error
◼ Usually: all partial matches are considered substitution errors
◼ Mendelian Gene
Ground truth = non-PHI
Predict Mendelian Gene is Name
◼ Spurious type, content, and extent → Insertion error
◼ John Smith
Ground truth = name
Predict non-PHI for both
◼ Missing type, content, and extent → Deletion error
Instance Level Metrics
◼ Instance Level Precision (ILP)
C / (C + S + I)
◼ Instance Level Recall (ILR)
C / (C + S + D)
◼ F-measure = 2* ILP * ILR / (ILP + ILR)
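These three formulas translate directly into code. The counts in the usage line are invented for illustration; the function name and the choice to return 0.0 for an empty denominator are assumptions of this sketch.

```python
def instance_level_metrics(correct, substitutions, insertions, deletions):
    """Return (ILP, ILR, F-measure) from instance-level counts C, S, I, D."""
    precision_denominator = correct + substitutions + insertions
    recall_denominator = correct + substitutions + deletions
    ilp = correct / precision_denominator if precision_denominator else 0.0
    ilr = correct / recall_denominator if recall_denominator else 0.0
    f_measure = 2 * ilp * ilr / (ilp + ilr) if (ilp + ilr) else 0.0
    return ilp, ilr, f_measure


# Hypothetical counts: 90 correct, 4 substitutions, 6 insertions, 6 deletions.
print(instance_level_metrics(90, 4, 6, 6))
```

Substitutions count against both precision and recall, so a partially matched instance is penalized twice relative to a clean miss or a clean false alarm.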
Significance Testing
◼ Null hypothesis: absolute difference in performances of two systems
on your favorite criteria (recall, precision, F) is ~ 0.
◼ Randomization technique: Shuffle a system’s responses to “units” in
the test set N times (e.g., N = 9999).
◼ Create N pairs of pseudo-systems
◼ Count the number of times, n, when difference between the
performances of the pseudo-system pairs is greater than the
difference between the performance of the two actual systems
s = (n + 1) / (N + 1)
◼ If s > threshold, then the difference is explained by chance. Otherwise
difference is significant at the threshold level.
AMIA Test: Threshold set to 0.1; “unit” equals all tokens (or instances) in a record
N. Chinchor. The statistical significance of the MUC-4 Results. In Proceedings of the 4th Conference on
Message Understanding. 1992: 30-50.
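The randomization procedure can be sketched as follows. Several simplifications are assumed: each system's response to a unit is reduced to a single per-record score, the metric is the mean of those scores, and pseudo-systems are formed by swapping the two systems' responses on each unit with probability 0.5. Ties are counted toward n (>= rather than >) so that two identical systems yield s = 1; that choice is also an assumption of this sketch.

```python
import random


def mean(values):
    return sum(values) / len(values)


def approximate_randomization(responses_a, responses_b, metric=mean,
                              n_shuffles=9999, seed=0):
    """Return s = (n + 1) / (N + 1) for the difference between two systems.

    responses_a[i] and responses_b[i] are the two systems' responses on unit i.
    """
    rng = random.Random(seed)
    actual = abs(metric(responses_a) - metric(responses_b))
    n = 0
    for _ in range(n_shuffles):
        pseudo_a, pseudo_b = [], []
        for a, b in zip(responses_a, responses_b):
            if rng.random() < 0.5:
                a, b = b, a  # shuffle: swap the systems' responses on this unit
            pseudo_a.append(a)
            pseudo_b.append(b)
        if abs(metric(pseudo_a) - metric(pseudo_b)) >= actual:
            n += 1
    return (n + 1) / (n_shuffles + 1)


# Hypothetical per-record recall for two systems on 20 records.
system_a = [1.0] * 20
system_b = [0.0] * 20
print(approximate_randomization(system_a, system_b, n_shuffles=999))
```

With the AMIA threshold of 0.1, a returned s above 0.1 would mean the observed difference is not significant at that level.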
Competition Systems
O. Uzuner, et al. J Am Med Inform Assoc. 2007 Sep-Oct; 14(5): 550–563
Overall
O. Uzuner, et al. J Am Med Inform Assoc. 2007 Sep-Oct; 14(5): 550–563
Rules
Rules
Instance-Level Comparison
Token-Level Comparison
Some Observations
◼ Machine learning outperforms rule-based
systems
◼ Performance is never optimal
E.g., phone numbers with strange formats
◼ Machine learning can overfit
E.g., training on Mr. Smith / Mrs. Jones, when test set is
John Smith / Jane Jones
Software: From Theory to Practice
with Conditional Random Fields
HIDE (Gardner & Xiong 2009)
J. Gardner, L. Xiong. An integrated framework for de-identifying unstructured medical data. Data and Knowledge Engineering
(DKE), 2009; 68(12).
MIST (Aberdeen et al 2010)
J. Aberdeen, et al. The MITRE Identification Scrubber Toolkit: design, training, and assessment. Int J Med Inform.
2010;79(12):849-59.
MIST Installation & Training
CRFs (MIST) Beyond AMIA (Aberdeen et al. 2010)
            Discharge   Laboratory   Letter   Order   All
Train       200         400          200      400     1200
Test        50          100          50       100     300
Precision   0.946       0.905        0.931    0.993   0.943
Recall      0.986       0.966        0.956    0.999   0.978
Precision: 0.91 – 0.99 Recall: 0.95 – 0.99
◼ Vanderbilt’s EMR (No Name or Place Dictionaries invoked)
◼ Specialized version of DE-ID provides the “gold standard”
◼ De-identification model based on CRFs
◼ Four document classes: Discharge Summaries (DS), Letters, Labs, Orders
Findings With Diverse EMRs
◼ Another alternative: MCRF (Mallet Conditional Random Field) - Cincinnati
Children’s Hospital
◼ Not all types of identifiers are found at the same rate
◼ ~3500 clinical notes over 22 note types; > 30,000 identifiers
Virtually indistinguishable from human de-identification
L. Deleger, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J
Am Med Inform Assoc. 2013 Jan 1;20(1):84-94.
Negligible Impact on Medication Extraction (Deleger et al. 2013)
◼ CRF Scrubbing @ Cincinnati
◼ ~3500 clinical notes over 22 note types
            Original Notes   Scrubbed Notes
Precision   96.3             96.3 – 96.5
Recall      89.3             88.9 – 89.5
F-measure   92.6             92.5 – 92.7
Can you Trust Synthetic Results?
◼ AMIA challenge used resynthesized data
◼ Do the results hold true on real data?
◼ What are the limits of resynthesis?
An Experimental Model(Yeniterzi et al, 2010)
◼ OO: de-identification model was trained and tested
with original medical records
replicates ideal training and evaluation
◼ RR: model was trained and tested with resynthesized
medical records
replicates AMIA evaluation
◼ OR: model was trained with original and tested on
resynthesized
replicates ideal training and evaluation
◼ RO: model was trained with resynthesized and tested
with original
replicates “off the shelf” application
R. Yeniterzi, et al. Effects of personal identifier resynthesis on clinical text de-identification. Journal of the American Medical
Informatics Association. 2010; 17: 59-68.
Environment(Yeniterzi et al, 2010)
◼ Vanderbilt’s EMR
◼ Specialized version of DE-ID provides the “gold standard”
◼ De-identification model based on CRFs in MIST
◼ Resynthesis is improved model of AMIA (more realistic in replaced terms)
◼ Four document classes: Discharge Summaries (DS), Letters, Labs, Orders
◼ Fifth class uses 50 documents from each class for train, and all test
documents
               Evaluation
Record Class   Train   Test
DS             200     50
LETTER         200     50
LAB            400     100
ORDER          400     100
HYBRID         200     300
A Systemic Analysis(Yeniterzi et al, 2010)
[Figure: workflow diagram. Labeled components: Original Records; DE-ID + ReLinking; Annotated Original Records; Resynthesis Engine; Resynthesized Records; TRAIN Orig, TEST Orig, TRAIN Resynth, TEST Resynth; MITRE Model Builder (Carafe); MIST ORIG MODEL; MIST RESYNTH MODEL; Exp 1 to Exp 4 producing Score O-O, Score R-R, Score O-R, and Score R-O]
Record Class   Recall   Precision   F-Measure   Accuracy   PHI Exposure (1 - label-blind recall)
OO Experiment
DS             0.986    0.946       0.966       0.993      0.014
LAB            0.966    0.905       0.935       0.983      0.034
LETTER         0.956    0.931       0.944       0.986      0.040
ORDER          0.999    0.993       0.996       0.999      0.001
AGGREGATE      0.978    0.943       0.960       0.990      0.022
HYBRID         0.962    0.925       0.943       0.986      0.035
RR Experiment
DS             0.986    0.972       0.979       0.998      0.010
LAB            0.995    0.991       0.993       0.999      0.005
LETTER         0.965    0.962       0.963       0.996      0.032
ORDER          0.990    0.989       0.989       0.999      0.010
AGGREGATE      0.983    0.977       0.980       0.998      0.014
HYBRID         0.970    0.960       0.965       0.997      0.022
OR Experiment
DS             0.871    0.919       0.894       0.990      0.101
LAB            0.731    0.843       0.783       0.987      0.268
LETTER         0.832    0.910       0.869       0.987      0.155
ORDER          0.788    0.984       0.875       0.992      0.212
AGGREGATE      0.816    0.913       0.862       0.989      0.171
HYBRID         0.842    0.911       0.875       0.990      0.147
RO Experiment
DS             0.674    0.887       0.766       0.961      0.324
LAB            0.348    0.723       0.470       0.899      0.652
LETTER         0.769    0.852       0.808       0.955      0.224
ORDER          0.766    0.834       0.799       0.926      0.234
AGGREGATE      0.642    0.841       0.728       0.942      0.355
HYBRID         0.404    0.789       0.535       0.914      0.592
Readings for Next Lecture
◼ E. Ratliff. Writer Evan Ratliff tried to vanish: here's what happened. Wired. November 20, 2009.
◼ V. Blue. Strava's fitness heatmaps are a "potential catastrophe". Engadget. February 2, 2018.
Optional
◼ A. Acquisti and R. Gross. Predicting Social Security Numbers from public data. Proceedings of the
National Academy of Sciences USA. 2009; 106(27): 10975-10980.
◼ V. Griffith and M. Jakobsson. Messin' with Texas: Deriving mother's maiden names using public
records. Proceedings of the Applied Cryptography and Network Security Conference. 2005: 91-
103.
◼ S. Munson, et al. Attitudes towards online availability of US public records. Proceedings of the
12th Annual International Digital Government Research Conference. 2011: 2-9.
◼ G. Friedland, et al. Sherlock Holmes' evil twin: on the impact of global inference for online privacy.
Proceedings of the New Security Paradigms Workshop. 2011: 105-114.