Page 1: Data Privacy in Biomedicine Dictionaries and Rules Lecture ... · 1 Data Privacy in Biomedicine Lecture 7: More Scrubbing Bradley Malin, PhD (b.malin@vanderbilt.edu) Professor of


Data Privacy in Biomedicine

Lecture 7: More Scrubbing

Bradley Malin, PhD (b.malin@vanderbilt.edu)

Professor of Biomedical Informatics, Biostatistics, & Computer Science

Vanderbilt University

February 5, 2020

© 2020 Bradley Malin 2Data Privacy in Biomedicine: Lecture 7 – Scrubbing

Today’s Lecture

◼ Dictionaries and Rules

Concept Match [B2W]

MedLEE [B2W]

Lexicon [W2B]

◼ Machine Learning and Trained Systems

◼ Resynthesis


Concept Match (Berman ’03)

◼ Clinical Concept Dictionary-based approach

◼ If a word is in the dictionary, it remains in the document; otherwise it is removed

◼ Each retained concept is swapped for “synonym”

◼ Retains high frequency stop words

Berman JJ. Concept-match medical data scrubbing: how pathology text can be used in research. Archives of Pathology and Laboratory Medicine. 2003; 127(6): 680-686.

Matching Process (Berman ‘03)

1. Parse all input into sentences

2. Parse each sentence into words

3. Each stop word (high-frequency) is preserved in its original place

E.g., “the”, “a”, “of”

4. Map remaining words / phrases to a standard nomenclature (e.g., UMLS)

Larger terms subsume smaller substrings

5. Replace each mapped term by an alternate term that maps to the same concept code

E.g., “renal cell carcinoma” → C0007134 → “rcc” or “hypernephroma”

6. Non-mapped words are blocked out

Berman JJ. Concept-match medical data scrubbing: how pathology text can be used in research. Archives of

Pathology and Laboratory Medicine. 2003; 127(6): 680-686.
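
The six matching steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Berman's implementation: the stop-word set, the tiny dictionary, the `TOY…` concept codes, the `*` blocking symbol, and all function names are made up for the example; only the C0007134 mapping comes from the slide.

```python
import re

# Toy data for illustration only; a real system would load UMLS terms.
STOP_WORDS = {"the", "a", "of", "has"}
CONCEPT_OF = {
    "renal cell carcinoma": "C0007134",  # code taken from the slide
    "rcc": "C0007134",
    "hypernephroma": "C0007134",
    "carcinoma": "TOY0001",              # made-up code, no synonym listed
    "heart attack": "TOY0002",           # made-up code
    "myocardial infarction": "TOY0002",
}
TERMS_OF = {}  # concept code -> terms, in insertion order
for _term, _code in CONCEPT_OF.items():
    TERMS_OF.setdefault(_code, []).append(_term)
MAX_WORDS = max(len(term.split()) for term in CONCEPT_OF)
BLOCK = "*"  # assumed symbol for blocked-out words


def alternate_term(phrase):
    """Return another term with the same concept code, if one exists."""
    for term in TERMS_OF[CONCEPT_OF[phrase]]:
        if term != phrase:
            return term
    return phrase


def scrub_sentence(sentence):
    words = re.findall(r"[a-z0-9]+", sentence.lower())  # step 2
    out, i = [], 0
    while i < len(words):
        if words[i] in STOP_WORDS:                      # step 3
            out.append(words[i])
            i += 1
            continue
        # Step 4: try the longest phrase first so larger terms win.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in CONCEPT_OF:
                out.append(alternate_term(phrase))      # step 5
                i += n
                break
        else:
            out.append(BLOCK)                           # step 6
            i += 1
    return " ".join(out)


def scrub(text):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # step 1
    return [scrub_sentence(s) for s in sentences if s]
```

For example, `scrub_sentence("John Smith has renal cell carcinoma.")` yields `"* * has rcc"`: the name is not in the dictionary and is blocked, while the diagnosis is swapped for a synonym.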


UMLS

◼ Unified Medical Language System metathesaurus

◼ Very large / multi-purpose / multi-lingual vocabulary of biomedical and health-related concepts

◼ Over 100 different sources

International Classification of Diseases (ICD)

Current Procedural Terminology (CPT)

◼ Over 2 million medical terms

◼ Over 900,000 medical concepts

https://www.nlm.nih.gov/research/umls/

Concept Matching Sample (Berman ’03)

Match Limits (Berman ‘03)

◼ There are some limitations

Misspelled terms are automatically dropped

Synonym replacement may obscure semantic meaning

Does not handle ambiguous terms

Terms in dictionaries may be sensitive (e.g., “homicide”, “abuse”, ...)


Today’s Lecture

◼ Dictionaries and Rules

Concept Match [B2W]

MedLEE [B2W]

Lexicon [W2B]

◼ Machine Learning and Trained Systems

◼ Resynthesis

Another Variation on Concepts (Morrison ’09)

◼ MedLEE - Medical Language Extraction & Encoding System

http://www.medlingmap.org/taxonomy/term/80

◼ 100 clinical follow-up notes (PHI annotated by a human)

| PHI Type | Instances of PHI | Instances in Output | % Leaked |
|---|---|---|---|
| Age > 89 | 7 | 5 | 71.4% |
| Clinician | 157 | 6 | 3.8% |
| Date | 300 | 0 | 0% |
| Hospital | 100 | 7 | 7% |
| Location | 45 | 3 | 6.7% |
| Patient | 126 | 4 | 3.2% |
| Telephone | 33 | 1 | 3.0% |
| IDs | 41 | 0 | 0% |
| Total | 809 | 26 | 3.2% |

F. Morrison et al. Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes. Journal of the American Medical Informatics Association. 2009; 16: 37-39.
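
The “% Leaked” column is simply instances in output divided by instances of PHI. A short sketch that recomputes it from the counts in the table (the variable and function names are illustrative) makes the per-type differences easy to check; note that 5 of 7 ages over 89 works out to roughly 71.4%.

```python
# (PHI type, instances of PHI, instances in output), copied from the table.
ROWS = [
    ("Age > 89", 7, 5),
    ("Clinician", 157, 6),
    ("Date", 300, 0),
    ("Hospital", 100, 7),
    ("Location", 45, 3),
    ("Patient", 126, 4),
    ("Telephone", 33, 1),
    ("IDs", 41, 0),
]


def leak_rate(total, leaked):
    """Percentage of PHI instances that survive in the output."""
    return 100.0 * leaked / total


total_phi = sum(total for _, total, _ in ROWS)
total_leaked = sum(leaked for _, _, leaked in ROWS)

for name, total, leaked in ROWS:
    print(f"{name:<10} {leaked:>3}/{total:<4} {leak_rate(total, leaked):5.1f}%")
print(f"{'Total':<10} {total_leaked:>3}/{total_phi:<4} "
      f"{leak_rate(total_phi, total_leaked):5.1f}%")
```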

Another Variation on Concepts (Morrison ’09)

◼ Leaked location:

“st” (meaning “Street”, or “Saint” in a hospital name) was interpreted as part of an EKG

◼ Examples of leaked names

Colors: “Green” and “Brown”

Common English: “Rose”

Disease Names: “Dias” vs. “Dias Disease”

F. Morrison et al. Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes. Journal of the American Medical Informatics Association. 2009; 16: 37-39.

Today’s Lecture

◼ Dictionaries and Rules

Concept Match [B2W]

MedLEE [B2W]

Lexicon [W2B]

◼ Machine Learning and Trained Systems

◼ Resynthesis


Semantic Lexicon (Ruch et al ’00)

◼ Identifier discovery modeled as a knowledge extraction problem

◼ Exhaustively list patterns of names and numbers

“IDentity Marker” (IDM) [Sir, Mr., …], followed by terms, followed by an optional IDM [maybe MD]

◼ Think regular expressions

Expressions specified for dates, phone numbers (nnn-nnnn), …

◼ Replaces names with “X” and terms with “x”

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
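
As a rough illustration of the “think regular expression” idea, the sketch below masks capitalized terms that follow an identity marker and masks seven-digit phone numbers. The marker list, the patterns, and the choice to write “X” for names and “x” for phone numbers are assumptions made for this example; they are not the rules from Ruch et al.

```python
import re

# Assumed identity markers; the real lexicon is much larger.
IDM = r"\b(?:Sir|Mr\.|Mrs\.|Ms\.|Dr\.)"
# An identity marker followed by one or more capitalized terms.
NAME_AFTER_IDM = re.compile(IDM + r"((?:\s+[A-Z][a-z]+)+)")
# Phone numbers of the form nnn-nnnn.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def mask_names(match):
    """Keep the identity marker, replace each following name with 'X'."""
    whole, names = match.group(0), match.group(1)
    marker = whole[: len(whole) - len(names)]
    return marker + re.sub(r"[A-Z][a-z]+", "X", names)


def scrub(text):
    text = NAME_AFTER_IDM.sub(mask_names, text)
    return PHONE.sub("x", text)


print(scrub("Dr. Smith called 555-1234 about Mr. John Doe."))
# prints: Dr. X called x about Mr. X X.
```

A pattern this greedy would also mask any capitalized word that happens to follow a name (a weekday, for instance), which is one reason the approach needs exhaustive, hand-tuned rules.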

Semantic Lexicon (Ruch et al ’00)

◼ Performs word-sense disambiguation via a morpho-syntactic tagger (MS), then rule-based word sense (WS) tagging

Rules ranked by “reliability”

◼ Added terms to the MEDTAG Lexicon* (based on UMLS; contains 5131 entries) for identifier detection

Such as those focused on medical institutions, lists of drugs, and medical device names

◼ Detects if a potential identifier term is followed by an actual identifier

E.g., “Doctors observed” as opposed to “Doctors Smith and Johnson observed”

*P. Ruch et al. MEDTAG: tag-like semantics for medical document indexing. Proc AMIA Symp. 1999.

Semantic Disambiguation (Ruch et al ’00)

◼ Morpho-syntactic tagger (MS) makes the part-of-speech explicit

Tok Level → MS Level

Rule: v/cn {TOK:miss} ; np → cn ; pn

Given a word that is ambiguous between a v (verb) and a cn (common noun):

If the word “miss” is followed by a pn (proper noun), then tag as cn; otherwise pn

Example: “Little Miss Tuffet” vs. “I miss the diagnosis”

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.

Semantic Disambiguation (Ruch et al ’00)

◼ Rule-based word sense (WS) leverages the previous round of disambiguation to derive entity-specific predictions

Again, rules ranked by “reliability”

MS Level → WS Level

Rule: idm/pers ; rel {MS:sp} → pers ; rel

Given a word that is ambiguous between an idm and a pers:

If the word “doctors” is followed by a rel (relationship), and then by sp (“preposition”, according to MS), then tag as pers.

Example: “doctors said” vs. “Doctors Smith and Wesson”

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic

lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.

Extraction (Ruch et al ’00)

◼ Extraction module processes the 3-level stream (token → MS level → WS level)

◼ Switches on extraction mode when it reads a token tagged as id from the WS level

◼ Switches off when it hits a barrier (i.e., a token not tagged as id)

◼ Specialized rules handle multi-part last names (e.g., “van Winkle”)

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.
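
The on/off behaviour of the extraction module amounts to a small state machine over the tagged stream. The sketch below is only an illustration: the `(token, MS tag, WS tag)` tuple layout and the MS tag values are assumptions, and it presumes the upstream rules have already tagged every part of a multi-part name (such as “van”) as id.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (token, MS tag, WS tag); layout is assumed


def extract_identifiers(stream: List[Token]) -> List[List[str]]:
    """Collect maximal runs of tokens whose WS tag is 'id'."""
    spans: List[List[str]] = []
    current: List[str] = []
    for token, _ms_tag, ws_tag in stream:
        if ws_tag == "id":
            current.append(token)      # extraction mode is on
        elif current:
            spans.append(current)      # barrier reached: switch off
            current = []
    if current:                        # stream ended inside an identifier
        spans.append(current)
    return spans


stream = [
    ("Doctors", "cn", "idm"),
    ("Smith", "pn", "id"),
    ("and", "conj", "other"),
    ("Wesson", "pn", "id"),
    ("observed", "v", "acto"),
]
print(extract_identifiers(stream))  # prints: [['Smith'], ['Wesson']]
```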

Evaluation (Ruch et al ’00)

◼ 1000 medical documents from University Hospital Geneva

80,784 tokens

600 Post-operative reports

200 Laboratory and test results

200 discharge summaries

◼ Set A: 20% of documents for training

◼ Set B: 80% for testing

◼ If word is not observed in training – throw it out

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic

lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.


Semantic Tagset (Ruch et al ’00)

| Rank | Tag | Frequency | Definition | Example |
|---|---|---|---|---|
| 1 | qual | 0.101 | Qualifier | fat |
| 2 | acto | 0.095 | General act | leave |
| 3 | loc | 0.093 | Organ / body location | liver |
| 4 | spat | 0.087 | Spatial concept | high |
| 5 | temp | 0.053 | Temporal concept | late |
| 6 | mod | 0.051 | Modal | maybe |
| 7 | quant | 0.047 | Quantitative concept | five |
| 8 | papr | 0.045 | Pathological process | infection |
| 9 | find | 0.042 | Signs or symptoms | fever |
| 10 | cpt | 0.041 | Other concept | idea |
| … | … | … | … | … |
| 31 | idm | 0.006 | Identity Marker | Dr. |
| ?? | id | <<0.001 | Identifier Proper Noun | Louise |

Semantic Tagset (Ruch et al ’00)

[Figure: tag frequency (0 to 0.12) plotted against term rank (0 to 40), with the idm and id tags marked]

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic

lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.

Evaluation (Ruch et al ’00)

◼ Over 40 rules written based on Set A

Took ~3 weeks

Observed 124 identifiers in 16,456 tokens

◼ Six types of results

| Result | Count | Share | Combined share |
|---|---|---|---|
| Identifiers in corpus | 467 | 100% | |
| Identifiers correctly removed (ICR) | 452 | 96.8% | 98.5% (together with next row) |
| ICR + additional terms | 8 | 1.7% | |
| Identifiers incompletely removed | 3 | 0.6% | 1.5% (together with next row) |
| Identifiers left in text | 4 | 0.9% | |
| Non-identifier tokens removed | 0 | 0% | |

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.

Semantic Lexicon (Ruch et al ’00)

◼ Limitations

Requires exhaustive specification

Hand curation of rules

Claim of generalizability, but no proof

Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic

lexicon. In Proceedings of the 2000 AMIA Annual Fall Symposium. 2000; 729-733.


Today’s Lecture

◼ Dictionaries & Rules

◼ Machine Learning and Trained Systems

◼ Resynthesis

Trained Semantic Templates (TBK ’02)

◼ Tagged references for name and local context

E.g., “Johnny” tagged as name

E.g., “underwent” tagged as context

E.g., type of surgery, etc.

◼ Made logical relation of predicate and ordered list of 1 argument

◼ Predicates defined by word order, not spacing

◼ Calculated frequency of relations in training set

Taira R, Bui A, Kangarloo H. Identification of patient name references within medical documents. In

Proceedings of the 2002 AMIA Annual Fall Symposium. 2002; 757-761.

Example of Predicates

| Predicate | Relative Frequency | Example |
|---|---|---|
| Patient-healthStatus | 0.189 | John was doing well |
| Patient-age | 0.181 | John is 3 years old |
| Patient-condition | 0.140 | John developed a fever |
| Patient-procedure | 0.109 | John received therapy |
| Patient-gender | 0.108 | John is a 5 year-old male |
| Patient-anaphora | 0.102 | John is a patient with … |
| Patient-ADT | 0.061 | John was discharged |
| Patient-relative | 0.035 | John’s mother |
| Patient-ethnicity | 0.028 | John is an Asian male |
| Patient-heightWeight | 0.022 | John is a chubby male |

Trained Semantic Templates (TBK ’02)

◼ Algorithm

For each token

◼ If token not excluded***

Locate all possible logical relational constructs, relating to an identifier, that are associated with the token

For each construct

▪ Determine the probability that the token satisfies the construct

▪ If the probability > threshold, then predict identifier


Exclusions

◼ Drug name list (6,200 entries)

◼ Part of physician name based on tokens

e.g., “Dr.”, “M.D.”

◼ Followed by diagnostic qualifier

e.g., “Syndrome”, “Disease”, “Procedure”

◼ Part of department or institution

e.g., “Medical Center”

◼ Associated with article / determiner attachment

Classification Model (TBK ’02)

◼ 2-class problem w/ maximum entropy model and log-linear basis

◼ Constructs “learned” from training set

Example: John is a 5 year old male with disease X…

Construct: isofAge(John, 5 year old)

Construct: isofGender(John, male)

b = vector of terms associated with construct

f_i = “indicator” functions associated with terms, such as word ordering

λ_i = weight associated with feature i (from training)

Z = normalization constant (mass over all classes of predictions)
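
The slide's equation did not survive in this transcript. For orientation, the textbook form of a maximum entropy (log-linear) classifier, written with the symbols defined above, is sketched below. The class variable c (name reference vs. not) is an assumed symbol, and this is the standard formulation rather than a reproduction of the equation in Taira et al.

```latex
P(c \mid b) = \frac{1}{Z(b)} \exp\Big( \sum_i \lambda_i \, f_i(b, c) \Big),
\qquad
Z(b) = \sum_{c'} \exp\Big( \sum_i \lambda_i \, f_i(b, c') \Big)
```

With two classes, a token is predicted to be an identifier when P(c = name | b) exceeds the chosen threshold.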


Alternative Models

◼ Any trained term-level classifier can be applied

Support vector machines

Naïve Bayes

Boosted Decision Trees

Conditional Random Fields

Recurrent Neural Networks (Deep Learning)

Uzuner, et al. A de-identifier for medical discharge summaries. Artif Intell Med. 2008;42: 13-35.

Wellner, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc. 2007;14:564-73.

Conditional Random Fields (CRFs)

◼ Define the conditional probability of a tag (i.e., label) sequence a, given an observed sequence of tokens b, in terms of weighted feature functions

Wellner B, et al. Rapidly retargetable approaches to de-identification in medical records. Journal of the American Medical Informatics Association. 2007: 564-573.
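
The equation on this slide was an image and is missing here. The standard linear-chain CRF definition, written with the lecture's symbols (labels a, observations b, position t), is sketched below. The feature index k and sequence length n are assumed notation, and the exact typography of the original slide may differ.

```latex
P(a \mid b) = \frac{1}{Z(b)} \exp\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(a_t, a_{t-1}, b, t) \Big),
\qquad
Z(b) = \sum_{a'} \exp\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(a'_t, a'_{t-1}, b, t) \Big)
```

Each f_k is a feature function and each λ_k its learned weight; Z(b) sums over all possible label sequences so that the probabilities add to one.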

Conditional Random Fields (CRFs)

◼ Each feature function is basically a predicate over a particular configuration of the observation relative to the current position t, for a particular label pair a_t, a_{t-1}

◼ Feature weights indicate how strongly the predicate (over the observations) correlates with a particular label pair

Wellner B, et al. Rapidly retargetable approaches to de-identification in medical records. Journal of the American Medical Informatics Association. 2007: 564-573.

CRF Example

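◼ Feature for a contextual cue for “Dr.”

An indication we’re about to begin a DOCTOR phrase

Feature fires if $a_t = B_D$ and $a_{t-1} = O$ and $\mathrm{WORD}(b_{t-1}) = \text{“Dr”}$

| Token | Copy | to | Dr. | Stone | , | U. | BATESSE | HOSPITAL |
|---|---|---|---|---|---|---|---|---|
| Label | O | O | O | BD | O | BH | IH | IH |

O = outside of a phrase

BD = beginning of a “doctor” phrase

BH = beginning of a “hospital” phrase

IH = inside or end of a “hospital” phrase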
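
Written as code, the “Dr.” cue is an ordinary function of the label pair, the token sequence, and the position. The sketch below is illustrative: the function name, the plain-string label names, and the choice to strip the trailing period before comparing to “Dr” are assumptions.

```python
def f_dr_cue(a_t, a_prev, b, t):
    """Return 1 if we move from O to B_D and the previous word is 'Dr'."""
    if t == 0:
        return 0  # no previous token to inspect
    prev_word = b[t - 1].rstrip(".")
    return int(a_t == "BD" and a_prev == "O" and prev_word == "Dr")


tokens = ["Copy", "to", "Dr.", "Stone", ",", "U.", "BATESSE", "HOSPITAL"]
labels = ["O", "O", "O", "BD", "O", "BH", "IH", "IH"]

fired = [
    f_dr_cue(labels[t], labels[t - 1] if t > 0 else None, tokens, t)
    for t in range(len(tokens))
]
print(fired)  # prints: [0, 0, 0, 1, 0, 0, 0, 0]
```

The feature fires only at “Stone”, the one position where the label pair and the preceding word match the predicate.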

Conditional Random Fields (CRFs)

◼ Feature weights (lambdas) come from maximizing the conditional log likelihood of the training data D

$$\mathrm{likelihood}_{\Lambda}(D) = \sum_{(a,b) \in D} \log P(a \mid b) + R_{\sigma}(\Lambda)$$

◼ Latter term is a penalty to prevent overfitting

◼ Maximization achieved through iterative gradient ascent on the function (equivalently, gradient descent on its negative)

◼ Most likely label sequence from the Viterbi (dynamic programming) algorithm

Wellner B, et al. Rapidly retargetable approaches to de-identification in medical records. Journal of the American Medical Informatics Association. 2007: 564-573.
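
Decoding the most likely label sequence can be sketched with a generic Viterbi routine. This is a minimal illustration, not the implementation used by Wellner et al.: `score(t, prev_label, label)` stands in for the weighted feature sum at position t, and the toy weights in the example are invented.

```python
def viterbi(n_tokens, labels, score):
    """Best label path; score(t, prev, cur) has prev=None at t == 0.

    Assumes n_tokens >= 1.
    """
    best = [{a: score(0, None, a) for a in labels}]
    back = [{}]
    for t in range(1, n_tokens):
        best.append({})
        back.append({})
        for a in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + score(t, p, a))
            best[t][a] = best[t - 1][prev] + score(t, prev, a)
            back[t][a] = prev
    # Trace back from the best final label.
    path = [max(labels, key=lambda a: best[-1][a])]
    for t in range(n_tokens - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


tokens = ["Copy", "to", "Dr.", "Stone"]


def toy_score(t, prev, cur):
    """Invented weights: BD is penalized unless it follows 'Dr.' from O."""
    s = 0.0
    if cur == "BD":
        s -= 1.0
        if prev == "O" and t > 0 and tokens[t - 1] == "Dr.":
            s += 3.0
    return s


print(viterbi(len(tokens), ["O", "BD"], toy_score))
# prints: ['O', 'O', 'O', 'BD']
```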

Biasing the CRF

◼ Works on the “outside of phrase” scenario

$$f_O(a_t, a_{t-1}, b, t) = \begin{cases} 1 & \text{if } a_t = O \\ 0 & \text{otherwise} \end{cases}$$

◼ λ_O is the weight for the corresponding feature

◼ Large negative values → label tokens as identifiers

◼ Large positive values → label tokens as non-identifiers

◼ Tune λ_O using a Gauss-Newton line search [see Machine Learning Course]

◼ Terminate when evaluation results (e.g., recall) differ by a small amount (e.g., 0.01%)

Minkov E, et al. NER systems that suit user’s preferences: adjusting the recall-precision tradeoff for entity extraction. Proceedings of the Human Language Technology Conference of the NAACL. 2006: 93-96.


Bigger Picture of the Process

L. Deleger, et al. Large-scale

evaluation of automated clinical note

de-identification and its impact on

information extraction. J Am Med

Inform Assoc. 2013 Jan 1;20(1):84-94.

Trained Semantic Templates (TBK ’02)

◼ Trained system on 1350 pediatric reports

Tagged references for name and local context

◼ e.g., “Johnny” tagged as name

◼ e.g., “underwent” tagged as context

◼ e.g., type of surgery

Taira R, Bui A, Kangarloo H. Identification of patient name references within medical documents. In

Proceedings of the 2002 AMIA Annual Fall Symposium. 2002; 757-761.

Trained Semantic Templates (TBK ’02)

◼ Out of 1350 pediatric reports

36% of documents contained name

907 name instances found in non-header

information

◼ Tested with 900 records

ROC of 0.9735

Operating point of 0.55 threshold

◼ 99.2% Precision and 93.9% Recall

Is this good enough?

◼ False Positives

Valid name syntax, but semantically incorrect

◼ “Dear Mark, Robert was in our office today”

Identification of a patient’s relative rather than the patient

◼ “Johnny’s sister Mary is 7 years old”

Patient and physician have same name

Rare use of gender description not describing Patient name

◼ “Tanner 4 female”

Drug names that could not be ruled out

Medical conditions that could not be ruled out

Trained Semantic Templates (TBK ’02)

◼ Limitations

False Negatives?

Logical relation not modeled

Grammatically difficult expressions

Only name references, not all HIPAA identifiers

May require retraining for each type of dataset (pediatric versus cardiology)

May identify false semantic templates in training


(Back to) The AMIA “Bakeoff”

◼ 2006 Natural Language Processing Challenge at

the American Medical Informatics Association

Annual Symposium (AMIA)

◼ 889 records from Partners Healthcare (Boston)

669 for training, 220 for testing

◼ Classes for challenge:

Patients

Doctors

Hospitals

IDs

Dates

Locations

Phone #’s

Ages over 90

O. Uzuner, et al. J Am Med Inform Assoc. 2007 Sep-Oct; 14(5): 550–563


Performance Measure Levels

◼ Token–level scores: measure performance for each token

◼ Instance–level scores*: Extends the model to account for…

Type: PHI type in the instance (e.g., patient vs. doctor)

Content: terms included in the instance (e.g., “Dr. John Smith”)

Extent: beginning and ending of the instance

*Model used by NIST in their named entity recognition (NER) tasks

Instance-Level Performance

◼ Instead of true and false, we have correct, incorrect, missing, and spurious

◼ Total number of correct entities

$$C = \sum_{e=1}^{\#\text{entities}} c_e, \quad \text{where } c_e = \begin{cases} 1 & \text{if type, content, \& extent are all correct} \\ 0 & \text{otherwise} \end{cases}$$

◼ Substitution Error

$$S = \sum_{e=1}^{\#\text{entities}} s_e, \quad \text{where } s_e = \begin{cases} 1 & \text{if one of type, content, \& extent is incorrect} \\ 0 & \text{otherwise} \end{cases}$$

◼ Insertion Error

$$I = \sum_{e=1}^{\#\text{entities}} i_e, \quad \text{where } i_e = \begin{cases} 1 & \text{if type, content, \& extent are all spurious} \\ 0 & \text{otherwise} \end{cases}$$

◼ Deletion Error

$$D = \sum_{e=1}^{\#\text{entities}} d_e, \quad \text{where } d_e = \begin{cases} 1 & \text{if type, content, \& extent are all missing} \\ 0 & \text{otherwise} \end{cases}$$

Examples

◼ 12/07

Ground truth = 12 → date; 07 → non-PHI

Predict 12/07 is ID → Incorrect type

Predict 12/07 is date

◼ Incorrect type, and

◼ Substitution error

◼ Usually: all partial matches are considered substitution errors

◼ Mendelian Gene

Ground truth = non-PHI

Predict Mendelian Gene is Name

◼ Spurious type, content, and extent → Insertion error

◼ John Smith

Ground truth = name

Predict non-PHI for both

◼ Missing type, content, and extent → Deletion error


Instance Level Metrics

◼ Instance Level Precision (ILP)

C / (C + S + I)

◼ Instance Level Recall (ILR)

C / (C + S + D)

◼ F-measure = 2* ILP * ILR / (ILP + ILR)
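
The three metrics follow directly from the counts C, S, I, and D. The sketch below uses invented counts purely to show the arithmetic; the function name and the zero-denominator handling are assumptions.

```python
def instance_level_scores(c, s, i, d):
    """Return (ILP, ILR, F) from correct, substitution, insertion, deletion counts."""
    ilp = c / (c + s + i) if (c + s + i) else 0.0
    ilr = c / (c + s + d) if (c + s + d) else 0.0
    f = 2 * ilp * ilr / (ilp + ilr) if (ilp + ilr) else 0.0
    return ilp, ilr, f


# Invented counts: 90 correct, 5 substitutions, 5 insertions, 10 deletions.
ilp, ilr, f = instance_level_scores(90, 5, 5, 10)
print(f"ILP={ilp:.3f} ILR={ilr:.3f} F={f:.3f}")
# prints: ILP=0.900 ILR=0.857 F=0.878
```

Substitution errors count against both precision and recall, while insertions hurt only precision and deletions hurt only recall.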

Significance Testing

◼ Null hypothesis: absolute difference in performances of two systems on your favorite criteria (recall, precision, F) is ~ 0.

◼ Randomization technique: Shuffle a system’s responses to “units” in the test set N times (e.g., N = 9999).

◼ Create N pairs of pseudo-systems

◼ Count the number of times, n, when the difference between the performances of the pseudo-system pairs is greater than the difference between the performances of the two actual systems

s = (n + 1) / (N + 1)

◼ If s > threshold, then the difference is explained by chance. Otherwise, the difference is significant at the threshold level.

AMIA Test: Threshold set to 0.1; “unit” equals all tokens (or instances) in a record

N. Chinchor. The statistical significance of the MUC-4 Results. In Proceedings of the 4th Conference on Message Understanding. 1992: 30-50.
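
The randomization procedure can be sketched as follows. Several choices here are assumptions made for illustration: each unit is a `(true positives, false negatives)` pair for one record, the metric is recall, responses are swapped between the two systems with probability 0.5 per unit, and ties are counted (`>=`), so that two identical systems are never reported as significantly different.

```python
import random


def recall(units):
    """Pooled recall over units given as (true positives, false negatives)."""
    tp = sum(u[0] for u in units)
    fn = sum(u[1] for u in units)
    return tp / (tp + fn) if tp + fn else 0.0


def randomization_test(sys_a, sys_b, metric, n_shuffles=9999, seed=0):
    """Approximate randomization: returns s = (n + 1) / (N + 1)."""
    rng = random.Random(seed)
    actual = abs(metric(sys_a) - metric(sys_b))
    n = 0
    for _ in range(n_shuffles):
        pseudo_a, pseudo_b = [], []
        for unit_a, unit_b in zip(sys_a, sys_b):
            if rng.random() < 0.5:          # swap this unit's responses
                unit_a, unit_b = unit_b, unit_a
            pseudo_a.append(unit_a)
            pseudo_b.append(unit_b)
        if abs(metric(pseudo_a) - metric(pseudo_b)) >= actual:
            n += 1
    return (n + 1) / (n_shuffles + 1)


# Two systems scored on the same 30 records (invented numbers).
strong = [(10, 0)] * 30
weak = [(0, 10)] * 30
s = randomization_test(strong, weak, recall, n_shuffles=999)
print("significant at 0.1" if s <= 0.1 else "explained by chance")
```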


Competition Systems

O. Uzuner, et al. J Am Med Inform Assoc. 2007 Sep-Oct; 14(5): 550–563


Overall

O. Uzuner, et al. J Am Med Inform Assoc. 2007 Sep-Oct; 14(5): 550–563

[Figures: Instance-Level Comparison and Token-Level Comparison of the competition systems; rule-based systems are labeled “Rules”]

Some Observations

◼ Machine learning outperforms rule-based systems

◼ Performance is never optimal

E.g., phone numbers with strange formats

◼ Machine learning can overfit

E.g., training on Mr. Smith / Mrs. Jones, when the test set is John Smith / Jane Jones

Software: From Theory to Practice with Conditional Random Fields

HIDE (Gardner & Xiong 2009)

MIST (Aberdeen et al 2010)

J. Gardner, L. Xiong. An integrated framework for de-identifying unstructured medical data. Data and Knowledge Engineering (DKE), 2009; 68(12).

J. Aberdeen, et al. The MITRE Identification Scrubber Toolkit: design, training, and assessment. Int J Med Inform. 2010;79(12):849-59.


MIST Installation & Training

CRFs (MIST) Beyond AMIA (Aberdeen et al. 2010)

◼ Vanderbilt’s EMR (No Name or Place Dictionaries invoked)

◼ Specialized version of DE-ID provides the “gold standard”

◼ De-identification model based on CRFs

◼ Four document classes: Discharge Summaries (DS), Letters, Labs, Orders

| | Discharge | Laboratory | Letter | Order | All |
|---|---|---|---|---|---|
| Train | 200 | 400 | 200 | 400 | 1200 |
| Test | 50 | 100 | 50 | 100 | 300 |
| Precision | 0.946 | 0.905 | 0.931 | 0.993 | 0.943 |
| Recall | 0.986 | 0.966 | 0.956 | 0.999 | 0.978 |

Precision: 0.91 – 0.99; Recall: 0.95 – 0.99

Findings With Diverse EMRs

◼ Another alternative: MCRF (Mallet Conditional Random Field) - Cincinnati Children’s Hospital

◼ Not all types of identifiers are found at the same rate

◼ ~3500 clinical notes over 22 note types, with > 30,000 identifiers

Virtually indistinguishable from human de-identification

L. Deleger, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc. 2013 Jan 1;20(1):84-94.

Negligible Impact on Medication Extraction (Deleger et al. 2013)

◼ CRF Scrubbing @ Cincinnati

◼ ~3500 clinical notes over 22 note types

| | Original Notes | Scrubbed Notes |
|---|---|---|
| Precision | 96.3 | 96.3 – 96.5 |
| Recall | 89.3 | 88.9 – 89.5 |
| F-measure | 92.6 | 92.5 – 92.7 |


Can you Trust Synthetic Results?

◼ AMIA challenge used resynthesized data

◼ Do the results hold true on real data?

◼ What are the limits of resynthesis?

An Experimental Model (Yeniterzi et al, 2010)

◼ OO: de-identification model was trained and tested with original medical records

replicates ideal training and evaluation

◼ RR: model was trained and tested with resynthesized medical records

replicates AMIA evaluation

◼ OR: model was trained with original and tested on resynthesized

replicates ideal training and evaluation

◼ RO: model was trained with resynthesized and tested with original

replicates “off the shelf” application

R. Yeniterzi, et al. Effects of personal identifier resynthesis on clinical text de-identification. Journal of the American Medical Informatics Association. 2010; 17: 59-68.


Environment (Yeniterzi et al, 2010)

◼ Vanderbilt’s EMR

◼ Specialized version of DE-ID provides the “gold standard”

◼ De-identification model based on CRFs in MIST

◼ Resynthesis is improved model of AMIA (more realistic in replaced terms)

◼ Four document classes: Discharge Summaries (DS), Letters, Labs, Orders

◼ Fifth class (HYBRID) uses 50 documents from each class for training, and all test documents

| Record Class | Evaluation: Train | Evaluation: Test |
|---|---|---|
| DS | 200 | 50 |
| LETTER | 200 | 50 |
| LAB | 400 | 100 |
| ORDER | 400 | 100 |
| HYBRID | 200 | 300 |

A Systemic Analysis (Yeniterzi et al, 2010)

[Diagram: original records are annotated (DE-ID + relinking) and passed through the resynthesis engine, giving original and resynthesized train/test sets. The MITRE Model Builder (Carafe) produces a MIST original model and a MIST resynthesized model, and Experiments 1 to 4 yield the O-O, R-R, O-R, and R-O scores.]

| Experiment | Record Class | Recall | Precision | F-Measure | Accuracy | PHI Exposure (1 − label-blind recall) |
|---|---|---|---|---|---|---|
| OO | DS | 0.986 | 0.946 | 0.966 | 0.993 | 0.014 |
| OO | LAB | 0.966 | 0.905 | 0.935 | 0.983 | 0.034 |
| OO | LETTER | 0.956 | 0.931 | 0.944 | 0.986 | 0.040 |
| OO | ORDER | 0.999 | 0.993 | 0.996 | 0.999 | 0.001 |
| OO | AGGREGATE | 0.978 | 0.943 | 0.960 | 0.990 | 0.022 |
| OO | HYBRID | 0.962 | 0.925 | 0.943 | 0.986 | 0.035 |
| RR | DS | 0.986 | 0.972 | 0.979 | 0.998 | 0.010 |
| RR | LAB | 0.995 | 0.991 | 0.993 | 0.999 | 0.005 |
| RR | LETTER | 0.965 | 0.962 | 0.963 | 0.996 | 0.032 |
| RR | ORDER | 0.990 | 0.989 | 0.989 | 0.999 | 0.010 |
| RR | AGGREGATE | 0.983 | 0.977 | 0.980 | 0.998 | 0.014 |
| RR | HYBRID | 0.970 | 0.960 | 0.965 | 0.997 | 0.022 |
| OR | DS | 0.871 | 0.919 | 0.894 | 0.990 | 0.101 |
| OR | LAB | 0.731 | 0.843 | 0.783 | 0.987 | 0.268 |
| OR | LETTER | 0.832 | 0.910 | 0.869 | 0.987 | 0.155 |
| OR | ORDER | 0.788 | 0.984 | 0.875 | 0.992 | 0.212 |
| OR | AGGREGATE | 0.816 | 0.913 | 0.862 | 0.989 | 0.171 |
| OR | HYBRID | 0.842 | 0.911 | 0.875 | 0.990 | 0.147 |
| RO | DS | 0.674 | 0.887 | 0.766 | 0.961 | 0.324 |
| RO | LAB | 0.348 | 0.723 | 0.470 | 0.899 | 0.652 |
| RO | LETTER | 0.769 | 0.852 | 0.808 | 0.955 | 0.224 |
| RO | ORDER | 0.766 | 0.834 | 0.799 | 0.926 | 0.234 |
| RO | AGGREGATE | 0.642 | 0.841 | 0.728 | 0.942 | 0.355 |
| RO | HYBRID | 0.404 | 0.789 | 0.535 | 0.914 | 0.592 |


Readings for Next Lecture

◼ E. Ratliff. Writer Evan Ratliff tried to vanish: here's what happened. Wired. November 20, 2009.

◼ V. Blue. Strava's fitness heatmaps are a "potential catastrophe". Engadget. February 2, 2018.

Optional

◼ A. Acquisti and R. Gross. Predicting Social Security numbers from public data. Proceedings of the National Academy of Sciences USA. 2009; 106(27): 10975-10980.

◼ V. Griffith and M. Jakobsson. Messin' with Texas: Deriving mother's maiden names using public records. Proceedings of the Applied Cryptography and Network Security Conference. 2005: 91-103.

◼ S. Munson, et al. Attitudes towards online availability of US public records. Proceedings of the 12th Annual International Digital Government Research Conference. 2011: 2-9.

◼ G. Friedland, et al. Sherlock Holmes' evil twin: on the impact of global inference for online privacy. Proceedings of the New Security Paradigms Workshop. 2011: 105-114.
