Automatic Detection of Inconsistencies between Free Text and Coded Data in Sarcoma Discharge Letters
Ruty Rinott (a), Michele Torresani (b), Rossella Bertulli (b), Abigail Goldsteen (a), Paolo Casali (b), Boaz Carmeli (a) and Noam Slonim (a)
(a) IBM Haifa Research Labs, 165 Aba Hushi st., Haifa 31905, Israel
(b) Fondazione IRCCS - Istituto Nazionale dei Tumori, via Venezian, 1, Milano, Italy
Oral Presentation MIE
Pisa, August 2012
IBM Research - Haifa
Free Text Vs. Coded Fields in EHRs
§ Data in Electronic Health Records (EHR) can be stored in free text or coded fields
§ Coded fields are useful for querying, mining, analyzing and sharing data
§ Free text has more expressive power, ease of use
Diagnosis
myxoid liposarcoma of the right arm, locally extended, with areas cellulate> 5%, 7.0 cm in diameter.
ICDO-T Connective, soft tissues of upper limb
C4.2
Patient Name John Doe
Gender Male
Electronic Health Record
Free Text Vs. Coded Fields in EHRs
§ Often both coded and free text fields are used to store the same type of information
   § Diagnosis
   § Treatment
   § …
Diagnosis
myxoid liposarcoma of the right foot, locally extended, with areas cellulate> 5%, 7.0 cm in diameter.
ICDO-T Connective, soft tissues of upper limb
C4.2
Patient Name John Doe
Gender Male
Electronic Health Record
§ This enables discordance in EHR data
§ Discordance may have devastating effects on patient care
§ Singh et al. 2009:
   § Of 56,000 prescriptions, 1% contained inconsistencies.
   § 20% of errors could have caused moderate to severe harm
Automatic inconsistency detection
§ Previous work relied on extensive manual review by domain experts to establish that such inconsistencies are prevalent (Singh et al., Stein et al.)
§ We suggest an automatic method to identify potential inconsistencies between a coded field and the free text field(s) that store overlapping information.
Diagnosis
myxoid liposarcoma of the right foot, locally extended, with areas cellulate> 5%, 7.0 cm in diameter.
ICDO-T Connective, soft tissues of upper limb
C4.2
Patient Name John Doe
Gender Male
Electronic Health Record
Solution outline
[Figure: a text classifier reads the free-text diagnosis "myxoid liposarcoma of the right foot, locally extended, with areas cellulate> 5%, 7.0 cm in diameter." and predicts code C3.1 (Connective, soft tissues of lower limb); the EHR stores ICDO-T code C4.2 (Connective, soft tissues of upper limb). Potential inconsistency!]
§ Train a machine learning classifier that predicts the most likely code from the free text data.
§ To determine whether record X has inconsistencies:
   § Use the classifier to predict the code for record X
   § Compare the prediction with the code stored in the record.
§ The final decision on whether a disagreement results from an inconsistency or from a classification mistake remains in the hands of the domain expert
§ Highlighting potential inconsistencies dramatically reduces the number of records she needs to examine.
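The predict-and-compare loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the one-nearest-neighbour classifier on word overlap, the labels `upper_limb` and `lower_limb`, and all sample texts are invented for the example.

```python
def tokenize(text):
    """Lowercase a text and return its set of punctuation-free words."""
    words = (w.strip(".,:;()%>") for w in text.lower().split())
    return {w for w in words if w}


class NearestNeighbourClassifier:
    """Stand-in classifier (an assumption): returns the code of the
    training text with the highest Jaccard word overlap."""

    def fit(self, texts, codes):
        self.examples = [(tokenize(t), c) for t, c in zip(texts, codes)]
        return self

    def predict(self, text):
        words = tokenize(text)

        def similarity(example):
            example_words, _ = example
            return len(words & example_words) / len(words | example_words)

        return max(self.examples, key=similarity)[1]


def flag_inconsistencies(records, classifier):
    """Return the records whose stored code differs from the predicted one."""
    return [r for r in records if classifier.predict(r["text"]) != r["code"]]


# Invented training examples (free text, code).
training = [
    ("liposarcoma of the left arm", "upper_limb"),
    ("lesions of the right forearm", "upper_limb"),
    ("sarcoma of the right thigh", "lower_limb"),
    ("liposarcoma of the right foot", "lower_limb"),
]
classifier = NearestNeighbourClassifier().fit(*zip(*training))

# Invented records to check; the first stores a code that contradicts its text.
records = [
    {"text": "myxoid liposarcoma of the right foot, locally extended", "code": "upper_limb"},
    {"text": "sarcoma of the left arm", "code": "upper_limb"},
]
flagged = flag_inconsistencies(records, classifier)
for record in flagged:
    print("Potential inconsistency:", record["text"])
```

Only the flagged records go to the domain expert, who decides whether the code or the classifier is wrong.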
How to train a classifier
§ Training a classifier requires “training data” – “good examples” of what you want your classifier to learn
[Figure: training examples 1..N, each a text with its code, are fed to a text classifier]
1. Sept 07: appearance of lesions at a distance to the right thigh. Code C476
2. November 08: Wide demolition loggia anteromedial thigh block with right superficial femoral vein. Code C523
...
N. appearance of metastases in the right buttock. Code C476
New text -> Text classifier -> Code
How to train a classifier?
§ Use the medical records which we wish to examine as training data.
§ Assumption: in most records the free text agrees with the coded data
§ Note: some fraction of the training data will have mismatched codes (inconsistent records)
   § Overcome by 2 rounds of training
[Figure: EHR records (free-text diagnosis plus ICDO-T code) feed a text classifier in four steps: Train, Classify (e.g. Code C49), Compare & Filter, Re-Train.]
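The four steps (Train, Classify, Compare & Filter, Re-Train) could look roughly like the sketch below. It is an illustration under stated assumptions: the small Naive Bayes classifier, the labels `upper_limb` and `lower_limb`, and the toy records are all invented, and the first-round classifier is simply applied to the same records it was trained on, which the slides do not spell out.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Small multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, codes):
        self.code_counts = Counter(codes)
        self.word_counts = defaultdict(Counter)
        for text, code in zip(texts, codes):
            self.word_counts[code].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def log_score(code):
            counts = self.word_counts[code]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.code_counts[code])
            for word in text.lower().split():
                if word in self.vocabulary:
                    score += math.log((counts[word] + 1) / denominator)
            return score

        return max(self.code_counts, key=log_score)


def two_round_training(records):
    """Train, classify, compare and filter, then re-train on agreeing records."""
    texts, codes = zip(*records)
    first = NaiveBayes().fit(texts, codes)            # Train
    agreeing = [(t, c) for t, c in records
                if first.predict(t) == c]             # Classify, Compare & Filter
    suspects = [r for r in records if r not in agreeing]
    second = NaiveBayes().fit(*zip(*agreeing))        # Re-Train
    return second, suspects


# Invented toy records: five per site, plus one record with a mismatched code.
sites = {"arm": "upper_limb", "foot": "lower_limb"}
findings = ["sarcoma of the right", "sarcoma of the left", "lesion of the right",
            "lesion of the left", "liposarcoma of the right"]
records = [(f"{finding} {site}", code)
           for site, code in sites.items() for finding in findings]
mislabeled = ("sarcoma of the right foot", "upper_limb")
records.append(mislabeled)

final_classifier, suspects = two_round_training(records)
print("Filtered out before re-training:", suspects)
```

The second classifier never sees the filtered records, so a mismatched code can no longer teach it the wrong association.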
Ensemble Learning
§ How to improve classification performance?
§ Get a second opinion!
   § Work with an ensemble of classifiers
      § Naïve Bayes (NB)
      § K-Nearest Neighbors (KNN)
      § Multi-Class decision trees (MDT)
§ High recall: require at least one classifier to disagree with the EHR code
§ High precision: require all classifiers to disagree with the EHR code and agree with one another
§ In the following results we required high precision
[Figure: example comparing the code in the EHR (C49) with the classifications of NB, KNN and MDT (C35 or C49).]
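The two decision rules can be written as small predicates. The sketch below assumes each classifier's prediction is available in a dictionary keyed by classifier name; the function names are invented, and the vote combinations reuse the codes C49 and C35 from the figure purely as illustration.

```python
def high_recall_flag(ehr_code, predictions):
    """Flag when at least one classifier disagrees with the EHR code."""
    return any(code != ehr_code for code in predictions.values())


def high_precision_flag(ehr_code, predictions):
    """Flag only when every classifier disagrees with the EHR code
    and all classifiers predict the same code."""
    predicted = set(predictions.values())
    return len(predicted) == 1 and ehr_code not in predicted


ehr_code = "C49"
split_vote = {"NB": "C35", "KNN": "C49", "MDT": "C35"}
unanimous_vote = {"NB": "C35", "KNN": "C35", "MDT": "C35"}

# A split vote is flagged only under the high-recall rule.
print(high_recall_flag(ehr_code, split_vote), high_precision_flag(ehr_code, split_vote))
# A unanimous disagreement is flagged under both rules.
print(high_recall_flag(ehr_code, unanimous_vote), high_precision_flag(ehr_code, unanimous_vote))
```

The high-precision rule flags fewer records, which keeps the expert's review list short at the cost of missing some inconsistencies.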
Data
§ Anonymized discharge letters of Soft Tissue Sarcoma patients treated at the Italian National Cancer Institute in Milan (INT).
§ 734 discharge letters spanning 456 treatment programs.
§ Part of a work on the Cli-G decision support system (Wed. 1000 Fermi)

Coded Field | Free text field(s) | # of instances | # of distinct words | # of distinct codes
Presentation (clinical status) | Presentation text; Disease extension; Clinical Summary | 261 | 2967 | 2
ICDO-T (Primary anatomic site) | Disease extension; Diagnostic text; Oncological history | 410 | 3792 | 15
ICDO-M (Morphology) | Diagnostic text | 435 | 385 | 11
Treatment program (TP) | Treatment; Treatment program | 128 | 633 | 8
RECIST | Clinical Summary | 218 | 1406 | 5
Results
§ Classification results

Coded Field | Precision ensemble | Recall ensemble
Presentation | 0.98 | 0.77
ICDO-T | 0.93 | 0.54
ICDO-M | 0.96 | 0.73
Treatment program | 0.64 | 0.34
RECIST | 0.83 | 0.36
Results
§ Manual validation of inconsistencies

Coded Field | Cases predicted as inconsistent | True | Not enough info | Precision | Precision using top 50% of cases
Presentation | 5 | 3 | 0 | 0.67 | 0.67
ICDO-T | 17 | 5 | 0 | 0.29 | 0.57
ICDO-M | 14 | 6 | 0 | 0.43 | 0.86
TP | 18 | 15 | 0 | 0.83 | 0.75
RECIST | 16 | 4 | 7 | 0.44 | 0.57
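The precision figures of the manual validation can be reproduced with a short calculation. The definition used here is an assumption inferred from the numbers: confirmed inconsistencies divided by the flagged cases the expert could judge, so cases with not enough information are left out of the denominator. It matches the ICDO-T, ICDO-M, TP and RECIST rows; the Presentation row (3 true out of 5) would give 0.60 rather than the listed 0.67, so it is not included.

```python
def precision(true_inconsistencies, predicted, not_enough_info=0):
    """Share of judged cases that the expert confirmed as inconsistent.

    Assumption: cases with not enough information are excluded
    from the denominator.
    """
    judged = predicted - not_enough_info
    return true_inconsistencies / judged


# (predicted as inconsistent, true, not enough info), copied from the table.
validation = {
    "ICDO-T": (17, 5, 0),
    "ICDO-M": (14, 6, 0),
    "TP": (18, 15, 0),
    "RECIST": (16, 4, 7),
}
for field, (predicted, true, no_info) in validation.items():
    print(field, round(precision(true, predicted, no_info), 2))
```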
Summary
§ Automatic method to highlight potential inconsistencies between free text and coded fields by classification
§ Can be used for retrospective correction of mistakes; requires validation by a domain expert
§ Can be used for online detection: draw the clinician's attention to potential mistakes as she is filling in the record
Acknowledgements
§ Cli-G team (IBM)
   § Noam Slonim
   § Abigail Goldsteen
   § Boaz Carmeli
§ Istituto Nazionale dei Tumori
   § Michele Torresani
   § Dr. Rossella Bertulli
   § Dr. Paolo Casali
How to Represent Free Text?
§ Classification needs a numerical representation of the text.
§ Bag of Words: a popular representation.
   § Simple to use
   § Does not preserve relations between words.
d = "Leg amputation for sarcoma of the right foot, locally extended. myxoid liposarcoma, with areas cellulate> 5%, 7.0 cm in diameter. Appearance of lesions at a distance to the right thigh and the pelvic. Start new chemotherapy Trabectedin (ET-743)."
BOW(d) = 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
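A bag-of-words vector can be built as in the sketch below. It assumes a binary representation (1 if a vocabulary word occurs in the text, 0 otherwise), in line with the 0/1 vector shown for d; the tokenizer and function names are invented for illustration, and the reordered sentence is made up to show what the representation loses.

```python
def tokenize(text):
    """Lowercase a text and split it into punctuation-free words."""
    words = (w.strip(".,:;()%>") for w in text.lower().split())
    return [w for w in words if w]


def build_vocabulary(documents):
    """Sorted list of every distinct word; fixes the position of each word."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def bag_of_words(text, vocabulary):
    """Binary vector: 1 where the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


documents = [
    "Leg amputation for sarcoma of the right foot, locally extended.",
    "Appearance of lesions at a distance to the right thigh and the pelvic.",
]
vocabulary = build_vocabulary(documents)
d = bag_of_words(documents[0], vocabulary)
print(vocabulary)
print(d)

# Relations between words are not preserved: a reordered sentence with the
# same words maps to exactly the same vector.
shuffled = bag_of_words(
    "Sarcoma of the right leg, for foot amputation extended locally.", vocabulary
)
print(d == shuffled)
```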
Results
Coded Field | Cases predicted as inconsistent | True | Not enough info | Precision | Precision using top 50% of cases
ICDO-M | 14 | 6 | 0 | 0.43 | 0.86

§ The coding version used at the time had no ICDO-M code for Fibromyxosarcoma
§ Out of 26 cases with a Fibromyxosarcoma-related diagnosis, only 6 used the correct code: "Sarcoma, not otherwise specified"
§ The other 20 used the code "Malignant fibrous histiocytoma"
§ As a result, the classifiers incorrectly learned to classify Fibromyxosarcoma as "Malignant fibrous histiocytoma", leading to 6 wrong inconsistency identifications
§ Pathological case where the mistake is more common than the correct code
§ A code for Fibromyxosarcoma was added in the last version.
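The effect of a mistake that outnumbers the correct code can be illustrated with the counts reported above. The sketch idealizes the trained classifiers as a majority vote over the recorded codes, which is an assumption made only for this example.

```python
from collections import Counter

# Codes recorded for the 26 Fibromyxosarcoma-related cases, as reported above.
recorded = (["Malignant fibrous histiocytoma"] * 20
            + ["Sarcoma, not otherwise specified"] * 6)

# Idealized stand-in for the trained classifiers: always predict the majority code.
learned_code = Counter(recorded).most_common(1)[0][0]

# Records whose stored code disagrees with the prediction get flagged.
flagged = [code for code in recorded if code != learned_code]
print(learned_code, len(flagged))
```

Under this idealization the six correctly coded records are exactly the ones flagged, which matches the six wrong identifications described above.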
[Figure: two rounds of training. Classifier I is trained on the EHR free text (data X) and codes (labels Y). For each record i it outputs a prediction and confidence (Ŷi, Ci); comparing the prediction with the label Yi marks potentially mislabeled instances. The classifier is then trained again on the filtered training data.]
Results
§ Classification results

Coded Field | Precision best method | Precision ensemble | Recall ensemble
Presentation | 0.95 (DT) | 0.98 | 0.77
ICDO-T | 0.74 (NB) | 0.93 | 0.54
ICDO-M | 0.91 (DT) | 0.96 | 0.73
Treatment program | 0.59 (NB) | 0.64 | 0.34
RECIST | 0.60 (DT) | 0.83 | 0.36
Machine Learning 101
§ Learning: "any process by which a system improves performance from experience." (Herbert Simon)
§ Task: classify a text into the correct code
§ Performance: % of correctly classified texts
§ Experience: training data, i.e. texts and their matching codes (labels)
[Figure: training examples 1..N, each a text with its code, are fed to a text classifier]
1. Sept 07: appearance of lesions at a distance to the right thigh. Code C476
2. November 08: Wide demolition loggia anteromedial thigh block with right superficial femoral vein. Code C523
...
N. appearance of metastases in the right buttock. Code C476
New text -> Text classifier -> Code