Clinical Text Processing with Open Source Software
Will Thompson, Ph.D.
Senior Research Associate
Center for Genetic Medicine
Northwestern University
Contributors: Abel Kho, Eric Just, Jen Pacheco, Arun Muthalagu, Joel Humowiecki
March 10, 2011
Will Thompson, Ph.D. (Northwestern University) Clinical Text Processing 1 / 16
Northwestern Medicine
NW Memorial Hospital (NMH)
NW Medical Faculty Foundation (NMFF)
NW Memorial Physicians Group (NMPG)
NWU Feinberg School of Medicine (FSM)
900 inpatient beds, 60K inpatient admissions/year
Over 600 physicians, 627K office visits/year
2911 active biomedical IRB studies at FSM
EMR Data at Northwestern
The Enterprise Data Warehouse (EDW) is the centralized repository for data in the Northwestern medical campus
Mission: Collect, integrate, and disseminate data to the Northwestern campus
Microsoft SQL Server implementation with ~10 terabytes of data across 7,000 tables, including ~0.25 terabytes of text
http://edw.northwestern.edu
Recovering Information from Text
Information Extraction
  Information Extraction turns the unstructured information embedded in texts into structured data.
  The structured data that is generated can be viewed as metadata annotations on the original text.
  These annotations can then be used to populate fields in a structured database.
  Once information is encoded in this way, we can use it just like any other structured information in the database, for example for doing research or as input to quality improvement tools.
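As a rough illustration of this idea, the following Python sketch stores extracted annotations as rows in a relational table and then queries them like any other structured data. It uses the standard-library sqlite3 module purely as a stand-in for a real warehouse; the table name, column names, and sample rows are all hypothetical and are not taken from the EDW.

```python
import sqlite3

# Hypothetical annotations from some NLP pipeline. Each row points back into
# its source note through character offsets (metadata on the original text).
ANNOTATIONS = [
    # (note_id, begin_offset, end_offset, concept, negated)
    (1, 16, 25, "pneumonia", 0),
    (2, 26, 35, "pneumonia", 1),
]


def load_annotations(conn, rows):
    """Create an illustrative annotation table and insert the rows."""
    conn.execute(
        "CREATE TABLE note_annotation ("
        "note_id INTEGER, begin_offset INTEGER, end_offset INTEGER, "
        "concept TEXT, negated INTEGER)"
    )
    conn.executemany("INSERT INTO note_annotation VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def notes_asserting(conn, concept):
    """Return the ids of notes that mention the concept without negation."""
    cur = conn.execute(
        "SELECT DISTINCT note_id FROM note_annotation "
        "WHERE concept = ? AND negated = 0 ORDER BY note_id",
        (concept,),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load_annotations(conn, ANNOTATIONS)
print(notes_asserting(conn, "pneumonia"))
```

Keeping the offsets alongside each concept means a query result can always be traced back to the exact span of text that produced it.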
Text Processing Components
Task                        Example
Tokenization                “The patient has pneumonia.” ⇒ { The, patient, has, pneumonia, . }
Spelling correction         pnumonia ⇒ pneumonia
POS tagging                 { The, patient, has, pneumonia } ⇒ { DT, NN, VB, NN }
Chunking                    { DT, NN, VB, NN } ⇒ { [NP DT NN], VB, [NP NN] }
Parsing                     { DT, NN, VB, NN } ⇒ [S [NP DT NN] [VP VB [NP NN]]]
Word sense disambiguation   ‘PT’ ⇒ (patient) or (physical therapy)
Reference resolution        “The patient has pneumonia” ⇒ ‘The patient’ = patient123

Table: Text processing component tasks
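To make the first few rows of the table concrete, here is a minimal Python sketch of tokenization, POS tagging, and chunking on the example sentence. It is a toy, not how any of the frameworks discussed here implement these tasks: the regular expression, the four-word lexicon, and the chunking rule (an optional DT followed by one or more NN) are assumptions chosen only to reproduce the table's example.

```python
import re

# Hypothetical toy lexicon using the tags from the table; it covers only the
# example sentence.
LEXICON = {"the": "DT", "patient": "NN", "has": "VB", "pneumonia": "NN"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def pos_tag(tokens):
    """Look each token up in the toy lexicon; unknown words default to NN."""
    return [LEXICON.get(tok.lower(), "NN") for tok in tokens]


def chunk(tags):
    """Group an optional DT followed by one or more NN tags into an NP chunk."""
    chunks, i = [], 0
    while i < len(tags):
        j = i + 1 if tags[i] == "DT" else i  # skip an optional determiner
        k = j
        while k < len(tags) and tags[k] == "NN":
            k += 1
        if k > j:  # at least one noun: emit an NP spanning tags[i:k]
            chunks.append("[NP " + " ".join(tags[i:k]) + "]")
            i = k
        else:  # no noun here: emit the single tag unchanged
            chunks.append(tags[i])
            i += 1
    return chunks


tokens = tokenize("The patient has pneumonia.")
words = [t for t in tokens if t.isalnum()]  # drop punctuation before tagging
tags = pos_tag(words)
print(tokens)
print(tags)
print(chunk(tags))
```

Production systems replace the lexicon lookup and the hand-written rule with trained statistical models, but the input and output shapes are the same as in the table.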
Text Processing Software
Open Source Frameworks
  Unstructured Information Management Architecture (UIMA)
  General Architecture for Text Engineering (GATE)
Specialized for Clinical Text
  MedLEE (Columbia University)
  MetaMap (National Library of Medicine)
  KnowledgeMap and MedEx (Vanderbilt)
  HITEx: GATE components (I2B2 Project)
  cTAKES: UIMA components (Mayo Clinic; OHNLP)
Text Processing Pipelines
Figure: From Hahn et al. (2008)
cTAKES components (Savova et al., 2010b)
Sentence Detector → Tokenizer → Normalizer → POS Tagger → Chunker → NE Recognizer → Negation/Status Marker
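The chain of components listed above can be pictured as a sequence of functions that each add annotations to a shared document record. The Python sketch below is only an analogy: it is not the cTAKES or UIMA API, it leaves out the POS tagger and chunker stages, and the term dictionary and negation cue list are hypothetical stand-ins for the real resources.

```python
import re

NEGATION_CUES = {"no", "not", "denies", "without"}  # tiny illustrative cue list
DISEASE_TERMS = {"pneumonia", "stenosis"}            # hypothetical dictionary


def detect_sentences(doc):
    """Split the text after sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [s.strip() for s in parts if s.strip()]
    return doc


def tokenize(doc):
    """Split each sentence into word and punctuation tokens."""
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


def normalize(doc):
    """Lowercase tokens (real normalizers also handle inflection and variants)."""
    doc["norm"] = [[t.lower() for t in sent] for sent in doc["tokens"]]
    return doc


def recognize_entities(doc):
    """Dictionary lookup standing in for named-entity recognition."""
    doc["entities"] = [
        {"sentence": i, "token": j, "term": tok}
        for i, sent in enumerate(doc["norm"])
        for j, tok in enumerate(sent)
        if tok in DISEASE_TERMS
    ]
    return doc


def mark_negation(doc):
    """Flag an entity as negated if a cue word precedes it in its sentence."""
    for ent in doc["entities"]:
        preceding = doc["norm"][ent["sentence"]][: ent["token"]]
        ent["negated"] = any(tok in NEGATION_CUES for tok in preceding)
    return doc


PIPELINE = [detect_sentences, tokenize, normalize, recognize_entities, mark_negation]


def run_pipeline(text):
    """Pass one document record through every stage in order."""
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = run_pipeline("The patient has pneumonia. There is no stenosis.")
for ent in result["entities"]:
    print(ent["term"], "negated" if ent["negated"] else "affirmed")
```

Because every stage reads and writes the same record, a stage can be swapped or tuned without touching its neighbours, which is the main appeal of the pipeline design.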
Integrating NLP Output with the EDW
Figure: ETL process for creating a data mart (Just and Thompson, 2010)
Porting a cTAKES Application
Identifying Peripheral Arterial Disease
  Project of the NHGRI-funded electronic MEdical Records and GEnomics (eMERGE) network (www.gwas.net).
  Project goal: leverage the EMR to discover genetic variants influencing susceptibility to PAD.
  Mayo Clinic developed an algorithm that combines structured data with information extracted from radiology reports to classify patients as PAD positive or negative (Kullo et al., 2010).
  NLP component implemented as a cTAKES (UIMA) PAD pipeline (Savova et al., 2010a).
PAD Dataset
The output of text processing was used to perform document-level classification of radiology reports (Savova et al., 2010a).
Contribution of NLP to Results
Porting the Algorithm to Northwestern
  DEFINITE cases are those for which two or more criteria are true.
  PROBABLE cases are those for which one criterion is true.
  Adding NLP to the set of criteria resulted in a 40% increase in the number of DEFINITE PAD cases, and a 20% increase in the number of PROBABLE PAD cases.
  On 100 chart-reviewed patients, we achieved PPV = 0.95, NPV = 1.0.
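The case definitions and the two evaluation measures on this slide can be written down in a few lines of Python. This is a sketch under stated assumptions: the criterion names, the NEGATIVE label for patients meeting no criteria, and the confusion-matrix counts are all hypothetical and are not the study's actual criteria or results.

```python
def classify_case(criteria):
    """Map a dict of boolean criteria to DEFINITE, PROBABLE, or NEGATIVE."""
    met = sum(1 for value in criteria.values() if value)
    if met >= 2:
        return "DEFINITE"
    if met == 1:
        return "PROBABLE"
    return "NEGATIVE"  # assumed label; the slide does not name this class


def ppv(tp, fp):
    """Positive predictive value: fraction of predicted positives that are true."""
    return tp / (tp + fp)  # assumes at least one predicted positive


def npv(tn, fn):
    """Negative predictive value: fraction of predicted negatives that are true."""
    return tn / (tn + fn)  # assumes at least one predicted negative


# Hypothetical patient: one structured criterion is met.
structured_only = {"diagnosis_code": True, "procedure_code": False}
# The same patient once an NLP-derived criterion is added to the set.
with_nlp = dict(structured_only, nlp_radiology_positive=True)

print(classify_case(structured_only))
print(classify_case(with_nlp))

# Hypothetical chart-review counts, for illustration only.
print(ppv(tp=8, fp=2), npv(tn=9, fn=1))
```

The example patient shows the mechanism behind the reported increases: a patient who satisfies a single structured criterion moves from PROBABLE to DEFINITE once a text-derived criterion is also true.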
Porting a HITEX Application
Identifying Rheumatoid Arthritis
  Project of the NIH-funded Pharmacogenomics Research Network (PGRN) (http://www.nigms.nih.gov/Initiatives/PGRN).
  Project goal: leverage the EMR to discover genetic variants influencing treatment outcomes for rheumatoid arthritis.
  Partners Healthcare System developed a logistic regression algorithm that combines structured data with information extracted from various types of clinical notes to classify patients as RA positive or negative.
  NLP component implemented as a HITEx (GATE) pipeline (Zeng et al., 2006).
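To show how a logistic regression classifier can mix structured and text-derived variables, here is a small Python sketch. Everything specific in it is an assumption made for illustration: the feature names, the coefficients, the intercept, and the 0.5 threshold are invented, and a real model would be fit to labeled training data.

```python
import math

# Hypothetical coefficients, not values from any published model.
WEIGHTS = {
    "dx_code_count": 0.8,         # structured: number of diagnosis codes
    "rx_indicator": 1.2,          # structured: relevant medication (0 or 1)
    "nlp_mention_count": 0.5,     # NLP: disease mentions found in notes
    "nlp_finding_indicator": 1.0, # NLP: supporting finding in notes (0 or 1)
}
INTERCEPT = -4.0


def ra_probability(features):
    """Logistic model: sigmoid of the weighted feature sum plus intercept."""
    z = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(features, threshold=0.5):
    """Label a patient by comparing the predicted probability to a threshold."""
    return "RA positive" if ra_probability(features) >= threshold else "RA negative"


# Hypothetical patient described by structured data only ...
structured_only = {"dx_code_count": 2, "rx_indicator": 1}
# ... and the same patient with text-derived features added.
with_nlp = dict(structured_only, nlp_mention_count=3, nlp_finding_indicator=1)

print(classify(structured_only), round(ra_probability(structured_only), 3))
print(classify(with_nlp), round(ra_probability(with_nlp), 3))
```

Missing features default to zero here, so the same function can score a patient with or without the NLP-derived variables.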
RA Dataset
Sample variables from both structured and textual sources. Data from 6,126 patients, including approximately 156,000 clinical notes.
Contribution of NLP to the Results
Comparison of classification algorithms (Liao et al., 2010). NLP + structured data performed best.
Conclusion
Key Message
  With the release of frameworks like cTAKES and HITEx, open-source clinical text processing is easier than ever.
Lessons Learned
  Adding NLP to phenotype identification algorithms can dramatically improve both precision and recall over using structured codes alone.
  Knowledge of Java is essential for porting UIMA and GATE clinical text processing pipelines to a new institution.
  Knowledge of how NLP component tasks work is useful for performance optimization.
Questions?
Will [email protected]
Northwestern University
Feinberg School of Medicine
U. Hahn, E. Buyko, R. Landefeld, M. Mühlhausen, M. Poprat, K. Tomanek, and J. Wermter. An overview of JCoRe, the JULIE Lab UIMA component repository. In Proceedings of the LREC’08 Workshop, Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP, Marrakech, Morocco, 2008.
Eric Just and Will Thompson. Implementing a workflow for clinical text in an enterprise data warehouse. In Proceedings of the American Medical Informatics Association Annual Symposium, 2010.
Iftikhar J Kullo, Jin Fan, Jyotishman Pathak, Guergana K Savova, Zeenat Ali, and Christopher G Chute. Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. Journal of the American Medical Informatics Association, 17:568–574, 2010.
Katherine P. Liao, Tianxi Cai, Vivian Gainer, Sergey Goryachev, Qing Zeng-Treitler, Soumya Raychaudhuri, Peter Szolovits, Susanne Churchill, Shawn Murphy, Isaac Kohane, Elizabeth Karlson, and Robert M. Plenge. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care and Research, 62(8):1120–1127, August 2010.
Guergana K Savova, Jin Fan, Zi Ye, Sean Murphy, Jiaping Zheng, Christopher G Chute, and Iftikhar J Kullo. Discovering peripheral arterial disease cases from radiology notes using natural language processing. In Proceedings of the American Medical Informatics Association Symposium, 2010a.
Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513, 2010b.
QT Zeng, S. Goryachev, S. Weiss, M. Sordo, SN Murphy, and R Lazarus. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making, 6(30), 2006.