
Clinical Text Processing with Open Source Software

Will Thompson, Ph.D.

[email protected]

Senior Research Associate
Center for Genetic Medicine

Northwestern University

Contributors: Abel Kho, Eric Just, Jen Pacheco, Arun Muthalagu, Joel Humowiecki

March 10, 2011

Will Thompson, Ph.D. (Northwestern University) Clinical Text Processing 1 / 16


Northwestern Medicine

NW Memorial Hospital (NMH)
NW Medical Faculty Foundation (NMFF)
NW Memorial Physicians Group (NMPG)
NWU Feinberg School of Medicine (FSM)

900 inpatient beds, 60K inpatient admissions/year
Over 600 physicians, 627K office visits/year
2911 active biomedical IRB studies at FSM


EMR Data at Northwestern

The Enterprise Data Warehouse (EDW) is the centralized repository for data in the Northwestern medical campus
Mission: Collect, integrate, and disseminate data to the Northwestern campus
Microsoft SQL Server implementation with ~10 terabytes of data across 7,000 tables, including ~0.25 terabytes of text

http://edw.northwestern.edu


Recovering Information from Text

Information Extraction
Information Extraction turns the unstructured information embedded in texts into structured data.
The structured data that is generated can be viewed as metadata annotations on the original text.
These annotations can then be used to populate fields in a structured database.
Once information is encoded in this way, we can use it just like any other structured information in the database, for example for doing research or as input to quality improvement tools.


Text Processing Components

Task                        Example
Tokenization                “The patient has pneumonia.” ⇒ { The, patient, has, pneumonia, . }
Spelling correction         pnumonia ⇒ pneumonia
POS tagging                 { The, patient, has, pneumonia } ⇒ { DT, NN, VB, NN }
Chunking                    { DT, NN, VB, NN } ⇒ { [NP DT NN], VB, [NP NN] }
Parsing                     { DT, NN, VB, NN } ⇒ [S [NP DT NN] [VP VB [NP NN]]]
Word sense disambiguation   ‘PT’ ⇒ (patient) or (physical therapy)
Reference resolution        “The patient has pneumonia” ⇒ ‘The patient’ = patient123

Table: Text processing component tasks
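
To make the first two rows of the table concrete, here is a minimal Python sketch of tokenization and spelling correction. It is an illustration only: the regular expression and the one-entry `CORRECTIONS` table are assumptions made for this example, not the rules used by any of the systems discussed in this talk.

```python
import re

# Hypothetical correction table for illustration; a real system would rely on
# a clinical lexicon or an edit-distance model instead of a fixed mapping.
CORRECTIONS = {"pnumonia": "pneumonia"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def correct_spelling(tokens):
    """Replace known misspellings (case-insensitive lookup); keep other tokens as-is."""
    return [CORRECTIONS.get(token.lower(), token) for token in tokens]


tokens = tokenize("The patient has pnumonia.")
print(correct_spelling(tokens))
```

Later components (POS tagging, chunking, parsing) consume the token list produced here, which is why tokenization errors tend to propagate through the rest of the processing.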


Text Processing Software

Open Source Frameworks
Unstructured Information Management Architecture (UIMA)
General Architecture for Text Engineering (GATE)

Specialized for Clinical Text
MedLEE (Columbia University)
MetaMap (National Library of Medicine)
KnowledgeMap and MedEx (Vanderbilt)
HITEx: GATE components (I2B2 Project)
cTAKES: UIMA components (Mayo Clinic; OHNLP)


Text Processing Pipelines

Figure: From Hahn et al. (2008)

cTAKES components (Savova et al., 2010b)

Sent. Detector ↦ Tokenizer ↦ Normalizer ↦ POS Tagger ↦ Chunker ↦ NE Recognizer ↦ Negation/Status Marker
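
The pipeline idea can be sketched in a few lines of Python: each component reads a shared document record, adds its own annotations, and passes the record on. This is a toy stand-in under stated assumptions, not the cTAKES or UIMA API; the component functions, the `TERMS` dictionary, and the `NEGATION_CUES` list are all hypothetical, and only four of the seven stages are represented.

```python
import re

TERMS = {"pneumonia", "stenosis"}            # hypothetical term dictionary
NEGATION_CUES = {"no", "without", "denies"}  # hypothetical negation cues


def sentence_detector(doc):
    # Naive rule: split after sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    doc["sentences"] = [s for s in parts if s]
    return doc


def tokenizer(doc):
    # One token list per sentence.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


def ne_recognizer(doc):
    # Dictionary lookup: record the position of every token found in TERMS.
    doc["entities"] = [
        {"sentence": i, "token": j, "term": tok.lower()}
        for i, toks in enumerate(doc["tokens"])
        for j, tok in enumerate(toks)
        if tok.lower() in TERMS
    ]
    return doc


def negation_marker(doc):
    # Mark an entity as negated if a cue word precedes it in the same sentence.
    for ent in doc["entities"]:
        preceding = doc["tokens"][ent["sentence"]][: ent["token"]]
        ent["negated"] = any(t.lower() in NEGATION_CUES for t in preceding)
    return doc


PIPELINE = [sentence_detector, tokenizer, ne_recognizer, negation_marker]


def run_pipeline(text):
    doc = {"text": text}
    for annotator in PIPELINE:
        doc = annotator(doc)
    return doc


doc = run_pipeline("The patient has pneumonia. No stenosis is seen.")
for ent in doc["entities"]:
    print(ent["term"], "negated" if ent["negated"] else "asserted")
```

Because every stage only adds to the shared record, stages can be swapped or reordered as long as each one finds the annotations it depends on, which is the property the frameworks above are designed around.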


Integrating NLP Output with the EDW

Figure: ETL process for creating a data mart (Just and Thompson, 2010)
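
As a rough illustration of the load step in such a process, the sketch below writes NLP annotations into a relational table and queries them like any other structured data. It is not the workflow from Just and Thompson (2010): SQLite stands in for SQL Server so the example is self-contained, and the `note_annotation` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def load_annotations(conn, annotations):
    """Create the (hypothetical) annotation table if needed and bulk-insert rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS note_annotation (
               note_id      INTEGER,
               begin_offset INTEGER,
               end_offset   INTEGER,
               concept      TEXT,
               negated      INTEGER
           )"""
    )
    conn.executemany(
        "INSERT INTO note_annotation VALUES (?, ?, ?, ?, ?)", annotations
    )
    conn.commit()


def notes_asserting(conn, concept):
    """Return ids of notes that mention the concept without negation."""
    rows = conn.execute(
        "SELECT DISTINCT note_id FROM note_annotation "
        "WHERE concept = ? AND negated = 0 ORDER BY note_id",
        (concept,),
    )
    return [note_id for (note_id,) in rows]


# Made-up rows: (note_id, begin_offset, end_offset, concept, negated)
conn = sqlite3.connect(":memory:")
load_annotations(
    conn,
    [(1, 16, 25, "pneumonia", 0), (2, 3, 12, "pneumonia", 1)],
)
print(notes_asserting(conn, "pneumonia"))
```

Storing character offsets alongside each concept keeps the annotation traceable to the span of note text it came from.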


Porting a cTAKES Application

Identifying Peripheral Arterial Disease
Project of the NHGRI-funded electronic MEdical Records and GEnomics (eMERGE) network (www.gwas.net).
Project goals: leverage the EMR to discover genetic variants influencing susceptibility to PAD.
Mayo Clinic developed an algorithm that combines information from both structured data and information extracted from radiology reports to classify patients as PAD positive or negative (Kullo et al., 2010).
NLP component implemented as a cTAKES (UIMA) PAD pipeline (Savova et al., 2010a).


PAD Dataset

The output of text processing was used to perform document-level classification of radiology reports (Savova et al., 2010a).


Contribution of NLP to Results

Porting the Algorithm to Northwestern
DEFINITE cases defined as two or more criteria are true.
PROBABLE cases defined as one criterion is true.
Adding NLP to the set of criteria resulted in a 40% increase in the number of DEFINITE PAD cases, and a 20% increase in the number of PROBABLE PAD cases.
On 100 chart-reviewed patients, we achieved PPV = .95, NPV = 1.0.
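
The case definitions and the two evaluation measures can be written down directly. The sketch below is a hedged illustration: the individual criteria are not listed on the slide, so they are passed in as plain booleans; the `NOT_A_CASE` label for zero criteria is an assumption; and the counts in the usage lines are made-up numbers, not the chart-review data.

```python
def classify_case(criteria):
    """Label a patient by how many criteria are true (per the definitions above)."""
    n_true = sum(bool(c) for c in criteria)
    if n_true >= 2:
        return "DEFINITE"
    if n_true == 1:
        return "PROBABLE"
    return "NOT_A_CASE"  # assumed label; the slide does not name this group


def ppv(true_pos, false_pos):
    """Positive predictive value: fraction of predicted positives that are real."""
    return true_pos / (true_pos + false_pos)


def npv(true_neg, false_neg):
    """Negative predictive value: fraction of predicted negatives that are real."""
    return true_neg / (true_neg + false_neg)


# Illustrative inputs only.
print(classify_case([True, False, True]))
print(ppv(true_pos=8, false_pos=2), npv(true_neg=9, false_neg=1))
```

Adding an NLP-derived criterion to the list can only raise a patient's count of true criteria, which is consistent with the increase in DEFINITE and PROBABLE cases reported above.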


Porting a HITEx Application

Identifying Rheumatoid Arthritis
Project of the NIH-funded Pharmacogenomics Research Network (PGRN) (http://www.nigms.nih.gov/Initiatives/PGRN).
Project goals: leverage the EMR to discover genetic variants influencing treatment outcomes for rheumatoid arthritis.
Partners Healthcare System developed a logistic regression algorithm that combines information from both structured data and information extracted from various types of clinical notes to classify patients as RA positive or negative.
NLP component implemented as a HITEx (GATE) pipeline (Zeng et al., 2006).
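
A logistic regression classifier of this kind scores a patient by passing a weighted sum of features through the logistic function. The sketch below shows only the scoring step; the feature names, weights, intercept, and threshold are hypothetical values chosen for illustration and are not the published model.

```python
import math

# Hypothetical coefficients: one structured feature and two NLP-derived features.
WEIGHTS = {
    "ra_code_count": 0.8,        # structured: number of RA billing codes
    "nlp_ra_mentions": 0.5,      # text: non-negated mentions of RA in notes
    "nlp_erosion_mention": 1.2,  # text: erosion mentioned in a report (0 or 1)
}
INTERCEPT = -3.0


def ra_probability(features):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum of weight * feature)))."""
    z = INTERCEPT + sum(
        weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def classify_ra(features, threshold=0.5):
    """Apply an assumed probability threshold to obtain a binary label."""
    return "RA positive" if ra_probability(features) >= threshold else "RA negative"


patient = {"ra_code_count": 2, "nlp_ra_mentions": 2, "nlp_erosion_mention": 1}
print(round(ra_probability(patient), 3), classify_ra(patient))
```

In practice the weights would be estimated from a chart-reviewed training set, and the threshold chosen to reach a target specificity or PPV.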


RA Dataset

Sample variables from both structured and textual sources. Data from 6,126 patients, including approximately 156,000 clinical notes.


Contribution of NLP to the Results

Comparison of classification algorithms (Liao et al., 2010). NLP + structured data performed best.


Conclusion

Key Message
With the release of frameworks like cTAKES and HITEx, open-source clinical text processing is easier than ever.

Lessons Learned
Adding NLP to phenotype identification algorithms can dramatically improve both precision and recall over using structured codes alone.
Knowledge of Java is essential for porting UIMA and GATE clinical text processing pipelines to a new institution.
Knowledge of how NLP component tasks work is useful for performance optimization.


Questions?

Will Thompson
[email protected]

Northwestern University
Feinberg School of Medicine


U. Hahn, E. Buyko, R. Landefeld, M. Mühlhausen, M. Poprat, K. Tomanek, and J. Wermter. An overview of JCoRe, the JULIE Lab UIMA component repository. In Proceedings of the LREC’08 Workshop, Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP, Marrakech, Morocco, 2008.

Eric Just and Will Thompson. Implementing a workflow for clinical text in an enterprise data warehouse. In Proceedings of the American Medical Informatics Association Annual Symposium, 2010.

Iftikhar J Kullo, Jin Fan, Jyotishman Pathak, Guergana K Savova, Zeenat Ali, and Christopher G Chute. Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. Journal of the American Medical Informatics Association, 17:568–574, 2010.

Katherine P. Liao, Tianxi Cai, Vivian Gainer, Sergey Goryachev, Qing Zeng-Treitler, Soumya Raychaudhuri, Peter Szolovits, Susanne Churchill, Shawn Murphy, Isaac Kohane, Elizabeth Karlson, and Robert M. Plenge. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care and Research, 62(8):1120–1127, August 2010.

Guergana K Savova, Jin Fan, Zi Ye, Sean Murphy, Jiaping Zheng, Christopher G Chute, and Iftikhar J Kullo. Discovering peripheral arterial disease cases from radiology notes using natural language processing. In Proceedings of the American Medical Informatics Association Symposium, 2010a.

Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513, 2010b.

QT Zeng, S. Goryachev, S. Weiss, M. Sordo, SN Murphy, and R Lazarus. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making, 6(30), 2006.
