Henk Harkema Andrea Setzer Ian Roberts Rob Gaizauskas Mark Hepple University of Sheffield Jeremy Rogers University of Manchester Richard Power Open University Extraction and Analysis of Information from Structured and Unstructured Clinical Records AHM 2005 Text Mining Workshop 29/9/5



Henk Harkema, Andrea Setzer, Ian Roberts, Rob Gaizauskas, Mark Hepple
University of Sheffield

Jeremy Rogers
University of Manchester

Richard Power
Open University

Extraction and Analysis of Information from Structured and Unstructured Clinical Records

AHM 2005 Text Mining Workshop 29/9/5


Overview

• Background

• Information Extraction

• Information Integration


Background: CLEF

• Clinical e-Science Framework

• Objective:

• To develop a high quality, secure and interoperable information repository, derived from operational electronic patient records, to enable ethical and user-friendly access to patient information in support of clinical care and biomedical research

• Duration, funding, participants:

• 2003 – 2005 (CLEF), 2005 – 2007 (CLEF-Services)

• Funded by Medical Research Council (MRC)

• Six universities, Royal Marsden Hospital, industrial partners engaged through CLEF Industrial Forum Meetings


Sheffield NLP & CLEF

• Information Extraction

• Analyzing clinical narratives to extract medically relevant entities and events, and their properties and relationships

• Information Importation

• Importing extracted information into the CLEF repository

• Information Integration

• Combining extracted information with structured information (i.e., non-narrative data) already in repository in order to build summary of patient’s conditions and treatment over time


Medical IE

• Standard Information Extraction tasks:

• Entity/event extraction & relationship extraction

• Additional challenges:

• Cross-document event co-reference

• Same event mentioned in multiple documents; many documents provide only partial descriptions of events

• Modality of Information

• Negation: “I cannot feel any lump in her right supraclavicular fossa”

• Uncertainty: “I just wonder if there is an outside possibility that she might have mediastinal fibrosis to account for her symptomology”

• Temporality of Information


Entities, Events & Relationships

• Entities, events:

• Problem: melanoma, swelling, …

• Present/absent

• Clinical course: getting worse, getting better, no change

• Intervention: amputation, chemotherapy, …

• Status: planned, booked, started, completed, …

• Investigation: CT scan, ultrasound, …

• Status: planned, booked, started, completed, …

• Goal: treat, cure, palliate

• Drug: Atenolol, antibiotics, …

• Locus: abdomen, blood, …

• Laterality: left, right


Entities, Events & Relationships

• Relationships:

• Location of problem: problem locus

• hip pain

• lesions in her liver

• Finding of investigation: investigation problem

• An ECG examination revealed atrial fibrillation

• CT scan of her thorax and abdomen shows progressive disease

• Target of intervention: intervention locus

• radiotherapy to back

• breast radiotherapy

• Further relationships


IE Approach

• Pipeline of processing modules

• Pre-processing:

• Tokenization, sentence splitting

• Lexical & terminological processing:

• Morphological analysis, term look-up, term parsing

• Syntactic & semantic processing:

• Sentence-based syntactic, semantic analysis

• Discourse processing & IE pattern application:

• Integration of semantic representations into discourse model

• Application of patterns to collect information to be extracted
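The staged design above can be pictured as a chain of functions, each enriching a shared document record. The following is only a minimal sketch and not the CLEF system: the `Doc` record, the stage names and the toy regex-based splitter and tokenizer are assumptions made for illustration, and only the first pre-processing stages are shown.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical container passed from stage to stage."""
    text: str
    sentences: list = field(default_factory=list)
    tokens: list = field(default_factory=list)  # one token list per sentence


def split_sentences(doc):
    # Toy splitter: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.sentences = [s for s in parts if s]
    return doc


def tokenize(doc):
    # Toy tokenizer: runs of word characters, or single punctuation marks.
    doc.tokens = [re.findall(r"\w+|[^\w\s]", s) for s in doc.sentences]
    return doc


def run_pipeline(text, stages):
    # Each stage takes a Doc and returns it with more annotations attached.
    doc = Doc(text)
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    "CT scan shows progressive disease. Chest X-ray arranged.",
    [split_sentences, tokenize],
)
```

Later stages (term look-up, parsing, discourse processing) would be further functions appended to the `stages` list in the same way.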


Terminology Processing

• Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation

• Termino contains a database holding large numbers of terms imported from various existing terminological resources, including UMLS

• Efficient recognition of terms in text is achieved through use of finite state recognizers compiled from contents of database

• The results of lexical look-up in Termino can feed into further term processing components, e.g., term parser
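As a rough illustration of term recognition with a compiled recognizer, a token-level trie can stand in for a finite state recognizer. This sketch does not reproduce Termino's implementation, which is not described in the slides: the function names, the tiny term list and the longest-match policy are all assumptions.

```python
_TYPE = object()  # sentinel key marking an accepting state; never equal to a token


def build_trie(terms):
    """Compile a {term: semantic_type} mapping into a token-level trie."""
    root = {}
    for term, sem_type in terms.items():
        node = root
        for tok in term.lower().split():
            node = node.setdefault(tok, {})
        node[_TYPE] = sem_type
    return root


def find_terms(tokens, trie):
    """Return (start, end, semantic_type) for the longest match at each position."""
    matches, i = [], 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if _TYPE in node:
                best = (i, j, node[_TYPE])
        if best:
            matches.append(best)
            i = best[1]  # continue after the matched term
        else:
            i += 1
    return matches


# Hypothetical mini-lexicon; a real one would be loaded from the term database.
trie = build_trie({
    "ECG examination": "investigation",
    "atrial fibrillation": "problem",
})
tokens = "An ECG examination revealed atrial fibrillation".split()
found = find_terms(tokens, trie)
```

The matches carry token offsets and a semantic type, which is the kind of output a downstream term parser could consume.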


Terminology Processing

• Termino for CLEF

• Imported 160,000 terms from UMLS drawn from semantic types such as pharmacologic substances, anatomical structures, therapeutic procedures, diagnostic procedures, …

• Term grammars

• Rules for combining terms identified by term look-up in Termino into longer terms

• Example: locations in the lung

Termino

location_np → latitude_adj area_noun

latitude_adj: upper, middle, lower, mid, basal

area_noun: zone, region, area, field, lung, lobe
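The lung-location rule can be sketched as a check for a latitude adjective immediately followed by an area noun. The two word lists come from the slide; the function name, the span representation and the example sentence are assumptions made for illustration, and the actual term parser is not shown in the slides.

```python
LATITUDE_ADJ = {"upper", "middle", "lower", "mid", "basal"}
AREA_NOUN = {"zone", "region", "area", "field", "lung", "lobe"}


def find_location_nps(tokens):
    """Return (start, end) token spans matching latitude_adj followed by area_noun."""
    spans = []
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in LATITUDE_ADJ and tokens[i + 1].lower() in AREA_NOUN:
            spans.append((i, i + 2))
    return spans


# Invented example sentence, not taken from the clinical data.
tokens = "shadowing in the right upper lobe".split()
spans = find_location_nps(tokens)
```

Here `spans` marks "upper lobe" as a candidate location term built from two shorter looked-up terms.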


Information Extraction Patterns

• IE patterns inspect syntactic and semantic analyses and assert properties of entities and relationships between entities

• Example: finding of investigation

• “CT scan of her thorax shows progressive disease”

• IE pattern:

invest_finding(I, P) if
    investigation(I),
    problem(P),
    show_event(S),
    lsubj(S, I),
    lobj(S, P).
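To make the rule's logic concrete, the same conjunction of conditions can be evaluated over a small set of facts. This is a sketch under assumptions: the tuple encoding of facts and the identifiers `e1` to `e3` are hypothetical, and the original system evaluates such patterns in its own rule formalism rather than in Python.

```python
def invest_finding(facts):
    """Return (I, P) pairs for which every condition of the rule holds."""
    investigations = {f[1] for f in facts if f[0] == "investigation"}
    problems = {f[1] for f in facts if f[0] == "problem"}
    show_events = {f[1] for f in facts if f[0] == "show_event"}
    results = []
    for s in sorted(show_events):
        # Logical subject and object of the "show" event S.
        subjects = [f[2] for f in facts if f[0] == "lsubj" and f[1] == s]
        objects = [f[2] for f in facts if f[0] == "lobj" and f[1] == s]
        for i in subjects:
            for p in objects:
                if i in investigations and p in problems:
                    results.append((i, p))
    return results


# Hypothetical analysis of "CT scan of her thorax shows progressive disease".
facts = {
    ("investigation", "e1"),  # "CT scan"
    ("problem", "e2"),        # "progressive disease"
    ("show_event", "e3"),     # "shows"
    ("lsubj", "e3", "e1"),
    ("lobj", "e3", "e2"),
}
findings = invest_finding(facts)
```

If the subject of the show event were not an investigation, or the object not a problem, the pattern would assert nothing.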


Information Extraction Patterns

• Finding patterns

• Hand-crafted patterns

• “Redundancy” approach:

• given a patient for whom a relationship between two particular entities is known to exist (e.g., we know patient has a tumor in his lung), …

• find all sentences in all notes of this patient that contain these two entities, …

• and assume these sentences express the same relationship
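The collection step of the "redundancy" approach can be sketched as a filter over a patient's notes. Everything here is illustrative: the function name, the representation of notes as lists of sentences, the invented example sentences, and the use of plain substring matching in place of proper entity recognition are all assumptions.

```python
def candidate_sentences(notes, entity_a, entity_b):
    """Collect sentences from one patient's notes that mention both entities."""
    a, b = entity_a.lower(), entity_b.lower()
    hits = []
    for note in notes:
        for sentence in note:
            lowered = sentence.lower()
            if a in lowered and b in lowered:
                hits.append(sentence)
    return hits


# Invented notes for a patient known to have a tumor in the lung.
notes = [
    ["The tumor in the left lung is unchanged.", "She feels well."],
    ["Lung function is stable.", "CT confirms a lung tumor."],
]
examples = candidate_sentences(notes, "tumor", "lung")
```

The collected sentences are then assumed to express the known relationship and can be inspected for recurring patterns; sentences that mention both entities without relating them would be noise in this set.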


Information Integration

• Combining structured information in repository with information extracted from narratives into coherent overview of patient’s condition and treatment over time

• Issues in Information Integration:

• Ambiguity: given an event extracted from a narrative, to which event in the structured data does it correspond?

• Fragmentation & duplication: Information Extraction over narrative data produces collection of potentially fragmented and duplicated descriptions of medical events which need to be sorted out

• Investigation of contribution of temporal information found within narratives to Information Integration


Linking extracted and structured events

• Reduce ambiguity through use of:

• Medical information: type of event, relationships, …

• Temporal information: time stamps, temporal expressions, verbal tense & aspect, …

Events in structured data

Type: X-RAY, Location: chest, Date: 2000-05-23

Type: X-RAY, Location: chest, Date: 2000-05-26

Type: MRI, Location: abdomen, Date: 2000-05-23

Type: X-RAY, Location: chest, Date: 2000-07-19

Events in narratives

2000-05-16: Chest X-RAY arranged for next week.

2000-05-24: The chest X-RAY performed …


Constraint Satisfaction

• Ambiguity reduction as a Constraint Satisfaction problem

• Each narrative event is associated with a time domain, i.e., set of possible dates on which event could have taken place

• Temporal and medical information extracted from narratives is formulated as set of constraints on time domain of narrative event

• Use Constraint Logic Programming tools to resolve time domains of narrative events

• If resolved time domain of narrative event contains date of structured event, link narrative event to structured event
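The linking procedure can be illustrated with the example events from the previous slide. This sketch replaces Constraint Logic Programming tools with a brute-force filter over an explicit set of dates, which is an assumption made to keep the example self-contained. The function names, the choice of the year 2000 as the initial domain, and the reading of past tense as "strictly before the letter date" are likewise assumptions.

```python
from datetime import date, timedelta


def resolve_domain(domain, constraints):
    """Keep only the dates that satisfy every constraint."""
    return {d for d in domain if all(c(d) for c in constraints)}


def link(narrative_type, domain, structured_events):
    """Link to structured events of the same type whose date lies in the domain."""
    return [e for e in structured_events
            if e["type"] == narrative_type and e["date"] in domain]


structured = [
    {"type": "X-RAY", "location": "chest", "date": date(2000, 5, 23)},
    {"type": "X-RAY", "location": "chest", "date": date(2000, 5, 26)},
    {"type": "MRI", "location": "abdomen", "date": date(2000, 5, 23)},
    {"type": "X-RAY", "location": "chest", "date": date(2000, 7, 19)},
]

# Narrative event: "The chest X-RAY performed ..." from a letter dated 2000-05-24.
letter = date(2000, 5, 24)

# Initial time domain: every day of 2000 (an arbitrary bound for this sketch).
year_2000 = {date(2000, 1, 1) + timedelta(days=n) for n in range(366)}

# Past tense, read conservatively: the event happened before the letter date.
constraints = [lambda d: d < letter]

domain = resolve_domain(year_2000, constraints)
links = link("X-RAY", domain, structured)
```

In this toy case the temporal constraint removes the X-rays of 2000-05-26 and 2000-07-19, and the type constraint removes the MRI, leaving one candidate.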


Evaluation

• Evaluation of effectiveness of temporal constraints in Information Integration

• Link each narrative event to set of potentially matching events of same type in structured data according to medical constraints

• Measure how well application of temporal constraints narrows down this initial set of “structured” candidates

• We used a semi-automated pipeline to produce an idealised version of what a fully automatic system would provide as the input to the CSP component

• Results must be viewed in the light of the idealised input


Data and Gold Standard

• Confined to investigation events

• Patient notes of 5 patients analysed and annotated (large overhead of manual annotation)

• 446 documents, of which 94 contain 152 investigation events

• Manually created Gold Standard linking each narrative event to the structured events of the same type and marking which of these are the correct targets


Annotating Temporal Information

• We annotate times, events (i.e., investigations) and temporal relations holding between these

• The annotation scheme used is a subset of the TimeML annotation scheme

• Example:

We have arranged an MRI scan for next week.

Temporal relation: “MRI scan” (event) during “next week” (time)


Evaluation: Recall & Precision

• We want to quantify the impact of using temporal constraints to reduce the ambiguity of mapping narrative events to structured events

• Ideally, temporal constraints should greatly reduce ambiguity by eliminating incorrect candidates from the set of possible targets in structured data – but not eliminate the true target

• Global evaluation measures:

• Recall: proportion of correct targets recognised as possible targets

• Precision: proportion of recognised possible targets that are correct

• We applied both metrics before and after application of temporal constraints in CSP and compared the results
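The two global measures can be written down as counts pooled over all narrative events. This is a sketch of one plausible reading of the definitions above: the function name, the dictionary-of-sets data layout and the toy identifiers are assumptions, and the numbers are invented rather than taken from the evaluation.

```python
def recall_precision(candidates, gold):
    """Recall and precision pooled over all narrative events.

    candidates: {narrative_id: set of structured ids still considered possible}
    gold:       {narrative_id: set of correct structured ids}
    """
    hits = sum(len(candidates[e] & gold[e]) for e in gold)
    n_gold = sum(len(gold[e]) for e in gold)
    n_cand = sum(len(candidates[e]) for e in gold)
    recall = hits / n_gold if n_gold else 0.0
    precision = hits / n_cand if n_cand else 0.0
    return recall, precision


# Toy data: two narrative events, three structured events of the same type.
gold = {"n1": {"s1"}, "n2": {"s4"}}
before = {"n1": {"s1", "s2", "s4"}, "n2": {"s1", "s2", "s4"}}
after = {"n1": {"s1"}, "n2": {"s2", "s4"}}

r_before, p_before = recall_precision(before, gold)
r_after, p_after = recall_precision(after, gold)
```

In the toy data, pruning candidates raises precision while recall stays at 1.0 because no correct target was eliminated.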


Evaluation: Strict & Liberal Accuracy

• The limitation of the Recall and Precision metrics is that they score for the overall data set – i.e. over all events for all 5 patients

• If even only a small number of events retain a large number of possible targets, the overall precision score will be low even though most events are close to being correctly resolved

• Consequently, we developed two “accuracy” based scores (liberal and strict), which quantify for each narrative event the extent to which it is correctly resolved, and then average across all narrative events

• Liberal score for single event: 1 if at least one true target is correctly preserved, 0 otherwise

• Strict score for single event: proportion of recognised possible targets that are correct
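The per-event scores and their averages can be sketched as follows. The function name, the data layout and the toy identifiers are assumptions, as is the choice to give an event with no remaining candidates a strict score of 0; the slides do not say how that case is handled.

```python
def liberal_strict(candidates, gold):
    """Average liberal and strict accuracy over all narrative events.

    candidates: {narrative_id: set of structured ids still considered possible}
    gold:       {narrative_id: set of correct structured ids}
    """
    liberal, strict = [], []
    for e in gold:
        correct = candidates[e] & gold[e]
        # Liberal: 1 if at least one true target survives, else 0.
        liberal.append(1.0 if correct else 0.0)
        # Strict: share of the surviving candidates that are correct.
        strict.append(len(correct) / len(candidates[e]) if candidates[e] else 0.0)
    n = len(gold)
    return sum(liberal) / n, sum(strict) / n


# Toy data, not taken from the evaluation.
gold = {"n1": {"s1"}, "n2": {"s4"}, "n3": {"s2"}}
candidates = {"n1": {"s1"}, "n2": {"s2", "s4"}, "n3": {"s1"}}

liberal_acc, strict_acc = liberal_strict(candidates, gold)
```

Because each event contributes equally to the average, one event with many leftover candidates lowers the strict score far less than it lowers pooled precision.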


Results

                    Before CSP    After CSP

Recall              1.0           0.94

Precision           0.05          0.09

Liberal Accuracy    0.83          0.78

Strict Accuracy     0.08          0.27


Discussion

• The results show that there is a substantial amount of ambiguity at the start, which is reduced by application of temporal constraints, as best shown by the strict accuracy score

• A large degree of ambiguity remains, but …

• Use of temporal information is conservative

• E.g., a “past” narrative event is linked to all structured events dated before the date of the letter, but could heuristically be linked to the one structured event dated immediately before the date of the letter

• We have not yet exploited additional medical information, e.g., the locus of an investigation, nor additional temporal information, e.g., temporal relationships between events


Conclusions & Future Work

• Information Extraction

• Essential functionality implemented

• Extending coverage of system

• Evaluating performance

• Information Integration

• Initial assessment of approach

• Automating processing pipeline

• Extending method to other events