Enhancing Text Classifiers to Identify Disease Aspect Information Rey-Long Liu Dept. of Medical Informatics Tzu Chi University Taiwan


Post on 17-Jan-2016


Enhancing Text Classifiers to Identify

Disease Aspect Information

Rey-Long Liu

Dept. of Medical Informatics

Tzu Chi University

Taiwan


Outline

• Research background

• Problem definition

• The proposed approach: IDAI

• Empirical evaluation

• Conclusion

Disease Aspect Classification 2


Research Background


Disease Aspect Information (DAI)


An example from MedlinePlus: a text with passages about three aspects of kidney cancer (treatment, symptoms and signs, and etiology), together with several passages not related to any aspect.

You have two kidneys ... Kidney cancer forms in the … Risk factors include smoking, having certain genetic conditions and …. Often, kidney cancer doesn't have early symptoms. However, see your health care provider if you notice

– Blood in your urine
– A lump in your abdomen …
– Pain in your side …

Treatment depends on your age, …. It might include surgery, radiation, chemotherapy …


Disease Knowledge Map: An Application of DAI


Identification of DAI


[Figure: diagram of DAI identification. Recoverable labels: healthcare professionals & consumers (query & aspect; disease info.); medical texts for specific diseases; disease aspects classifier; disease aspect information (symptoms, diagnosis, treatment, etiology, prevention); healthcare decision support system (cross-disease query; disease info.); medical information provider (aspect info.; verified info.)]


Problem Definition


Goals

• Modeling the identification of DAI as a text classification problem

– Disease aspects are predefined categories of interest, not brief descriptions of information needs

• Developing a technique to enhance various kinds of text classifiers

– Given medical texts, the enhanced classifier becomes more capable of identifying those that discuss aspects of diseases


Related Work

• Text classification (TC)

– Weakness: multi-aspect information in a text introduces noise for text classifiers

• Segment extraction for topic detection

– Weakness: designed for specific descriptions (not for categories)

• Passage extraction for TC

– Weakness: determining the location and length of the passages relevant to a specific category becomes another TC problem


The Proposed Approach: IDAI


IDAI: Revising Term Frequency (TF) to Improve Classifiers


[Figure: diagram of IDAI working with the underlying text classifier. Recoverable labels: training (training texts; categories (aspects); classifier development; TF of terms w.r.t. each category; identifying term-category correlation type); testing (a text (d); assessing term frequencies (TF); classification)]


Two Strategies for TF Revision


Underlying classifier G: feature sets

• Accepting relevant texts – P: set of positively correlated features

• Rejecting irrelevant texts – Q: set of negatively correlated features

Enhanced classifier G+IDAI: TF revision by IDAI

• (Strategy I) TF of a feature f is amplified (reduced) if neighbors of f have the same (different) correlation type to the category

• (Strategy II) TF of a feature f in Q is reduced if f appears in a text segment that mainly mentions features in P
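The slides do not say how a feature is assigned to P or Q. As a minimal sketch only, the Python below assumes a feature is positively correlated with a category when its relative frequency in that category's training texts exceeds its relative frequency elsewhere, and negatively correlated in the opposite case. The function name `correlation_types` and this frequency test are illustrative assumptions, not the method used by IDAI.

```python
from collections import Counter

def correlation_types(training_texts, category):
    """Split terms into P (positively) and Q (negatively) correlated sets.

    training_texts: list of (category_label, list_of_tokens) pairs.
    Assumption: correlation is judged by comparing a term's relative
    frequency inside the category with its relative frequency outside it.
    """
    inside, outside = Counter(), Counter()
    for label, tokens in training_texts:
        (inside if label == category else outside).update(tokens)

    n_in = sum(inside.values()) or 1    # avoid division by zero
    n_out = sum(outside.values()) or 1

    p_set, q_set = set(), set()
    for term in set(inside) | set(outside):
        rate_in = inside[term] / n_in
        rate_out = outside[term] / n_out
        if rate_in > rate_out:
            p_set.add(term)
        elif rate_in < rate_out:
            q_set.add(term)
        # Terms with equal rates are left out of both sets.
    return p_set, q_set

training = [
    ("treatment", ["surgery", "radiation", "cancer"]),
    ("etiology", ["smoking", "risk", "cancer"]),
]
P, Q = correlation_types(training, "treatment")
```

Here "cancer" occurs equally often inside and outside the category, so it lands in neither set; a real system would more likely use a statistical test with a significance threshold.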


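• Revised TF of a term t in a text d with respect to a category c:

$$
\mathrm{RevisedTF}(t,d,c) =
\begin{cases}
\mathrm{WindowTF}(t,d,c), & \text{if } t \text{ is positively correlated to } c \text{ (for Strategy I)} \\
\max_{c' \neq c} \{\mathrm{WindowTF}(t,d,c')\} - \mathrm{InconsistencyTF}(t,d,c), & \text{if } t \text{ is negatively correlated to } c \text{ (for Strategy II)}
\end{cases}
$$

• $\mathrm{WindowTF}(t,d,c) = \sum_{k} (0.5 + P_{\mathrm{window},k})$, summed over each occurrence of t at position k, where $P_{\mathrm{window},k}$ is the distance-based sum of weights of other positively correlated terms in a window at k

• $\mathrm{InconsistencyTF}(t,d,c) = \sum_{k} P_{\mathrm{inconsistency},k}$, summed over each occurrence of t at position k, where $P_{\mathrm{inconsistency},k} = 0.5 \times$ (how strongly the text segment before k is dominated by the terms positively correlated to c)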
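The TF revision on this slide can be sketched in Python as follows. Several details are not given on the slide and are assumptions made for illustration: the window size (5 tokens on each side), the distance weighting (term weight divided by token distance), the length of the "segment before k" (10 tokens), the dominance measure (fraction of that segment made up of terms positively correlated to c), and the reading that the maximum ranges over the other categories c'. Any term not in the positive set of c is treated here as negatively correlated.

```python
def window_tf(tokens, term, pos_weights, window=5):
    """Sum of (0.5 + P_window,k) over each occurrence k of `term`.

    pos_weights maps positively correlated terms to weights.
    Assumption: P_window,k adds weight / distance for every OTHER
    positively correlated term within `window` tokens of position k.
    """
    total = 0.0
    for k, tok in enumerate(tokens):
        if tok != term:
            continue
        p_window = 0.0
        lo, hi = max(0, k - window), min(len(tokens), k + window + 1)
        for j in range(lo, hi):
            neighbor = tokens[j]
            if j != k and neighbor != term and neighbor in pos_weights:
                p_window += pos_weights[neighbor] / abs(j - k)
        total += 0.5 + p_window
    return total

def inconsistency_tf(tokens, term, pos_terms, segment=10):
    """Sum of P_inconsistency,k over each occurrence k of `term`.

    Assumption: dominance is the fraction of the `segment` tokens
    before k that are positively correlated to the category.
    """
    total = 0.0
    for k, tok in enumerate(tokens):
        if tok != term:
            continue
        before = tokens[max(0, k - segment):k]
        dominance = (
            sum(1 for w in before if w in pos_terms) / len(before)
            if before else 0.0
        )
        total += 0.5 * dominance
    return total

def revised_tf(tokens, term, category, pos_weights_by_cat):
    """Revised TF of `term` with respect to `category`."""
    pos_c = pos_weights_by_cat[category]
    if term in pos_c:  # Strategy I
        return window_tf(tokens, term, pos_c)
    # Strategy II: best WindowTF among the other categories,
    # minus the inconsistency penalty with respect to `category`.
    others = [
        window_tf(tokens, term, weights)
        for cat, weights in pos_weights_by_cat.items()
        if cat != category
    ]
    return max(others, default=0.0) - inconsistency_tf(tokens, term, set(pos_c))

tokens = ["surgery", "radiation", "treatment", "smoking", "risk"]
weights = {
    "treatment": {"surgery": 1.0, "radiation": 1.0, "treatment": 1.0},
    "etiology": {"smoking": 1.0, "risk": 1.0},
}
tf_pos = revised_tf(tokens, "treatment", "treatment", weights)
tf_neg = revised_tf(tokens, "smoking", "treatment", weights)
```

In this toy text, "treatment" is boosted by its positively correlated neighbors, while "smoking" is penalized with respect to the treatment category because the segment before it is dominated by treatment terms.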


Empirical Evaluation


Experimental Data

• Top-10 fatal diseases and top-20 cancers in Taiwan

– Total # of diseases: 28

– Source: Web sites of hospitals, healthcare associations, and department of health in Taiwan

– Disease aspects (categories): 5 aspects: etiology, diagnosis, treatment, prevention, and symptom

– Splitting the texts into aspects: 4669 texts about individual aspects

– Test data: Randomly sampling 10% of the 4669 texts and merging them into test texts of 1 to 5 aspects
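The test-data construction described above (sample 10% of the single-aspect texts, then merge them into texts covering 1 to 5 aspects) could be sketched as below. The slide does not say how texts are grouped for merging, so the grouping rule here is an assumption: each merged text draws a random target size from 1 to 5 and takes at most one text per aspect. The function name `build_test_texts` and the fixed seed are likewise illustrative.

```python
import random

def build_test_texts(texts, sample_rate=0.10, max_aspects=5, seed=0):
    """Hold out a sample of single-aspect texts and merge them.

    texts: list of (aspect, text) pairs.
    Returns (merged, train), where merged is a list of
    (sorted_aspect_labels, merged_text) and train is the remainder.
    """
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    n_test = round(len(texts) * sample_rate)
    pool = [texts[i] for i in indices[:n_test]]
    train = [texts[i] for i in indices[n_test:]]

    merged = []
    while pool:
        size = rng.randint(1, max_aspects)  # target number of aspects
        group, seen = [], set()
        for item in list(pool):
            aspect = item[0]
            if aspect not in seen:          # at most one text per aspect
                group.append(item)
                seen.add(aspect)
                pool.remove(item)
            if len(group) == size:
                break
        merged.append((sorted(seen), " ".join(text for _, text in group)))
    return merged, train

aspects = ["etiology", "diagnosis", "treatment", "prevention", "symptom"]
corpus = [(a, f"{a} text {i}") for i in range(20) for a in aspects]
merged, train = build_test_texts(corpus)
```

Each merged text keeps the list of aspects it contains, so a classifier's per-aspect decisions can be scored against it.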


Underlying Classifiers & Experimental Baselines

• Underlying classifier

– The Support Vector Machine (SVM) classifier

• Baseline enhancer

– CTFA (Liu, 2010), which employs Strategy I for better TC

– CTFA does not consider Strategy II


Results


Conclusion


• Disease knowledge map (Dmap)

– Supporting evidence-based medicine, health education, and healthcare decision support

• A key step to build a Dmap: Automatic identification of disease aspect information (DAI)

• Identification of DAI as a text classification problem

• Term proximity as key information to enhance existing classifiers to classify DAI
