enhancing text classifiers to identify disease aspect information rey-long liu dept. of medical...

Enhancing Text Classifiers to Identify

Disease Aspect Information

Rey-Long Liu

Dept. of Medical Informatics

Tzu Chi University

Taiwan

Outline

• Research background

• Problem definition

• The proposed approach: IDAI

• Empirical evaluation

• Conclusion

Disease Aspect Classification 2

Research Background


Disease Aspect Information (DAI)


An example from MedlinePlus: Several passages about three aspects of kidney cancer: treatment, symptom and sign, and etiology. It also contains several passages not related to any aspect.

You have two kidneys ... Kidney cancer forms in the … Risk factors include smoking, having certain genetic conditions and …. Often, kidney cancer doesn't have early symptoms. However, see your health care provider if you notice: Blood in your urine; A lump in your abdomen …; Pain in your side … Treatment depends on your age, …. It might include surgery, radiation, chemotherapy …

Disease Knowledge Map: An Application of DAI


Identification of DAI


[Diagram. Components: medical texts for specific diseases; disease aspects classifier; disease aspect information (symptoms, diagnosis, treatment, etiology, prevention); healthcare decision support system; healthcare professionals & consumers; medical information provider. Flow labels: Query & Aspect; Disease Info.; Cross-disease query; Verified Info.; Aspect Info.]

Problem Definition


Goals

• Modeling the identification of DAI as a text classification problem

– Disease aspects are predefined categories of interest, not brief descriptions of information needs

• Developing a technique to enhance various kinds of text classifiers

– Given a medical text, the enhanced classifier is more capable of identifying texts that discuss aspects of diseases


Related Work

• Text classification (TC)

– Weakness: multi-aspect information in a text introduces noise for text classifiers

• Segment extraction for topic detection

– Weakness: designed for specific descriptions (not for categories)

• Passage extraction for TC

– Weakness: determining the location and length of the passages that are relevant to a specific category becomes another TC problem


The Proposed Approach: IDAI


IDAI: Revising Term Frequency (TF) to Improve Classifiers


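[Diagram. Classifier development: training and testing of the underlying text classifier with IDAI. Training: from the training texts and the categories (aspects), IDAI assesses the TF of terms w.r.t. each category and identifies the term-category correlation type. Testing: for a text (d), IDAI assesses term frequencies (TF) before classification by the underlying text classifier.]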
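The slides do not say how the correlation type between a term and a category is decided. The following is a minimal sketch under one assumed rule: a term is treated as positively correlated ("P") to a category when it occurs in a larger share of that category's training texts than of the remaining texts, and negatively correlated ("Q") otherwise. The function name `correlation_types` and the toy data are hypothetical.

```python
from collections import defaultdict


def correlation_types(training_texts):
    """training_texts: list of (tokens, category) pairs.

    Returns {category: {term: "P" or "Q"}}. The comparison of document
    frequencies used here is an assumption made for illustration only.
    """
    n_in = defaultdict(int)    # category -> number of texts in it
    df_in = defaultdict(int)   # (term, category) -> texts of category with term
    df_all = defaultdict(int)  # term -> texts (any category) with term
    for tokens, cat in training_texts:
        n_in[cat] += 1
        for term in set(tokens):
            df_in[(term, cat)] += 1
            df_all[term] += 1

    total = len(training_texts)
    result = {}
    for cat in sorted(n_in):
        n_out = total - n_in[cat]
        types = {}
        for term, df in df_all.items():
            inside = df_in[(term, cat)]
            rate_in = inside / n_in[cat]
            rate_out = (df - inside) / n_out if n_out else 0.0
            types[term] = "P" if rate_in > rate_out else "Q"
        result[cat] = types
    return result


# Toy usage with made-up training texts.
toy = [
    (["surgery", "chemotherapy", "kidney"], "treatment"),
    (["radiation", "surgery"], "treatment"),
    (["smoking", "kidney"], "etiology"),
    (["smoking", "genetic"], "etiology"),
]
types = correlation_types(toy)
print(types["treatment"]["surgery"], types["treatment"]["smoking"])
```

Any other association measure could be substituted for the rate comparison. What matters for the later TF revision is only that each term ends up in P or Q for each category.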

Two Strategies for TF Revision


Underlying classifier G → Enhanced classifier G+IDAI (TF revision by IDAI)

• Accepting relevant texts

– Feature set P: set of positively correlated features

– (Strategy I) TF of a feature f is amplified (reduced) if neighbors of f have the same (different) correlation type to the category

• Rejecting irrelevant texts

– Feature set Q: set of negatively correlated features

– (Strategy II) TF of a feature f in Q is reduced if f appears in a text segment that mainly mentions features in P

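• Revised TF of a term t in a text d with respect to a category c:

$$\mathrm{RevisedTF}(t,d,c) = \begin{cases} \mathrm{WindowTF}(t,d,c), & \text{if } t \text{ is positively correlated to } c \text{ (Strategy I)} \\ \max_{c' \neq c}\{\mathrm{WindowTF}(t,d,c')\} - \mathrm{InconsistencyTF}(t,d,c), & \text{if } t \text{ is negatively correlated to } c \text{ (Strategy II)} \end{cases}$$

• Window TF, summed over each occurrence of t at position k:

$$\mathrm{WindowTF}(t,d,c) = \sum_{k} \left(0.5 + P_{\mathrm{window},k}\right)$$

– $P_{\mathrm{window},k}$ = distance-based sum of the weights of other positively correlated terms in a window at k

• Inconsistency TF, summed over each occurrence of t at position k:

$$\mathrm{InconsistencyTF}(t,d,c) = \sum_{k} P_{\mathrm{inconsistency},k}$$

– $P_{\mathrm{inconsistency},k}$ = 0.5 × (the degree to which the text segment before k is dominated by the terms positively correlated to c)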
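The TF revision can be sketched in Python as follows. This is a hedged illustration, not the author's implementation. It assumes that WindowTF and InconsistencyTF are sums over the occurrences of t, and that the maximum in Strategy II ranges over the categories other than c. The slides leave several details open, so the sketch also assumes: a window radius of 3 tokens, a weight of 1/distance for each positively correlated neighbor, and "dominance" measured as the fraction of positively correlated tokens among the 5 tokens before k. All function and parameter names are hypothetical.

```python
def window_tf(term, tokens, positive, radius=3):
    """Sum over occurrences k of `term` of (0.5 + P_window,k).

    P_window,k is taken as the sum of 1/distance over the other positions
    within `radius` of k that hold a term in `positive` (assumed weighting).
    """
    total = 0.0
    for k, tok in enumerate(tokens):
        if tok != term:
            continue
        lo, hi = max(0, k - radius), min(len(tokens), k + radius + 1)
        p_window = sum(
            1.0 / abs(j - k)
            for j in range(lo, hi)
            if j != k and tokens[j] in positive
        )
        total += 0.5 + p_window
    return total


def inconsistency_tf(term, tokens, positive, segment=5):
    """Sum over occurrences k of `term` of 0.5 * dominance_k.

    dominance_k is the share of the `segment` tokens before k that are in
    `positive` (an assumed reading of "dominated by").
    """
    total = 0.0
    for k, tok in enumerate(tokens):
        if tok != term:
            continue
        before = tokens[max(0, k - segment):k]
        if before:
            share = sum(1 for b in before if b in positive) / len(before)
            total += 0.5 * share
    return total


def revised_tf(term, tokens, category, positive_sets):
    """positive_sets: {category: set of terms positively correlated to it}."""
    if term in positive_sets[category]:
        # Strategy I: TF depends on positively correlated neighbors.
        return window_tf(term, tokens, positive_sets[category])
    # Strategy II: best WindowTF for another category, minus the penalty
    # for appearing in a segment dominated by terms positive to `category`.
    others = [
        window_tf(term, tokens, pos)
        for cat, pos in positive_sets.items()
        if cat != category
    ]
    best_other = max(others) if others else 0.0
    return best_other - inconsistency_tf(term, tokens, positive_sets[category])


positive_sets = {
    "treatment": {"surgery", "radiation", "chemotherapy"},
    "etiology": {"risk", "smoking", "genetic"},
}
tokens = ("risk factors include smoking treatment may include "
          "surgery radiation chemotherapy").split()
print(revised_tf("surgery", tokens, "treatment", positive_sets))
print(revised_tf("surgery", tokens, "etiology", positive_sets))
```

In this toy text, "surgery" sits next to two other treatment terms, so its TF for the treatment category rises above 1. For the etiology category it is a negatively correlated term preceded by one etiology term, so a small inconsistency penalty is subtracted.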


Empirical Evaluation


Experimental Data

• Top-10 fatal diseases and top-20 cancers in Taiwan

– Total # of diseases: 28

– Source: Web sites of hospitals, healthcare associations, and the department of health in Taiwan

– Disease aspects (categories): 5 aspects: etiology, diagnosis, treatment, prevention, and symptom

– Splitting the texts into aspects: 4669 texts about individual aspects

– Test data: Randomly sampling 10% of the 4669 texts and merging them into test texts of 1 to 5 aspects
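The construction of the test data can be sketched as below. The slides state only that 10% of the single-aspect texts are sampled and merged into test texts of 1 to 5 aspects. The merging rule used here (concatenating one held-out text from each of several randomly chosen distinct aspects) is an assumption, and the function name `build_test_texts` and its parameters are hypothetical.

```python
import random


def build_test_texts(single_aspect_texts, sample_rate=0.10, max_aspects=5, seed=0):
    """single_aspect_texts: list of (text, aspect) pairs.

    Returns (training_pairs, test_items), where each test item is
    (merged_text, set_of_aspects) covering 1..max_aspects aspects.
    """
    rng = random.Random(seed)
    indices = list(range(len(single_aspect_texts)))
    rng.shuffle(indices)
    n_test = round(len(indices) * sample_rate)
    held_out = [single_aspect_texts[i] for i in indices[:n_test]]
    training = [single_aspect_texts[i] for i in indices[n_test:]]

    # Group the held-out texts by aspect so that a merged text never
    # contains two passages about the same aspect (assumed rule).
    by_aspect = {}
    for text, aspect in held_out:
        by_aspect.setdefault(aspect, []).append(text)

    test_items = []
    while by_aspect:
        available = sorted(by_aspect)
        n = rng.randint(1, min(max_aspects, len(available)))
        chosen = rng.sample(available, n)
        parts = []
        for aspect in chosen:
            parts.append(by_aspect[aspect].pop())
            if not by_aspect[aspect]:
                del by_aspect[aspect]
        test_items.append((" ".join(parts), set(chosen)))
    return training, test_items


# Toy usage: 100 made-up single-aspect texts over the five aspects.
aspects = ["etiology", "diagnosis", "treatment", "prevention", "symptom"]
data = [(f"text {i} about {a}", a) for i in range(20) for a in aspects]
training, test_items = build_test_texts(data)
print(len(training), len(test_items))
```

Each merged test text carries the set of aspects it covers, so a classifier can be scored per aspect on whether it accepts or rejects the text.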

Underlying Classifiers & Experimental Baselines

• Underlying classifier

– The Support Vector Machine (SVM) classifier

• Baseline enhancer

– CTFA (Liu, 2010), which employs Strategy I for better TC

– CTFA does not consider Strategy II

Results


Conclusion


• Disease knowledge map (Dmap)

– Supporting evidence-based medicine, health education, and healthcare decision support

• A key step to build a Dmap: Automatic identification of disease aspect information (DAI)

• Identification of DAI as a text classification problem

• Term proximity as key information to enhance existing classifiers to classify DAI

