Enhancing Text Classifiers to Identify
Disease Aspect Information
Rey-Long Liu
Dept. of Medical Informatics
Tzu Chi University
Taiwan
Outline
• Research background
• Problem definition
• The proposed approach: IDAI
• Empirical evaluation
• Conclusion
Disease Aspect Classification 2
Research Background
Disease Aspect Information (DAI)
An example from MedlinePlus: several passages cover three aspects of kidney cancer (treatment, symptoms and signs, and etiology), while several other passages are not related to any aspect.
You have two kidneys ... Kidney cancer forms in the … Risk factors include smoking, having certain genetic conditions and …. Often, kidney cancer doesn't have early symptoms. However, see your health care provider if you notice: Blood in your urine. A lump in your abdomen … Pain in your side … Treatment depends on your age, …. It might include surgery, radiation, chemotherapy …
Disease Knowledge Map: An Application of DAI
Identification of DAI
[Diagram: a Disease Aspects Classifier turns medical texts for specific diseases into disease aspect information (symptoms, diagnosis, treatment, etiology, prevention). Around it: healthcare professionals & consumers (Query & Aspect; Disease Info.), healthcare decision support system (Cross-disease query; Disease Info.), and medical information provider (Aspect Info.; Verified Info.).]
Problem Definition
Goals
• Modeling the identification of DAI as a text classification problem
– Disease aspects are predefined categories of interest, not brief descriptions of information needs
• Developing a technique to enhance various kinds of text classifiers
– Given medical texts, the enhanced classifier becomes more capable of identifying those that talk about aspects of diseases
Related Work
• Text classification (TC)
– Weakness: multi-aspect information in a text introduces noise for text classifiers
• Segment extraction for topic detection
– Weakness: designed for specific descriptions (not for categories)
• Passage extraction for TC
– Weakness: determining the location and length of the passages that are relevant to a specific category becomes another TC problem
The Proposed Approach: IDAI
IDAI: Revising Term Frequency (TF) to Improve Classifiers
[Diagram: two phases, Training and Testing. Labels: Categories (aspects); Training Texts; Classifier Development; A text (d); Underlying Text Classifier; IDAI; Classification. IDAI steps: Assessing Term Frequencies (TF); TF of terms w.r.t. each category; Identifying Term-Category Correlation type.]
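The slides do not say how the term-category correlation type is decided. The sketch below is only one plausible illustration: it assumes a term is positively correlated to a category when it occurs in a larger share of that category's training texts than of the other texts, and negatively correlated in the opposite case. The function name `correlation_type` and the data layout are hypothetical.

```python
from typing import List, Tuple

def correlation_type(term: str,
                     training: List[Tuple[List[str], str]],
                     category: str) -> str:
    """Return 'positive', 'negative', or 'none' for (term, category).

    Assumption (not from the slides): the type is decided by comparing
    the share of in-category texts containing the term with the share
    of out-of-category texts containing it.
    """
    in_cat = [set(tokens) for tokens, c in training if c == category]
    out_cat = [set(tokens) for tokens, c in training if c != category]
    rate_in = sum(term in d for d in in_cat) / len(in_cat) if in_cat else 0.0
    rate_out = sum(term in d for d in out_cat) / len(out_cat) if out_cat else 0.0
    if rate_in > rate_out:
        return "positive"
    if rate_in < rate_out:
        return "negative"
    return "none"

# Toy training texts: (tokens, aspect)
training = [
    (["surgery", "and", "radiation"], "treatment"),
    (["chemotherapy", "or", "surgery"], "treatment"),
    (["smoking", "is", "a", "risk"], "etiology"),
]
print(correlation_type("surgery", training, "treatment"))
print(correlation_type("surgery", training, "etiology"))
```

A statistical test such as chi-square is a common alternative for this kind of decision; the comparison of shares is used here only to keep the sketch short.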
Two Strategies for TF Revision
Underlying classifier G (feature sets) vs. enhanced classifier G+IDAI (TF revision by IDAI)
• Accepting relevant texts: P, the set of positively correlated features
• Rejecting irrelevant texts: Q, the set of negatively correlated features
• TF revision by IDAI
– Strategy I: TF of a feature f is amplified (reduced) if neighbors of f have the same (different) correlation type to the category
– Strategy II: TF of a feature f in Q is reduced if f appears in a text segment that mainly mentions features in P
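• Revised TF:
$$\mathrm{RevisedTF}(t,d,c) = \begin{cases} \mathrm{WindowTF}(t,d,c), & \text{if } t \text{ is positively correlated to } c \text{ (Strategy I)} \\ \max_{c' \neq c}\{\mathrm{WindowTF}(t,d,c')\} - \mathrm{InconsistencyTF}(t,d,c), & \text{if } t \text{ is negatively correlated to } c \text{ (Strategy II)} \end{cases}$$
• Window TF: $\mathrm{WindowTF}(t,d,c) = \sum_k (0.5 + P_{\mathrm{window},k})$, for each occurrence of t at position k,
where $P_{\mathrm{window},k}$ = distance-based sum of weights of other positively correlated terms in a window at k
• Inconsistency TF: $\mathrm{InconsistencyTF}(t,d,c) = \sum_k P_{\mathrm{inconsistency},k}$, for each occurrence of t at position k,
where $P_{\mathrm{inconsistency},k} = 0.5 \times$ (how much the text segment before k is dominated by the terms positively correlated to c)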
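To make the TF revision concrete, here is a minimal Python sketch of the two strategies. The slides leave several details open, so the following are assumptions made only for illustration: every positively correlated term has weight 1, the distance-based weight of a neighbor is 1/distance, the window covers 3 tokens on each side, "dominated" means the fraction of the preceding 3 tokens that are positively correlated to c, and terms with no correlation to c keep their raw TF. All function and parameter names are hypothetical.

```python
from typing import Dict, List, Set

WINDOW = 3  # assumed window size (tokens on each side); not given in the slides

def window_tf(t: str, doc: List[str], pos_c: Set[str], window: int = WINDOW) -> float:
    """Sum over occurrences k of t of (0.5 + P_window,k)."""
    total = 0.0
    for k, tok in enumerate(doc):
        if tok != t:
            continue
        lo, hi = max(0, k - window), min(len(doc), k + window + 1)
        # Assumed weighting: each other positively correlated term adds 1/distance.
        p_window = sum(1.0 / abs(j - k) for j in range(lo, hi)
                       if j != k and doc[j] != t and doc[j] in pos_c)
        total += 0.5 + p_window
    return total

def inconsistency_tf(t: str, doc: List[str], pos_c: Set[str], window: int = WINDOW) -> float:
    """Sum over occurrences k of t of 0.5 * dominance of the segment before k."""
    total = 0.0
    for k, tok in enumerate(doc):
        if tok != t:
            continue
        segment = doc[max(0, k - window):k]
        if segment:
            # Assumed dominance: share of preceding tokens positively correlated to c.
            total += 0.5 * sum(w in pos_c for w in segment) / len(segment)
    return total

def revised_tf(t: str, doc: List[str], c: str,
               pos: Dict[str, Set[str]], neg: Dict[str, Set[str]]) -> float:
    if t in pos[c]:                      # Strategy I
        return window_tf(t, doc, pos[c])
    if t in neg[c]:                      # Strategy II
        others = [window_tf(t, doc, pos[o]) for o in pos if o != c]
        return (max(others) if others else 0.0) - inconsistency_tf(t, doc, pos[c])
    return float(doc.count(t))           # assumption: uncorrelated terms keep raw TF

doc = ["surgery", "and", "chemotherapy", "smoking", "risk"]
pos = {"treatment": {"surgery", "chemotherapy"}, "etiology": {"smoking", "risk"}}
neg = {"treatment": {"smoking", "risk"}, "etiology": {"surgery", "chemotherapy"}}
print(revised_tf("surgery", doc, "treatment", pos, neg))
print(revised_tf("smoking", doc, "treatment", pos, neg))
```

In this toy text, "surgery" gains weight for the treatment category because "chemotherapy" sits inside its window, while "smoking" loses part of its weight for treatment because the segment before it is mostly treatment terms.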
Empirical Evaluation
Experimental Data
• Top-10 fatal diseases and top-20 cancers in Taiwan
– Total # of diseases: 28
– Source: Web sites of hospitals, healthcare associations, and department of health in Taiwan
– Disease aspects (categories): 5 aspects: etiology, diagnosis, treatment, prevention, and symptom
– Splitting the texts into aspects: 4669 texts about individual aspects
– Test data: Randomly sampling 10% of the 4669 texts and merging them into test texts of 1 to 5 aspects
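The test-data construction can be sketched as follows. The slides only state that 10% of the single-aspect texts are sampled and merged into test texts of 1 to 5 aspects; how texts are grouped is not described, so this sketch assumes a random target size per test text and one passage per aspect within each merged text. The function name `build_test_texts` and the `(text, aspect)` layout are hypothetical.

```python
import random
from typing import List, Set, Tuple

def build_test_texts(texts: List[Tuple[str, str]], rate: float = 0.10,
                     max_aspects: int = 5, seed: int = 0) -> List[Tuple[str, Set[str]]]:
    """Sample a share of single-aspect texts and merge them into multi-aspect test texts."""
    rng = random.Random(seed)
    sampled = rng.sample(texts, max(1, round(len(texts) * rate)))
    merged = []
    while sampled:
        target = rng.randint(1, max_aspects)  # assumed: random number of aspects per test text
        parts, seen, rest = [], set(), []
        for body, aspect in sampled:
            if len(parts) < target and aspect not in seen:
                parts.append(body)
                seen.add(aspect)
            else:
                rest.append((body, aspect))
        merged.append((" ".join(parts), seen))
        sampled = rest
    return merged

aspects = ["etiology", "diagnosis", "treatment", "prevention", "symptom"]
texts = [(f"passage {i}", aspects[i % 5]) for i in range(100)]
tests = build_test_texts(texts)
print(len(tests), [sorted(labels) for _, labels in tests][:2])
```

Each merged test text carries the set of aspects it contains, which is what a multi-label evaluation of the classifier would compare against.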
Underlying Classifiers & Experimental Baselines
• Underlying classifier
– The Support Vector Machine (SVM) classifier
• Baseline enhancer
– CTFA (Liu, 2010), which employs Strategy I for better TC
– CTFA does not consider Strategy II
Results
Conclusion
• Disease knowledge map (Dmap)
– Supporting evidence-based medicine, health education, and healthcare decision support
• A key step to build a Dmap: Automatic identification of disease aspect information (DAI)
• Identification of DAI as a text classification problem
• Term proximity as key information to enhance existing classifiers to classify DAI