
Page 1: Online Active Learning with Imbalanced Classes

Online Active Learning with Imbalanced Classes

Zahra Ferdowsi
October 15th, 2013

DePaul University

Accenture Technology Labs

Page 2

Do we always have enough labeled data to train the classifier?


Page 3

Active Learning Scenario

• Large number of unlabeled examples
• Interactive setting (domain experts in the process)
• Limited labeling resources
• High labeling costs

Page 4

Healthcare example: motivation for this study

• Inefficiencies in the health insurance process result in large monetary losses affecting corporations and consumers
  – $91 billion overspent every year in the US on Health Administration and Insurance (McKinsey study, Nov 2008)
  – 131 percent increase in insurance premiums over the past 10 years

Page 5

Health Insurance Claim Process


Page 6

Healthcare example

• Claim payment errors drive a significant portion of these inefficiencies
  – Increased administrative costs and service issues for health plans
  – Overpayment of claims: direct loss
  – Underpayment of claims: loss in interest payments for the insurer, loss in revenue for the provider

Page 7

Early Rework Detection: How it was done before

1. Random Audits for Quality Control

Claims Database → random samples → manual audits by auditors

Extremely low hit rates; long audit times due to fully manual audits

Page 8

Early Rework Detection: How it was done before

2. Hypothesis and Rule Based Audits

Claims Database → database queries → generate expert hypotheses → hypothesis-based audits by auditors

Better hit rates, but still a lot of manual effort in discovering, building, updating, executing, and maintaining the hypotheses

Page 9

Data

• Duration: 2 years
• Number of claims: 3.5 million
• Labeled claims: 121k (49k rework)
• Number of features: 16k

Page 10

Features

• Member information
• Provider information
• Claim header
  – Contract information, total amount billed, diagnosis code, date of service
• Claim line details
  – Amount billed per service, procedure code, counter for the procedure (quantity)

Page 11

Predictive Modeling

• Domain characteristics
  – High-dimensional data
  – Sparse data
  – Fast training, updating, and scoring required
  – Ability to generate explanations for domain experts
• Classifier: Linear SVMs
  – Distance from the margin is used as the ranking score
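The idea of using the distance from the margin as a ranking score can be sketched as follows. This is a minimal illustration, not the study's model: the weights `w` and bias `b` are made-up stand-ins for a trained linear SVM over the 16k claim features.

```python
import numpy as np

# Assumed (illustrative) learned linear SVM parameters.
w = np.array([0.8, -0.5, 1.2])
b = -0.1

# Three hypothetical claims in feature space.
claims = np.array([
    [1.0, 0.0, 0.5],
    [0.2, 1.0, 0.1],
    [0.9, 0.3, 1.1],
])

# Signed distance to the hyperplane (up to 1/||w||); a larger score
# means the claim is ranked as more likely to be rework.
scores = claims @ w + b
ranking = np.argsort(-scores)  # audit the highest-scoring claims first
print(ranking)
```

Auditors would then review claims in `ranking` order, spending the limited labeling budget on the most suspicious claims first.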

Page 12

Well-known Instance Selection Strategies (ISS)

• Uncertainty
  – Distance to the hyperplane (Shen et al., 2004)
  – Entropy (Settles, 2008)
• Clustering
  – Density (cosine similarity): average similarity to all other cases (Shen et al., 2004)
  – Hierarchical (Dasgupta, 2008)
  – k-means using cosine similarity (Zhu et al., 2001)
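The uncertainty and density criteria above can be sketched as two scoring functions. The function names are illustrative; `margin_scores` stands for the signed distances produced by a trained classifier.

```python
import numpy as np

def uncertainty(margin_scores):
    # Instances closest to the hyperplane are the most uncertain
    # (Shen et al., 2004): negate |distance| so higher = more uncertain.
    return -np.abs(np.asarray(margin_scores))

def density(X):
    # Average cosine similarity of each instance to all others
    # (Shen et al., 2004): instances in dense regions score higher.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return (Xn @ Xn.T).mean(axis=1)
```

An ISS would then pick the instances with the highest score under its criterion for the next labeling batch.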

Page 13

Well-known ISS (cont.)

• Hybrid approach: Density × Uncertainty (Zhu et al., 2008; Settles et al., 2008)
• Query-by-Committee
  – Measuring the level of disagreement among several classifiers (Melville and Mooney, 2004)
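The two strategies on this slide can be sketched briefly. The hybrid score is simply the per-instance product of density and uncertainty, and committee disagreement can be measured as vote entropy; the function names and the binary 0/1 vote encoding are assumptions for illustration.

```python
import numpy as np

def hybrid(density, uncertainty):
    # Hybrid approach: Density * Uncertainty (Zhu et al., 2008),
    # assuming both inputs are per-instance scores where higher = better.
    return np.asarray(density) * np.asarray(uncertainty)

def qbc_vote_entropy(votes):
    # votes: (n_classifiers, n_instances) array of 0/1 predictions.
    # Higher entropy = more committee disagreement (Melville and Mooney, 2004).
    p = np.clip(votes.mean(axis=0), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
```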

Page 14

Experimental Setup

• 5-fold cross-validation
• Evaluation metric: precision at top 1%, 2%, and 5%
• Number of instances labeled in each iteration = 100
• SVM as the base classifier, using LIBSVM
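The evaluation metric above, precision at the top k%, can be computed as follows; this is a minimal sketch and the function name is mine.

```python
def precision_at_k_percent(scores, labels, k):
    # Rank instances by classifier score and report the fraction of
    # positives (e.g. rework claims) among the top k percent.
    n = max(1, int(len(scores) * k / 100))
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:n]
    return sum(labels[i] for i in top) / n
```

For example, with scores [0.9, 0.1, 0.8, 0.3] and labels [1, 0, 0, 1], precision at the top 50% looks at the two highest-scoring instances (indices 0 and 2), of which one is positive.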

The active learning loop (flowchart):

1. Select n instances randomly from the pool set
2. Remove the selected instances from the pool set
3. Add these instances, with labels, to the training set
4. Train the classifier on the training set
5. Use the classifier to measure precision @ k% on the testing set
6. Is the pool set exhausted? If yes, end; if no, select n instances from the pool set using an instance selection strategy and go to step 2
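The flowchart's loop can be sketched as a simulation driver. The `fit`, `evaluate`, and `select` callables (the ISS) are caller-supplied stand-ins, not part of the original setup.

```python
import random

def active_learning_loop(pool, train, fit, evaluate, select, n=100):
    # Flowchart from the slide: seed with a random batch, then repeatedly
    # move n ISS-selected instances from the pool to the training set,
    # retrain, and record precision @ k% until the pool is exhausted.
    history = []
    batch = random.sample(pool, min(n, len(pool)))  # step 1: random batch
    while batch:
        for x in batch:
            pool.remove(x)     # remove selected instances from the pool
            train.append(x)    # add them, with labels, to the training set
        model = fit(train)                 # retrain the classifier
        history.append(evaluate(model))    # precision @ k% on the test set
        batch = select(model, pool, min(n, len(pool))) if pool else []
    return history
```

With the experimental setup above, n = 100 and `evaluate` would report precision at the top 1%, 2%, and 5%.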

Page 15

How do existing ISS perform?

Claims data set

Page 16

How do existing ISS perform?

Claims data set

Page 17

Experiments on more datasets

• KDD Cup 1999 dataset for network intrusion detection. I use the "probing" intrusion as the label.
• HIVA is a chemoinformatics dataset used to predict which compounds are active against the AIDS HIV infection.
• ZEBRA is an embryology dataset that provides a feature representation of zebrafish embryo cells to determine whether they are in division (meiosis) or not.

Page 18

How do existing ISS perform?

ZEBRA data set

Page 19

Do existing ISS work?

• No ISS is consistently the best in all domains and at all precision levels
• Creating a validation set is challenging since labeled data are scarce and expensive to obtain

We propose an unsupervised score that can predict the performance of an ISS without using any additional labeled examples.

Page 20

Proposed Unsupervised Scores

• MS on Unlabeled set (MSU): mean score of the top k% instances in the unlabeled set
• MS on Labeled set (MSL): mean score of the top k% instances in the labeled set from the previous iteration
• MS on All (MSA): mean score of the top k% instances in the combined set (the unlabeled set plus the labeled set from the last iteration)
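The three scores above share one computation, the mean classifier score over the top k% of instances, applied to different sets. A minimal sketch with illustrative score values:

```python
import numpy as np

def mean_top_score(scores, k):
    # Mean classifier score over the top k% highest-scoring instances.
    n = max(1, int(len(scores) * k / 100))
    return float(np.sort(scores)[-n:].mean())

# Illustrative classifier scores (not real data).
unlabeled = np.array([0.9, 0.2, 0.5, 0.1])  # scores on the unlabeled pool
labeled_prev = np.array([0.6, 0.3])         # scores on last iteration's labeled set

msu = mean_top_score(unlabeled, 50)                                  # MSU
msl = mean_top_score(labeled_prev, 50)                               # MSL
msa = mean_top_score(np.concatenate([unlabeled, labeled_prev]), 50)  # MSA
```

Because no labels are consulted, all three scores can be tracked at every iteration at no extra labeling cost.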

Page 21

Do the new unsupervised scores work?


• The graphs show high correlation between the score and precision.

Certainty on Claims data set

Page 22

Do they work?


• The correlation values are promising

Page 23

Can we use the unsupervised score to predict the best ISS in each iteration?

• The online algorithm has two components:
  – The unsupervised score (MSU), which can track the performance of an individual ISS without using any validation set
  – A simple online algorithm that uses MSU to switch between different strategies
• The existing unsupervised approach (baseline):
  – CEM (Classification Entropy Maximization) as the score
  – A multi-armed bandit algorithm for switching between ISS
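The slides do not spell out the exact switching rule, so the following is only a hedged sketch of the second component, assuming a greedy rule: pick the strategy whose MSU improved the most over the last iteration. All names are illustrative.

```python
def pick_strategy(msu_history, strategies):
    # Hypothetical sketch of the switching component: choose the ISS
    # whose MSU score increased the most between the last two iterations.
    # msu_history maps each strategy name to its list of past MSU values.
    best, best_delta = strategies[0], float("-inf")
    for s in strategies:
        h = msu_history.get(s, [])
        delta = h[-1] - h[-2] if len(h) >= 2 else 0.0
        if delta > best_delta:
            best, best_delta = s, delta
    return best
```

A bandit-style rule (as in the CEM baseline) would instead trade off exploring under-tried strategies against exploiting the current best one.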

Page 24

Online Active Learning


Page 25

How does the online algorithm work?

HIVA data set

Page 26

Conclusion

• We propose an online algorithm for active learning that switches between different candidate ISS for classification on imbalanced data sets.
• This online algorithm has two components:
  – A score, MSU, that can track the performance of an individual ISS without using any validation set
  – A simple online algorithm that uses the change in MSU to switch between different strategies
• The online approach works better than (or at least similarly to) the best individual ISS and achieves 80%–100% of the highest possible precision.

Page 27

Questions


Page 28

References

[1] Active learning challenge.
[2] KDD Cup 1999.
[3] J. Attenberg and F. Provost. Inactive learning?: Difficulties employing active learning in practice. SIGKDD Explorations Newsletter, 12, March 2011.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[5] Y. Baram, R. El-Yaniv, K. Luz, and M. Warmuth. Online choice of active learning algorithms. Journal of Machine Learning Research, 2004.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2001.
[7] P. Donmez and J. G. Carbonell. Active sampling for rank learning via optimizing the area under the ROC curve. In Proceedings of the 31st European Conference on IR Research (ECIR '09), pages 78–89, Berlin, Heidelberg, 2009. Springer.
[8] P. Donmez, J. G. Carbonell, and P. N. Bennett. Dual strategy active learning. In ECML, 2007.

Page 29

References (cont.)

[9] J. He and J. Carbonell. Nearest-neighbor-based active learning for rare category detection. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
[10] M. Kumar, R. Ghani, and Z.-S. Mei. Data mining to predict and prevent errors in health insurance claims processing. In KDD '10, New York, USA, 2010.
[11] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 359–367. Morgan Kaufmann, 1998.
[12] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In ICML, 2004.
[13] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[14] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, 2008.
[15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 999–1006. Morgan Kaufmann, 2000.