
Page 1: Online Active Learning with Imbalanced Classes

Online Active Learning with Imbalanced Classes

Zahra Ferdowsi
October 15th, 2013

DePaul University

Accenture Technology Labs

Page 2

Do we always have enough labeled data to train the classifier?


Page 3

Active Learning Scenario

• Large number of unlabeled examples
• Interactive setting (domain experts in the process)
• Limited labeling resources
• High labeling costs

Page 4

Healthcare example: motivation for this study

• Inefficiencies in the health insurance process result in large monetary losses affecting corporations and consumers
  – $91 billion overspent every year in the US on Health Administration and Insurance (McKinsey study, Nov 2008)
  – 131 percent increase in insurance premiums over the past 10 years

Page 5

Health Insurance Claim Process


Page 6

Healthcare example

• Claim payment errors drive a significant portion of these inefficiencies
  – Increased administrative costs and service issues for health plans
  – Overpayment of claims: direct loss
  – Underpayment of claims: loss in interest payments for the insurer, loss in revenue for the provider

Page 7

Early Rework Detection: How it was done before

1. Random Audits for Quality Control

Claims Database → random samples → manual audits by auditors

Extremely low hit rates; long audit times due to fully manual audits

Page 8

Early Rework Detection: How it was done before

2. Hypothesis and Rule Based Audits

Claims Database → database queries → generate expert hypotheses → hypothesis-based audits by auditors

Better hit rates, but still a lot of manual effort in discovering, building, updating, executing, and maintaining the hypotheses

Page 9

Data

• Duration: 2 years
• Number of claims: 3.5 million
• Labeled claims: 121k (49k rework)
• Number of features: 16k

Page 10

Features

• Member information
• Provider information
• Claim header
  – Contract information, total amount billed, diagnosis code, date of service
• Claim line details
  – Amount billed per service, procedure code, counter for the procedure (quantity)

Page 11

Predictive Modeling

• Domain characteristics
  – High-dimensional data
  – Sparse data
  – Fast training, updating, and scoring required
  – Ability to generate explanations for domain experts
• Classifier: Linear SVMs
  – Distance from the margin is used as the ranking score
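The idea of using the distance from the margin as a ranking score can be sketched as follows. This is a minimal illustration, not the study's model: the weights `w` and bias `b` are made-up stand-ins for a trained linear SVM over the 16k claim features.

```python
import numpy as np

# Assumed (illustrative) learned linear SVM parameters.
w = np.array([0.8, -0.5, 1.2])
b = -0.1

# Three hypothetical claims in feature space.
claims = np.array([
    [1.0, 0.0, 0.5],
    [0.2, 1.0, 0.1],
    [0.9, 0.3, 1.1],
])

# Signed distance to the hyperplane (up to 1/||w||); a larger score
# means the claim is ranked as more likely to be rework.
scores = claims @ w + b
ranking = np.argsort(-scores)  # audit the highest-scoring claims first
print(ranking)
```

Auditors would then review claims in `ranking` order, spending the limited labeling budget on the most suspicious claims first.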

Page 12

Well-known Instance Selection Strategies (ISS)

• Uncertainty
  – Distance to the hyperplane (Shen et al., 2004)
  – Entropy (Settles, 2008)
• Clustering
  – Density (cosine similarity): average similarity to all other cases (Shen et al., 2004)
  – Hierarchical (Dasgupta, 2008)
  – k-means using cosine similarity (Zhu et al., 2001)
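The uncertainty and density criteria above can be sketched as two scoring functions. The function names are illustrative; `margin_scores` stands for the signed distances produced by a trained classifier.

```python
import numpy as np

def uncertainty(margin_scores):
    # Instances closest to the hyperplane are the most uncertain
    # (Shen et al., 2004): negate |distance| so higher = more uncertain.
    return -np.abs(np.asarray(margin_scores))

def density(X):
    # Average cosine similarity of each instance to all others
    # (Shen et al., 2004): instances in dense regions score higher.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return (Xn @ Xn.T).mean(axis=1)
```

An ISS would then pick the instances with the highest score under its criterion for the next labeling batch.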

Page 13

Well-known ISS (cont.)

• Hybrid approach: Density × Uncertainty (Zhu et al., 2008; Settles et al., 2008)
• Query-by-Committee
  – Measuring the level of disagreement among several classifiers (Melville and Mooney, 2004)
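The two strategies on this slide can be sketched briefly. The hybrid score is simply the per-instance product of density and uncertainty, and committee disagreement can be measured as vote entropy; the function names and the binary 0/1 vote encoding are assumptions for illustration.

```python
import numpy as np

def hybrid(density, uncertainty):
    # Hybrid approach: Density * Uncertainty (Zhu et al., 2008),
    # assuming both inputs are per-instance scores where higher = better.
    return np.asarray(density) * np.asarray(uncertainty)

def qbc_vote_entropy(votes):
    # votes: (n_classifiers, n_instances) array of 0/1 predictions.
    # Higher entropy = more committee disagreement (Melville and Mooney, 2004).
    p = np.clip(votes.mean(axis=0), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
```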

Page 14

Experimental Setup

• 5-fold cross-validation
• Evaluation metric: precision at top 1%, 2%, and 5%
• Number of instances labeled in each iteration = 100
• SVM as the base classifier, using LIBSVM
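The evaluation metric above, precision at the top k%, can be computed as follows; this is a minimal sketch and the function name is mine.

```python
def precision_at_k_percent(scores, labels, k):
    # Rank instances by classifier score and report the fraction of
    # positives (e.g. rework claims) among the top k percent.
    n = max(1, int(len(scores) * k / 100))
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:n]
    return sum(labels[i] for i in top) / n
```

For example, with scores [0.9, 0.1, 0.8, 0.3] and labels [1, 0, 0, 1], precision at the top 50% looks at the two highest-scoring instances (indices 0 and 2), of which one is positive.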

The active learning loop (flowchart):

1. Select n instances randomly from the pool set
2. Remove the selected instances from the pool set
3. Add these instances, with labels, to the training set
4. Train the classifier on the training set
5. Use the classifier to measure precision @ k% on the testing set
6. Is the pool set exhausted? If yes, end; if no, select n instances from the pool set using an instance selection strategy and go to step 2
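The flowchart's loop can be sketched as a simulation driver. The `fit`, `evaluate`, and `select` callables (the ISS) are caller-supplied stand-ins, not part of the original setup.

```python
import random

def active_learning_loop(pool, train, fit, evaluate, select, n=100):
    # Flowchart from the slide: seed with a random batch, then repeatedly
    # move n ISS-selected instances from the pool to the training set,
    # retrain, and record precision @ k% until the pool is exhausted.
    history = []
    batch = random.sample(pool, min(n, len(pool)))  # step 1: random batch
    while batch:
        for x in batch:
            pool.remove(x)     # remove selected instances from the pool
            train.append(x)    # add them, with labels, to the training set
        model = fit(train)                 # retrain the classifier
        history.append(evaluate(model))    # precision @ k% on the test set
        batch = select(model, pool, min(n, len(pool))) if pool else []
    return history
```

With the experimental setup above, n = 100 and `evaluate` would report precision at the top 1%, 2%, and 5%.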

Page 15

How do existing ISS perform?

Claims data set

Page 16

How do existing ISS perform?

Claims data set

Page 17

Experiments on more datasets

• KDD Cup 1999 dataset for network intrusion detection. I use the "probing" intrusion as the label.
• HIVA is a chemoinformatics dataset used to predict which compounds are active against the AIDS HIV infection.
• ZEBRA is an embryology dataset that provides a feature representation of zebrafish embryo cells to determine whether they are in division (meiosis) or not.

Page 18

How do existing ISS perform?

ZEBRA data set

Page 19

Do existing ISS work?

• No ISS is consistently the best in all domains and at all precision levels
• Creating a validation set is challenging since labeled data are scarce and expensive to obtain

We propose an unsupervised score that can predict the performance of an ISS without using any additional labeled examples.

Page 20

Proposed Unsupervised Scores

• MS on Unlabeled set (MSU): mean score of the top k% instances in the unlabeled set
• MS on Labeled set (MSL): mean score of the top k% instances in the labeled set from the previous iteration
• MS on All (MSA): mean score of the top k% instances in the combined set (the unlabeled set plus the labeled set from the last iteration)
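The three scores above share one computation, the mean classifier score over the top k% of instances, applied to different sets. A minimal sketch with illustrative score values:

```python
import numpy as np

def mean_top_score(scores, k):
    # Mean classifier score over the top k% highest-scoring instances.
    n = max(1, int(len(scores) * k / 100))
    return float(np.sort(scores)[-n:].mean())

# Illustrative classifier scores (not real data).
unlabeled = np.array([0.9, 0.2, 0.5, 0.1])  # scores on the unlabeled pool
labeled_prev = np.array([0.6, 0.3])         # scores on last iteration's labeled set

msu = mean_top_score(unlabeled, 50)                                  # MSU
msl = mean_top_score(labeled_prev, 50)                               # MSL
msa = mean_top_score(np.concatenate([unlabeled, labeled_prev]), 50)  # MSA
```

Because no labels are consulted, all three scores can be tracked at every iteration at no extra labeling cost.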

Page 21

Do the new unsupervised scores work?


• The graphs show high correlation between the score and precision.

Certainty on Claims data set

Page 22

Do they work?


• The correlation values are promising

Page 23

Can we use the unsupervised score to predict the best ISS in each iteration?

• The online algorithm has two components:
  – The unsupervised score (MSU), which can track the performance of an individual ISS without using any validation set
  – A simple online algorithm that uses MSU to switch between different strategies
• The existing unsupervised approach (baseline):
  – CEM (Classification Entropy Maximization) as the score
  – A multi-armed bandit algorithm for switching between ISS
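The slides do not spell out the exact switching rule, so the following is only a hedged sketch of the second component, assuming a greedy rule: pick the strategy whose MSU improved the most over the last iteration. All names are illustrative.

```python
def pick_strategy(msu_history, strategies):
    # Hypothetical sketch of the switching component: choose the ISS
    # whose MSU score increased the most between the last two iterations.
    # msu_history maps each strategy name to its list of past MSU values.
    best, best_delta = strategies[0], float("-inf")
    for s in strategies:
        h = msu_history.get(s, [])
        delta = h[-1] - h[-2] if len(h) >= 2 else 0.0
        if delta > best_delta:
            best, best_delta = s, delta
    return best
```

A bandit-style rule (as in the CEM baseline) would instead trade off exploring under-tried strategies against exploiting the current best one.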

Page 24

Online Active Learning


Page 25

How does the online algorithm work?

HIVA data set

Page 26

Conclusion

• We propose an online algorithm for active learning that switches between different candidate ISS for classification on imbalanced data sets.
• This online algorithm has two components:
  – A score, MSU, that can track the performance of an individual ISS without using any validation set
  – A simple online algorithm that uses the change in MSU to switch between different strategies
• The online approach works better than (or at least similarly to) the best individual ISS and achieves 80%–100% of the highest possible precision.

Page 27

Questions


Page 28

References

[1] Active learning challenge.
[2] KDD Cup 1999.
[3] J. Attenberg and F. Provost. Inactive learning?: Difficulties employing active learning in practice. SIGKDD Explorations Newsletter, 12, March 2011.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[5] Y. Baram, R. El-Yaniv, K. Luz, and M. Warmuth. Online choice of active learning algorithms. Journal of Machine Learning Research, 2004.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2001.
[7] P. Donmez and J. G. Carbonell. Active sampling for rank learning via optimizing the area under the ROC curve. In Proceedings of the 31st European Conference on IR Research (ECIR '09), pages 78–89, Berlin, Heidelberg, 2009. Springer.
[8] P. Donmez, J. G. Carbonell, and P. N. Bennett. Dual strategy active learning. In ECML, 2007.

Page 29

References (cont.)

[9] J. He and J. Carbonell. Nearest-neighbor-based active learning for rare category detection. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
[10] M. Kumar, R. Ghani, and Z.-S. Mei. Data mining to predict and prevent errors in health insurance claims processing. In KDD '10, New York, USA, 2010.
[11] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 359–367. Morgan Kaufmann, 1998.
[12] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In ICML, 2004.
[13] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[14] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, 2008.
[15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 999–1006. Morgan Kaufmann, 2000.