Classifier Inspired Scaling for Training Set Selection
Walter Bennette
DISTRIBUTION A: Approved for public release; distribution unlimited. 16 May 2016. Case #88ABW-2016-2511
Outline
· Instance-based classification
· Training set selection
  - ENN
  - DROP3
  - CHC
· Scaling approaches
  - Stratified
  - Classifier inspired
· Experimental results
Instance-based classification
[Figure sequence illustrating instance-based classification]
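To make the illustrated idea concrete, here is a minimal k-nearest-neighbor classifier, the canonical instance-based method. This is an illustrative sketch of my own, not code from the talk; the name knn_predict is assumed.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Distance from the query point to every stored training instance
    dists = np.linalg.norm(X_train - x, axis=1)
    # The k closest instances decide the class by majority vote
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny example: two clusters, one query point
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9])))  # -> 1

Note that the classifier does no training at all: every prediction scans the stored training data, which is exactly why the amount of retained data matters.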
Instance-based classification
What are they used for?
· Classification of gene expression
· Content-based image retrieval
· Text categorization
· Load forecasting assistant for a power company
Instance-based classification
What if there is a large amount of data?
What if there is a huge amount of data?
What if there is a serious amount of data?
Training set selection (TSS)
· Instead of maintaining all of the training data
· Keep only certain necessary data points
Edited Nearest Neighbors (ENN)
Formulation:
· An instance is removed from the training data if it does not agree with the majority of its k nearest neighbors
Effect:
· Makes decision boundaries smoother
· Doesn't remove much data
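A direct sketch of the rule above, written by me for illustration; it assumes Euclidean distance and that every majority vote is taken against the full, unedited training set:

import numpy as np

def enn_filter(X, y, k=3):
    # Keep an instance only if it agrees with the majority label
    # of its k nearest neighbors in the full training set
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # an instance is not its own neighbor
        neighbors = np.argsort(dists)[:k]
        labels, counts = np.unique(y[neighbors], return_counts=True)
        if labels[np.argmax(counts)] == y[i]:
            keep.append(i)
    return X[keep], y[keep]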
Edited Nearest Neighbors (ENN)
[Figure]
DROP3
Formulation:
DROP3 (training set TR): returns selection set S
  Let S = TR after applying ENN
  For each instance Xi in S:
    Find the k+1 nearest neighbors of Xi in S
    Add Xi to each of its neighbors' lists of associates
  For each instance Xi in S:
    Let with = # of associates of Xi classified correctly with Xi as a neighbor
    Let without = # of associates of Xi classified correctly without Xi
    If without ≥ with:
      Remove Xi from S
      For each associate a of Xi:
        Remove Xi from a's list of neighbors
        Find a new nearest neighbor for a
        Add a to its new neighbor's list of associates
  Return S
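The pseudocode translates fairly directly to Python. The sketch below is my own reading of it and reuses enn_filter from the ENN example above; like the slide's pseudocode, it omits the usual DROP3 step of ordering instances by distance to their nearest enemy.

import numpy as np

def drop3(X, y, k=3):
    # Step 1: noise filter, as in the pseudocode (enn_filter from the ENN sketch)
    X, y = enn_filter(X, y, k)
    n = len(X)
    in_S = np.ones(n, dtype=bool)
    # Pairwise distances; an instance is never its own neighbor
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)

    # Each instance keeps k+1 neighbors so a replacement exists after a removal
    neighbors = [list(np.argsort(D[i])[:k + 1]) for i in range(n)]
    associates = [set() for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            associates[j].add(i)

    def vote(a, exclude=None):
        # Majority label among a's k nearest surviving neighbors
        cand = [j for j in neighbors[a] if in_S[j] and j != exclude][:k]
        if not cand:
            return None
        labels, counts = np.unique(y[cand], return_counts=True)
        return labels[np.argmax(counts)]

    for i in range(n):
        assoc = list(associates[i])
        with_i = sum(vote(a) == y[a] for a in assoc)
        without_i = sum(vote(a, exclude=i) == y[a] for a in assoc)
        if without_i >= with_i:
            in_S[i] = False                    # remove Xi from S
            for a in assoc:                    # repair each associate's neighbor list
                if i in neighbors[a]:
                    neighbors[a].remove(i)
                for j in np.argsort(D[a]):     # next nearest surviving instance
                    if in_S[j] and j != a and j not in neighbors[a]:
                        neighbors[a].append(int(j))
                        associates[int(j)].add(a)
                        break
    return X[in_S], y[in_S]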
DROP3
Formulation:
· Iterative procedure that compares the classification accuracy of each instance's associates with and without that instance
Effect:
· Removes much more data than ENN
· Maintains acceptable accuracy
DROP3
[Figure]
Genetic algorithm (CHC)
Formulation:
· A chromosome is a subset of the training data
· A binary gene represents each instance
· Fitness = α · Accuracy + (1 - α) · Reduction
Effectiveness:
· Removes a large amount of data
· Achieves acceptable accuracy
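CHC's crossover and restart machinery is beyond a slide-sized example, but the fitness function above is easy to sketch. The scorer below is my own illustration; the leave-one-out-style accuracy estimate is an assumption, since the slides only give the weighted sum.

import numpy as np

def fitness(chromosome, X, y, alpha=0.5, k=3):
    # Fitness = alpha * Accuracy + (1 - alpha) * Reduction
    selected = np.flatnonzero(chromosome)   # indices with gene = 1
    if len(selected) == 0:
        return 0.0
    correct = 0
    for i in range(len(X)):
        cand = selected[selected != i]      # an instance cannot vote for itself
        if len(cand) == 0:
            continue
        d = np.linalg.norm(X[cand] - X[i], axis=1)
        nearest = cand[np.argsort(d)[:k]]
        labels, counts = np.unique(y[nearest], return_counts=True)
        correct += labels[np.argmax(counts)] == y[i]
    accuracy = correct / len(X)
    reduction = 1.0 - len(selected) / len(X)
    return alpha * accuracy + (1.0 - alpha) * reduction

With α near 1 the search favors accuracy; with α near 0 it favors throwing data away, which is the trade-off the genetic search explores.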
Genetic algorithm (CHC)
[Figure]
Scaling
· As datasets grow, TSS becomes more and more expensive
· May be prohibitive
· The vast majority of scaling approaches rely on a stratified approach (see the code sketch after the figures)
No scaling
[Figure]
Stratified scaling
[Figure]
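A minimal sketch of the stratified idea, assuming scikit-learn's StratifiedKFold to form class-balanced strata; tss can be any selection routine, such as enn_filter above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_tss(X, y, tss, n_strata=10, seed=0):
    # Partition the data into class-balanced strata, run TSS on each
    # stratum independently, then pool the selected instances
    skf = StratifiedKFold(n_splits=n_strata, shuffle=True, random_state=seed)
    keep_X, keep_y = [], []
    for _, stratum in skf.split(X, y):     # each held-out fold is one stratum
        Xs, ys = tss(X[stratum], y[stratum])
        keep_X.append(Xs)
        keep_y.append(ys)
    return np.vstack(keep_X), np.concatenate(keep_y)

Because TSS cost is typically superlinear in training-set size, running it on many small strata is far cheaper than running it once on everything, at the price of each stratum seeing only part of the data.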
Representative Data Detection (ReDD)
· Lin et al. 2015
· Used for support vector machines and did not consider data reduction
Our approach
Classifier inspired approach:
· Based heavily on ReDD
· Used for kNN, and monitors data reduction
The filter
The "Balance" dataset:
· Determine scale positions
  - Balanced
  - Leaning right
  - Leaning left
· Attributes
  - Left weight
  - Left distance
  - Right weight
  - Right distance
The filter
[Figure sequence]
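The slides do not spell out the filter's mechanics, but given that the approach is based heavily on ReDD and that the experiments learn a Random Forest for the filter on a one-third split, one plausible reading is: run TSS on the small split, label each of its instances keep or discard, train the forest to predict that label, and apply it to the remaining two-thirds. The sketch below is my own reconstruction under those assumptions; all names are mine, and tss_mask stands in for any selection rule that returns a boolean keep mask (the ENN rule above is one example).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classifier_inspired_tss(X, y, tss_mask, filter_frac=1/3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_small = int(filter_frac * len(X))
    small, rest = idx[:n_small], idx[n_small:]

    # Run the expensive TSS method only on the small split;
    # its keep/discard decisions become the filter's training labels
    kept = tss_mask(X[small], y[small])

    # The filter sees each instance's attributes plus its class label
    # (showing the label to the filter is an assumption, not from the slides)
    F = np.hstack([X, y.reshape(-1, 1).astype(float)])
    filt = RandomForestClassifier(n_estimators=100, random_state=seed)
    filt.fit(F[small], kept)

    # Predict keep/discard for the remainder instead of running TSS on it
    keep_rest = rest[filt.predict(F[rest]).astype(bool)]
    selected = np.concatenate([small[kept], keep_rest])
    return X[selected], y[selected]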
Experimentation
Parameters:
· Learn a Random Forest for the filter
· Split the data into 1/3 and 2/3 portions
Design:
· Perform for ENN, CHC, and DROP3 with 3-NN
· Compare no scaling, stratified, and classifier inspired
· Calculate reduction, accuracy, and computation time with 10-fold CV
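A sketch of the measurement loop, assuming scikit-learn; tss stands in for any of the selection routines above, and the three returned numbers correspond to the Reduction, Accuracy, and Time slides that follow:

import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, tss, k=3, folds=10, seed=0):
    # Per fold: time the TSS call, record its reduction, and score a
    # k-NN classifier trained on only the selected subset
    accs, reds, times = [], [], []
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        t0 = time.perf_counter()
        Xs, ys = tss(X[train], y[train])
        times.append(time.perf_counter() - t0)
        reds.append(1 - len(Xs) / len(train))
        clf = KNeighborsClassifier(n_neighbors=k).fit(Xs, ys)
        accs.append(clf.score(X[test], y[test]))
    return np.mean(reds), np.mean(accs), np.sum(times)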
Datasets
· 10 experimental datasets from KEEL
Reduction
[Figure]
Accuracy
[Figure]
Time
[Figure]
Results
· Maintains accuracy (mostly)
· Maintains data reduction
· Slower than the stratified approach, but may improve for larger datasets
Future work
· Perform for many more datasets
· Apply to very large datasets
· Investigate whether damage can be spotted a priori
Conclusion
· Promising candidate for scaling Training Set Selection to large datasets