
Classifier Inspired Scaling for Training Set Selection

Walter Bennette

DISTRIBUTION A: Approved for public release; distribution unlimited. 16 May 2016. Case #88ABW-2016-2511

Outline

· Instance-based classification
· Training set selection
  - ENN
  - DROP3
  - CHC
· Scaling approaches
  - Stratified
  - Classifier inspired
· Experimental results

Instance-based classification

What if there is a large amount of data?

What if there is a huge amount of data?

What if there is a serious amount of data?
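Instance-based classifiers such as k-nearest neighbors (kNN) keep the entire training set and defer all work to prediction time, so every query must be compared against every stored instance. A minimal brute-force sketch (mine, not from the slides; standard kNN in Python) makes the cost visible:

    import numpy as np

    def knn_predict(X, y, query, k=3):
        # Brute-force kNN vote: one distance per stored instance, so the cost
        # of every single prediction grows linearly with the training set size.
        d = np.linalg.norm(X - query, axis=1)          # distance to every stored instance
        nn = np.argsort(d)[:k]                         # indices of the k closest instances
        votes = np.bincount(y[nn], minlength=y.max() + 1)
        return votes.argmax()                          # majority class among the neighbors

With n stored instances, each prediction costs O(n) distance computations, which is exactly why the growing-data questions above bite.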

Training set selection (TSS)

· Instead of maintaining all of the training data
· Keep only the necessary data points

Edited Nearest Neighbors (ENN)

Formulation:
· An instance is removed from the training data if it does not agree with the majority of its k nearest neighbors

Effect:
· Makes decision boundaries smoother
· Doesn't remove much data
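A compact Python sketch of the ENN rule above (mine; unoptimized O(n²)). k = 3 matches the 3-NN setting used later in the experiments:

    import numpy as np

    def enn_filter(X, y, k=3):
        # Edited Nearest Neighbors: keep an instance only if it agrees with the
        # majority vote of its k nearest neighbors in the rest of the data.
        keep = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                              # an instance is not its own neighbor
            nn = np.argsort(d)[:k]
            votes = np.bincount(y[nn], minlength=y.max() + 1)
            if votes.argmax() == y[i]:                 # neighborhood majority agrees: keep
                keep.append(i)
        return np.array(keep)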


DROP3

Formulation:

    DROP3(Training set TR): Selection set S
        Let S = TR after applying ENN.
        For each instance Xi in S:
            Find the k+1 nearest neighbors of Xi in S.
            Add Xi to each of those neighbors' lists of associates.
        For each instance Xi in S:
            Let with = # of associates of Xi classified correctly with Xi as a neighbor.
            Let without = # of associates of Xi classified correctly without Xi.
            If without ≥ with:
                Remove Xi from S.
                For each associate a of Xi:
                    Remove Xi from a's list of nearest neighbors.
                    Find a new nearest neighbor for a.
                    Add a to its new neighbor's list of associates.
        Return S.

· Iterative procedure that compares the accuracy of an instance's associates with and without that instance

Effect:
· Removes much more data than ENN
· Maintains acceptable accuracy
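A brute-force Python rendering of the pseudocode above (mine; reuses enn_filter from the ENN sketch). Instead of maintaining associate lists, it recomputes neighbor votes over all retained instances at O(n²) cost per check; this yields the same removal decision, because removing Xi only changes the predictions of points that had Xi as a neighbor, and the pseudocode's bookkeeping exists purely to avoid that cost. Note that full DROP3 also orders instances by distance to their nearest enemy before this pass, a detail the slide's pseudocode and this sketch both omit:

    import numpy as np

    def vote(X, y, pool, target, k=3):
        # Majority class of `target` among its k nearest neighbors drawn from `pool`.
        pool = [j for j in pool if j != target]
        d = np.linalg.norm(X[pool] - X[target], axis=1)
        nn = [pool[j] for j in np.argsort(d)[:k]]
        return np.bincount(y[nn], minlength=y.max() + 1).argmax()

    def drop3(X, y, k=3):
        S = list(enn_filter(X, y, k))                  # step 1: ENN noise filter
        for i in list(S):                              # one pass, as in the pseudocode
            others = [j for j in S if j != i]
            with_i = sum(vote(X, y, S, j, k) == y[j] for j in others)
            without_i = sum(vote(X, y, others, j, k) == y[j] for j in others)
            if without_i >= with_i:                    # others do at least as well without Xi
                S = others                             # so remove Xi from the selection
        return np.array(S)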


Genetic algorithm (CHC)

Formulation:
· A chromosome is a subset of the training data
· A binary gene represents each instance
· Fitness = α · Accuracy + (1 − α) · Reduction

Effectiveness:
· Removes a large amount of data
· Achieves acceptable accuracy
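The fitness function is what drives the genetic search: α trades classification accuracy against data reduction. A sketch of evaluating one chromosome (mine; α = 0.5 is an assumed weighting, since the slides give no value, and the vote helper from the DROP3 sketch is reused):

    import numpy as np

    def fitness(mask, X, y, alpha=0.5, k=3):
        # Fitness = alpha * Accuracy + (1 - alpha) * Reduction for one chromosome.
        # `mask` is the binary gene string: mask[i] == 1 keeps training instance i.
        S = list(np.flatnonzero(mask))
        if not S:
            return 0.0                                 # an empty subset classifies nothing
        correct = sum(vote(X, y, S, i, k) == y[i] for i in range(len(X)))
        accuracy = correct / len(X)                    # kNN accuracy using only the subset
        reduction = 1 - len(S) / len(X)                # fraction of training data removed
        return alpha * accuracy + (1 - alpha) * reduction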


Scaling

· As datasets grow, TSS becomes more and more expensive
· May be prohibitive
· The vast majority of scaling approaches rely on a stratified approach

No scaling

Stratified scaling
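The stratified idea: partition the training data into disjoint strata, run the expensive TSS method independently on each small stratum, and take the union of the selections. A minimal sketch (mine; random strata for brevity, whereas stratified TSS normally keeps each stratum's class proportions close to the full dataset's):

    import numpy as np

    def stratified_tss(X, y, tss, n_strata=5, seed=0):
        # Run `tss` (e.g. the drop3 sketch) on disjoint strata and union the results.
        # `tss(X, y)` must return indices, local to its input, of the kept instances.
        rng = np.random.default_rng(seed)
        strata = np.array_split(rng.permutation(len(X)), n_strata)
        selected = []
        for stratum in strata:                         # each run sees only ~n/n_strata points
            keep_local = np.asarray(tss(X[stratum], y[stratum]), dtype=int)
            selected.extend(stratum[keep_local])       # map local indices back to global ones
        return np.array(sorted(selected))

For example, stratified_tss(X, y, drop3) runs the DROP3 sketch on each stratum instead of on the whole dataset at once.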

Representative Data Detection (ReDD)

· Lin et al. 2015
· Used for support vector machines and did not consider data reduction

Our approach

Classifier inspired approach
· Based heavily on ReDD
· Used for kNN, and monitors data reduction

The filter

The "Balance" dataset
· Determine scale positions
  - Balanced
  - Leaning right
  - Leaning left
· Attributes
  - Left weight
  - Left distance
  - Right weight
  - Right distance


Experimentation

Parameters:
· Learn a Random Forest for the filter
· Split the data into 1/3 and 2/3 portions (see the sketch below)

Design:
· Perform TSS with ENN, CHC, and DROP3, using 3-NN
· Compare no scaling, stratified, and classifier inspired scaling
· Calculate reduction, accuracy, and computation time with 10-fold CV
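Putting those parameters together, my reading of the classifier-inspired (ReDD-style) pipeline is: run the expensive TSS method only on the 1/3 split, train a Random Forest "filter" to imitate its keep/remove decisions, and apply that cheap filter to the remaining 2/3. This is a sketch under assumptions, the filter's input features among them (the slides do not say whether the class label is an input; attributes alone are used here):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def classifier_inspired_tss(X, y, tss, seed=0):
        # Learn TSS keep/remove decisions on 1/3 of the data, then apply the
        # learned filter to the other 2/3 instead of running TSS on it.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))
        small, rest = order[:len(X) // 3], order[len(X) // 3:]
        keep_local = np.asarray(tss(X[small], y[small]), dtype=int)
        kept = np.zeros(len(small), dtype=int)
        kept[keep_local] = 1                           # target: 1 = kept by TSS, 0 = removed
        filt = RandomForestClassifier(random_state=seed).fit(X[small], kept)
        rest_kept = rest[filt.predict(X[rest]) == 1]   # cheap filtering of the large split
        return np.concatenate([small[keep_local], rest_kept])

Any of the earlier TSS sketches drops straight in, e.g. classifier_inspired_tss(X, y, drop3).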

Datasets

· 10 experimental datasets from KEEL

Result figures: reduction, accuracy, and time

Results

· Maintains accuracy (mostly)
· Maintains data reduction
· Slower than the stratified approach, but may improve for larger datasets

Future work

· Perform experiments on many more datasets
· Apply to very large datasets
· Investigate whether accuracy damage can be spotted a priori

Conclusion

Promising candidate for scaling Training Set Selection to large datasets


Questions

Walter Bennette walter.bennette.1@us.af.mil 315-330-4957
