Classifier Inspired Scaling for Training Set Selection
Walter Bennette
DISTRIBUTION A: Approved for public release; distribution unlimited. 16 May 2016. Case #88ABW-2016-2511
Outline
· Instance-based classification
· Training set selection
  - ENN
  - DROP3
  - CHC
· Scaling approaches
  - Stratified
  - Classifier inspired
· Experimental results
Instance-based classification
[Figure sequence illustrating instance-based classification]
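To make the illustrated idea concrete, here is a minimal k-nearest-neighbor classifier, the canonical instance-based method. This is an illustrative sketch of my own, not code from the talk; the name knn_predict is assumed.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Distance from the query point to every stored training instance
    dists = np.linalg.norm(X_train - x, axis=1)
    # The k closest instances decide the class by majority vote
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny example: two clusters, one query point
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9])))  # -> 1

Note that the classifier does no training at all: every prediction scans the stored training data, which is exactly why the amount of retained data matters.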
Instance-based classification
What are they used for?
· Classification of gene expression
· Content-based image retrieval
· Text categorization
· Load forecasting assistant for a power company
Instance-based classification
What if there is a large amount of data?
What if there is a huge amount of data?
What if there is a serious amount of data?
Training set selection (TSS)
· Instead of maintaining all of the training data
· Keep only certain necessary data points
Edited Nearest Neighbors (ENN)
Formulation:
· An instance is removed from the training data if it does not agree with the majority of its k nearest neighbors
Effect:
· Makes decision boundaries smoother
· Doesn't remove much data
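A direct sketch of the rule above, written by me for illustration; it assumes Euclidean distance and that every majority vote is taken against the full, unedited training set:

import numpy as np

def enn_filter(X, y, k=3):
    # Keep an instance only if it agrees with the majority label
    # of its k nearest neighbors in the full training set
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # an instance is not its own neighbor
        neighbors = np.argsort(dists)[:k]
        labels, counts = np.unique(y[neighbors], return_counts=True)
        if labels[np.argmax(counts)] == y[i]:
            keep.append(i)
    return X[keep], y[keep]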
Edited Nearest Neighbors (ENN)
[Figure]
DROP3
Formulation:
DROP3 (training set TR): returns selection set S
  Let S = TR after applying ENN
  For each instance Xi in S:
    Find the k+1 nearest neighbors of Xi in S
    Add Xi to each of its neighbors' lists of associates
  For each instance Xi in S:
    Let with = # of associates of Xi classified correctly with Xi as a neighbor
    Let without = # of associates of Xi classified correctly without Xi
    If without ≥ with:
      Remove Xi from S
      For each associate a of Xi:
        Remove Xi from a's list of neighbors
        Find a new nearest neighbor for a
        Add a to its new neighbor's list of associates
  Return S
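The pseudocode translates fairly directly to Python. The sketch below is my own reading of it and reuses enn_filter from the ENN example above; like the slide's pseudocode, it omits the usual DROP3 step of ordering instances by distance to their nearest enemy.

import numpy as np

def drop3(X, y, k=3):
    # Step 1: noise filter, as in the pseudocode (enn_filter from the ENN sketch)
    X, y = enn_filter(X, y, k)
    n = len(X)
    in_S = np.ones(n, dtype=bool)
    # Pairwise distances; an instance is never its own neighbor
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)

    # Each instance keeps k+1 neighbors so a replacement exists after a removal
    neighbors = [list(np.argsort(D[i])[:k + 1]) for i in range(n)]
    associates = [set() for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            associates[j].add(i)

    def vote(a, exclude=None):
        # Majority label among a's k nearest surviving neighbors
        cand = [j for j in neighbors[a] if in_S[j] and j != exclude][:k]
        if not cand:
            return None
        labels, counts = np.unique(y[cand], return_counts=True)
        return labels[np.argmax(counts)]

    for i in range(n):
        assoc = list(associates[i])
        with_i = sum(vote(a) == y[a] for a in assoc)
        without_i = sum(vote(a, exclude=i) == y[a] for a in assoc)
        if without_i >= with_i:
            in_S[i] = False                    # remove Xi from S
            for a in assoc:                    # repair each associate's neighbor list
                if i in neighbors[a]:
                    neighbors[a].remove(i)
                for j in np.argsort(D[a]):     # next nearest surviving instance
                    if in_S[j] and j != a and j not in neighbors[a]:
                        neighbors[a].append(int(j))
                        associates[int(j)].add(a)
                        break
    return X[in_S], y[in_S]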
DROP3
Formulation:
· Iterative procedure that compares the classification accuracy of each instance's associates with and without that instance
Effect:
· Removes much more data than ENN
· Maintains acceptable accuracy
DROP3
[Figure]
Genetic algorithm (CHC)
Formulation:
· A chromosome is a subset of the training data
· A binary gene represents each instance
· Fitness = α · Accuracy + (1 - α) · Reduction
Effectiveness:
· Removes a large amount of data
· Achieves acceptable accuracy
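CHC's crossover and restart machinery is beyond a slide-sized example, but the fitness function above is easy to sketch. The scorer below is my own illustration; the leave-one-out-style accuracy estimate is an assumption, since the slides only give the weighted sum.

import numpy as np

def fitness(chromosome, X, y, alpha=0.5, k=3):
    # Fitness = alpha * Accuracy + (1 - alpha) * Reduction
    selected = np.flatnonzero(chromosome)   # indices with gene = 1
    if len(selected) == 0:
        return 0.0
    correct = 0
    for i in range(len(X)):
        cand = selected[selected != i]      # an instance cannot vote for itself
        if len(cand) == 0:
            continue
        d = np.linalg.norm(X[cand] - X[i], axis=1)
        nearest = cand[np.argsort(d)[:k]]
        labels, counts = np.unique(y[nearest], return_counts=True)
        correct += labels[np.argmax(counts)] == y[i]
    accuracy = correct / len(X)
    reduction = 1.0 - len(selected) / len(X)
    return alpha * accuracy + (1.0 - alpha) * reduction

With α near 1 the search favors accuracy; with α near 0 it favors throwing data away, which is the trade-off the genetic search explores.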
Genetic algorithm (CHC)
[Figure]
Scaling
· As datasets grow, TSS becomes more and more expensive
· May be prohibitive
· The vast majority of scaling approaches rely on a stratified approach (see the code sketch after the figures)
No scaling
[Figure]
Stratified scaling
[Figure]
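A minimal sketch of the stratified idea, assuming scikit-learn's StratifiedKFold to form class-balanced strata; tss can be any selection routine, such as enn_filter above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_tss(X, y, tss, n_strata=10, seed=0):
    # Partition the data into class-balanced strata, run TSS on each
    # stratum independently, then pool the selected instances
    skf = StratifiedKFold(n_splits=n_strata, shuffle=True, random_state=seed)
    keep_X, keep_y = [], []
    for _, stratum in skf.split(X, y):     # each held-out fold is one stratum
        Xs, ys = tss(X[stratum], y[stratum])
        keep_X.append(Xs)
        keep_y.append(ys)
    return np.vstack(keep_X), np.concatenate(keep_y)

Because TSS cost is typically superlinear in training-set size, running it on many small strata is far cheaper than running it once on everything, at the price of each stratum seeing only part of the data.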
Representative Data Detection (ReDD)
· Lin et al. 2015
· Used for support vector machines and did not consider data reduction
Our approach
Classifier inspired approach:
· Based heavily on ReDD
· Used for kNN, and monitors data reduction
The filter
The "Balance" dataset:
· Determine scale positions
  - Balanced
  - Leaning right
  - Leaning left
· Attributes
  - Left weight
  - Left distance
  - Right weight
  - Right distance
The filter
[Figure sequence]
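The slides do not spell out the filter's mechanics, but given that the approach is based heavily on ReDD and that the experiments learn a Random Forest for the filter on a one-third split, one plausible reading is: run TSS on the small split, label each of its instances keep or discard, train the forest to predict that label, and apply it to the remaining two-thirds. The sketch below is my own reconstruction under those assumptions; all names are mine, and tss_mask stands in for any selection rule that returns a boolean keep mask (the ENN rule above is one example).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classifier_inspired_tss(X, y, tss_mask, filter_frac=1/3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_small = int(filter_frac * len(X))
    small, rest = idx[:n_small], idx[n_small:]

    # Run the expensive TSS method only on the small split;
    # its keep/discard decisions become the filter's training labels
    kept = tss_mask(X[small], y[small])

    # The filter sees each instance's attributes plus its class label
    # (showing the label to the filter is an assumption, not from the slides)
    F = np.hstack([X, y.reshape(-1, 1).astype(float)])
    filt = RandomForestClassifier(n_estimators=100, random_state=seed)
    filt.fit(F[small], kept)

    # Predict keep/discard for the remainder instead of running TSS on it
    keep_rest = rest[filt.predict(F[rest]).astype(bool)]
    selected = np.concatenate([small[kept], keep_rest])
    return X[selected], y[selected]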
Experimentation
Parameters:
· Learn a Random Forest for the filter
· Split the data into 1/3 and 2/3 portions
Design:
· Perform for ENN, CHC, and DROP3 with 3-NN
· Compare no scaling, stratified, and classifier inspired
· Calculate reduction, accuracy, and computation time with 10-fold CV
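A sketch of the measurement loop, assuming scikit-learn; tss stands in for any of the selection routines above, and the three returned numbers correspond to the Reduction, Accuracy, and Time slides that follow:

import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, tss, k=3, folds=10, seed=0):
    # Per fold: time the TSS call, record its reduction, and score a
    # k-NN classifier trained on only the selected subset
    accs, reds, times = [], [], []
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        t0 = time.perf_counter()
        Xs, ys = tss(X[train], y[train])
        times.append(time.perf_counter() - t0)
        reds.append(1 - len(Xs) / len(train))
        clf = KNeighborsClassifier(n_neighbors=k).fit(Xs, ys)
        accs.append(clf.score(X[test], y[test]))
    return np.mean(reds), np.mean(accs), np.sum(times)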
Datasets
· 10 experimental datasets from KEEL
Reduction
[Figure]
Accuracy
[Figure]
Time
[Figure]
Results
· Maintains accuracy (mostly)
· Maintains data reduction
· Slower than the stratified approach, but may improve for larger datasets
Future work
· Perform for many more datasets
· Apply to very large datasets
· Investigate whether damage can be spotted a priori
Conclusion
· Promising candidate for scaling Training Set Selection to large datasets