Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms From Ch 8 of From Ch 8 of Instace selection and Costruction for Data Mining Instace selection and Costruction for Data Mining (2001) (2001) By Carlos By Carlos Domingo et.al., Kruwer Academic Publishers (Summarized by Jinsan Yang, SNU Biointelligence Lab)


Page 1

Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms

From Ch. 8 of Instance Selection and Construction for Data Mining (2001)

By Carlos Domingo et al., Kluwer Academic Publishers
(Summarized by Jinsan Yang, SNU Biointelligence Lab)

Page 2

Abstract

Methods for handling large amounts of data

An adaptive sampling method instead of random (batch) sampling

Keywords: Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds

Page 3

Outline

Introduction

General Rule Selection Problem

Adaptive Sampling Algorithm

An Application of AdaSelect
  Problem and Algorithm
  Experiments

Concluding Remarks

Page 4

Introduction (1)

Analysis of large data; two common approaches:

Redesign a known algorithm

Reduce the data size

A typical task in data mining: finding or selecting rules or laws (General Rule Selection)

General Rule Selection by random sampling (batch sampling)

Proper sample size: determined by concentration (deviation) bounds (Chernoff, Hoeffding bounds); see the sketch after this slide

Problems: an immense sample size is needed for good accuracy and confidence

For batch sampling, the sample size must be fixed a priori for the worst case, so it is overestimated in most situations
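To make the "immense sample size" problem concrete, here is a rough sketch (my own illustration, not a formula quoted from the chapter) of the worst-case batch size suggested by Hoeffding's inequality plus a union bound over the model set, when every utility must be estimated to within eps with confidence 1 - delta:

```python
import math

def batch_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Worst-case batch size from Hoeffding's inequality plus a union bound:
    with probability at least 1 - delta, every empirical utility is within
    eps of its true value on a sample of this size."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

# Even modest accuracy/confidence targets require a large fixed sample.
print(batch_sample_size(num_hypotheses=10_000, eps=0.01, delta=0.05))  # ~64,500
```

Because this size is fixed before any data are seen, it cannot take advantage of easy cases; avoiding this overestimate is exactly what the adaptive approach aims at.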

Page 5

Introduction (2)

Overcoming these problems:

Sampling in online sequential fashion (one by one or block by block)

Adaptive sample sizes (adaptive sampling)

Page 6

General Rule Selection Problem

Given data D (discrete, categorical instances) and a model set H,

select a model h with maximum utility U(h) (supervised learning); a formal statement is sketched below.
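One compact way to state the goal (my own formalization, using the epsilon and delta that appear in the reliability discussion on the next slide):

```latex
% Exact goal: the best model in H
h^{*} = \arg\max_{h \in H} U(h)

% Relaxed goal suited to sampling: with probability at least 1 - \delta,
% return \hat{h} whose utility is close to the best:
\Pr\left[\, U(\hat{h}) \ge (1-\epsilon)\, U(h^{*}) \,\right] \ge 1-\delta
```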

Page 7

Adaptive Sampling Algorithm (1)

Based on an extension of the Hoeffding bound.

Reliability of the algorithm: with probability at least $1-\delta$, the selected model $h$ satisfies $U(h) \ge (1-\epsilon)\,U(h_0)$, where $h_0$ is the model in $H$ with maximum utility.
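For reference, the standard Hoeffding bound being extended (a standard statement, included here only for context): for i.i.d. random variables $X_1,\dots,X_n$ taking values in $[0,1]$ with mean $\mu$,

```latex
\Pr\left[\, \left| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i - \mu \right| \ge \epsilon \,\right]
\le 2\,\exp\left(-2 n \epsilon^{2}\right)
```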

Page 8

Adaptive Sampling Algorithm (2)
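A minimal sketch of the adaptive sampling idea: draw examples block by block and stop as soon as a Hoeffding-style confidence radius certifies the current leader. The stopping condition and constants below are my own plausible choices, not necessarily the chapter's, and all names (`adaptive_select`, `draw_example`, `utility`) are illustrative.

```python
import math

def adaptive_select(draw_example, hypotheses, utility, eps=0.1, delta=0.05,
                    block=100, max_rounds=10_000):
    """Sequential sampling: stop once the empirical leader is provably
    within a (1 - eps) factor of the best true utility, w.p. >= 1 - delta.

    draw_example() returns one random example; utility(h, x) is in [0, 1],
    and U(h) is its expectation over the data distribution.
    """
    totals = {h: 0.0 for h in hypotheses}
    n = 0
    for t in range(1, max_rounds + 1):          # t indexes sampling rounds
        for _ in range(block):
            x = draw_example()
            n += 1
            for h in hypotheses:
                totals[h] += utility(h, x)
        best = max(hypotheses, key=lambda h: totals[h])
        u_hat = totals[best] / n
        # Hoeffding confidence radius with a union bound over all hypotheses
        # and rounds (sum_t 1/(t*(t+1)) = 1 keeps total failure prob <= delta).
        alpha = math.sqrt(
            math.log(2 * len(hypotheses) * t * (t + 1) / delta) / (2 * n))
        # On the good event every empirical utility is within alpha of the
        # truth, so U(best) >= u_hat - alpha and U(h*) <= u_hat + alpha; the
        # test below then guarantees U(best) >= (1 - eps) * U(h*).
        if alpha <= eps * u_hat / (2 - eps):
            return best, n
    return best, n
```

For the rule-selection setting, utility(h, x) could simply be 1 when rule h is correct on example x and 0 otherwise, so that U(h) is the rule's accuracy.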

Page 9

An Application of AdaSelect (1)

AdaSelect can be applied as a tool for the General Rule Selection problem.

Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as the base learner. A decision stump is a single-split (depth-1) decision tree.

AdaBoost performs boosting by sub-sampling or by re-weighting.

Here, adaptive sampling is applied to the base learner (boosting by filtering).

MadaBoost is used, which keeps each example's weight bounded by its initial weight; a schematic of this weight capping follows.
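A schematic of the weight-capping idea (my sketch: MadaBoost's actual update uses per-round factors rather than the plain exponential margin used here, but the cap by the initial weight is the point being illustrated):

```python
import math

def madaboost_style_weight(initial_weight, margin):
    """Schematic weight capping: an AdaBoost-style exponential weight is
    never allowed to exceed the example's initial weight, which keeps it
    usable as an acceptance probability when boosting by filtering.

    `margin` is the weighted-vote margin of the current combined classifier
    on the example (positive when the example is classified correctly).
    """
    return min(initial_weight, initial_weight * math.exp(-margin))
```

When boosting by filtering, an example drawn from the original distribution can then be accepted with probability weight / initial_weight, which the cap keeps at most 1.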

Page 10

An Application of AdaSelect (2): Algorithm

Data: discrete instance vectors with class labels

Classification rule: decision stump

0-1 error measure; U: utility function (see the sketch below)
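As a concrete illustration (my own sketch; the utility is assumed here to be 0-1 correctness, matching the error measure above), the decision-stump hypothesis space over discrete attributes can be enumerated and scored per example like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stump:
    """Single-split rule: predict `label_if_match` when attribute `attr`
    equals `value`, otherwise predict the other class."""
    attr: int
    value: int
    label_if_match: int  # 0 or 1

    def predict(self, x):
        return self.label_if_match if x[self.attr] == self.value else 1 - self.label_if_match

def all_stumps(num_attrs, values_per_attr):
    """Enumerate every decision stump over the discrete attributes."""
    return [Stump(a, v, y)
            for a in range(num_attrs)
            for v in range(values_per_attr)
            for y in (0, 1)]

def utility(stump, example):
    """0-1 utility of a stump on one labeled example (x, y): 1 if correct."""
    x, y = example
    return 1.0 if stump.predict(x) == y else 0.0
```

With a utility of this form, the sequential sampling loop sketched earlier can select a near-best stump from the filtered example stream without fixing the sample size in advance.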

Page 11

An Application of AdaSelect (3): Experiments

Each continuous attribute is discretized into 5 intervals; missing values are treated as an additional value (see the sketch after this slide).

Original UCI datasets are artificially inflated (100 copies).

Only two-class problems are used.

10-fold cross-validation; results are averaged over 10 runs.

Machine: Alpha 600 MHz CPU, 250 MB memory, 4.3 GB hard disk, running Linux.

C4.5 and the Naïve Bayes classifier are used for comparison.

Boosting rounds: 10.

Number of all possible decision stumps: $|DS|$

$H$: the set of weighted majorities of ten depth-1 decision trees
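A minimal sketch of the preprocessing step above, assuming equal-width binning (the chapter does not state which binning rule was used); missing values get their own extra code:

```python
import numpy as np

def discretize(column, n_bins=5):
    """Map a numeric column to integer codes 0..n_bins-1 using equal-width
    bins; missing values (NaN) get the extra code n_bins."""
    col = np.asarray(column, dtype=float)
    mask = np.isnan(col)
    lo, hi = np.nanmin(col), np.nanmax(col)
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]        # interior bin edges
    codes = np.digitize(np.nan_to_num(col, nan=lo), edges)
    codes[mask] = n_bins                                  # "missing" category
    return codes

print(discretize([0.1, 2.5, np.nan, 7.0, 9.9]))  # -> [0 1 5 3 4]
```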

Page 12

An Application of AdaSelect (4)

Page 13

An Application of AdaSelect (5)

AdaSelect is faster than C4.5, and the speed advantage grows with larger sample sizes.

Page 14

Concluding Remarks

Justification and efficiency analysis of the adaptive sampling method

Applied in the design of a base learner for a boosting algorithm