Alessandro Magnani, Data Scientist, @WalmartLabs at MLconf SF - 11/13/15

Classification Labels in a Fast Moving Environment. Alessandro Magnani, @WalmartLabs, Walmart Global eCommerce, California, USA. Friday 13th November, 2015.


Classification Labels in a Fast Moving Environment

Alessandro Magnani, @WalmartLabs, Walmart Global eCommerce

California, USA

Friday 13th November, 2015

Classification Model Performance

[Diagram: items go through the classifier to give an estimate ŷi; an editor gives the true label yi for N sampled items; the evaluation compares the two and outputs accuracy]

◮ correctly evaluating classification models is critical and requires labels

◮ labeling products is expensive

◮ need to correctly and optimally use labels

Measure accuracy, common approach:

◮ sample uniformly at random N items

◮ compute accuracy $\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$
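The common approach above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `uniform_sample_accuracy` and the `true_label_of` callback (standing in for the editor who labels a sampled item) are hypothetical.

```python
import random

def uniform_sample_accuracy(predicted, true_label_of, n, seed=0):
    """Estimate accuracy from n items drawn uniformly at random.

    predicted      -- list of classifier outputs, one per item
    true_label_of  -- hypothetical callback: item index -> true label
                      (this is where the labeling cost is paid)
    """
    rng = random.Random(seed)
    sampled = rng.sample(range(len(predicted)), n)  # without replacement
    correct = sum(predicted[i] == true_label_of(i) for i in sampled)
    return correct / n

# Tiny usage example with made-up labels.
predicted = ["shoes", "toys", "shoes", "toys"]
truth = ["shoes", "shoes", "shoes", "shoes"]
print(uniform_sample_accuracy(predicted, lambda i: truth[i], n=4))
```

Only the `n` sampled items are ever passed to `true_label_of`, which is the point of sampling: the label budget is `n`, not the catalog size.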

Practical challenges

◮ items change over time

◮ evaluation required over multiple subsets

◮ existing labels potentially hard to reuse

A motivating example

compute accuracy over 1M items, 1K labels budget

◮ sample 1K items and get labels yi

◮ measure accuracy $\frac{1}{1\mathrm{K}}\sum_{i=1}^{1\mathrm{K}} \mathbf{1}\{\hat{y}_i = y_i\}$

[Figure: sampling probability p versus item index, tick at 1M, level 1/1K marked]

500K items added, compute accuracy on all 1.5M items

◮ use previous accuracy measure

◮ most likely inaccurate

[Figure: sampling probability p versus item index, ticks at 1M and 1.5M, level 1/1K marked]

500K items added, compute accuracy on all 1.5M items, 500 labels extra budget

◮ sample 500 items from the 1.5M

◮ compute accuracy on new 500 labels

◮ previous 1K labels “wasted”

[Figure: sampling probability p versus item index, ticks at 1M and 1.5M, level 1/3K marked]

500K items added, compute accuracy on all 1.5M items, 500 labels extra budget, better approach

◮ sample 500 items from new items

◮ compute accuracy on all 1.5K labels

◮ no label “wasted”

[Figure: sampling probability p versus item index, ticks at 1M and 1.5M, level 1/1K marked]

500K items added, compute accuracy on all 1.5M items, only 250 labels extra budget?

◮ sample 250 items from new items

◮ need to account for difference in sampling

◮ accuracy: $\frac{1}{1.5\mathrm{K}}\left(\sum_{i=1}^{1\mathrm{K}} \mathbf{1}\{\hat{y}_i = y_i\} + 2\sum_{i=1}^{250} \mathbf{1}\{\hat{y}^{\mathrm{new}}_i = y^{\mathrm{new}}_i\}\right)$

[Figure: sampling probability p versus item index, ticks at 1M and 1.5M, level 1/2K marked]
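The arithmetic behind the factor of 2 can be checked with a short Python sketch. The old items were sampled with probability 1K/1M = 1/1K and the new ones with 250/500K = 1/2K, so each new label stands for twice as many items. The counts of correct predictions used below (900 and 200) are invented for illustration, and the function names are hypothetical.

```python
import math

def slide_estimate(old_correct, new_correct):
    """Accuracy as written on the slide: (old + 2 * new) / 1.5K."""
    return (old_correct + 2 * new_correct) / 1500

def inverse_probability_estimate(old_correct, new_correct):
    """Same quantity, derived from per-item weights 1/p."""
    w_old, w_new = 1000.0, 2000.0          # 1 / (1/1K) and 1 / (1/2K)
    total_weight = 1000 * w_old + 250 * w_new   # = 1.5M, the catalog size
    return (w_old * old_correct + w_new * new_correct) / total_weight

def naive_pooled(old_correct, new_correct):
    """Ignores the sampling difference: treats all 1250 labels alike."""
    return (old_correct + new_correct) / 1250

# Made-up counts: 900 of 1000 old labels correct, 200 of 250 new labels correct.
print(slide_estimate(900, 200))                # about 0.867
print(inverse_probability_estimate(900, 200))  # same value
print(naive_pooled(900, 200))                  # 0.88, biased toward old items
```

The naive pooled figure overweights the old items because they make up 80% of the labels but only two thirds of the catalog.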

What are the challenges?

◮ sampling new test labels for every measure is generally expensive

◮ knowing how previous labels were sampled is required to optimally sample new items for test

◮ computing accuracy using all labels requires knowledge of the sampling profile

◮ over time, reusing labels can become very tricky

Evaluation framework

◮ pi is probability of item i to be selected for test (Bernoulli)

◮ each item carries pi and is marked if selected (store the sampling profile)

◮ accuracy: $\frac{1}{\sum_{i\ \mathrm{selected}} \frac{1}{p_i}} \sum_{i\ \mathrm{selected}} \frac{1}{p_i}\mathbf{1}\{\hat{y}_i = y_i\}$

◮ for evaluation to be possible pj > 0 for all j labeled/unlabeled

◮ all labels are used

◮ with uniform sampling this is simply “standard” accuracy

◮ very closely related to importance sampling
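The framework above can be sketched in Python as follows. This is an illustrative reading of the slide, not the speaker's implementation: the `Item` record, its field names, and both function names are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    p: float                          # probability of being selected for test
    predicted: str                    # classifier output
    selected: bool = False            # marked if selected (sampling profile)
    true_label: Optional[str] = None  # filled in by the editor once selected

def bernoulli_select(items, rng):
    """Mark each item independently with its own probability p."""
    for item in items:
        item.selected = rng.random() < item.p

def weighted_accuracy(items):
    """Accuracy over selected items, each weighted by 1/p."""
    numerator = 0.0
    denominator = 0.0
    for item in items:
        if item.selected:
            weight = 1.0 / item.p
            denominator += weight
            numerator += weight * (item.predicted == item.true_label)
    if denominator == 0.0:
        raise ValueError("no labeled items selected")
    return numerator / denominator

# Two labeled items sampled at different rates (made-up values).
items = [
    Item(p=0.1, predicted="toys", selected=True, true_label="toys"),
    Item(p=0.2, predicted="toys", selected=True, true_label="shoes"),
]
print(weighted_accuracy(items))  # weights 10 and 5, so 10 / 15
```

When every selected item has the same `p`, the weights cancel and the function returns the plain fraction of correct predictions, which is the "standard" accuracy mentioned above.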

given existing sampling pi and an extra budget, how do we sample?

◮ minimize accuracy variance with budget constraint

◮ can be formulated as an optimization problem

◮ easy to solve

it works as you’d expect as budget grows:

[Figure: two plots of sampling probability p, with the new budget shown in blue]

◮ new budget (blue) used more where pi is smaller

◮ given enough budget we obtain uniform sampling
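The talk does not spell out the optimization problem, so the following Python sketch rests on an assumed formulation: choose new overall selection probabilities q_i >= p_i that minimize the variance proxy sum(1/q_i), subject to the expected number of extra labels sum(q_i - p_i) staying within the budget. Under that assumption the solution is a "water-filling" rule, q_i = max(p_i, c), with the level c found by bisection. The function name and the variance proxy are hypothetical choices, not the speaker's method.

```python
def top_up_probabilities(p, extra_budget, iterations=60):
    """Raise the smallest probabilities to a common level c.

    p             -- existing selection probabilities, one per item
    extra_budget  -- expected number of additional labels we can afford
    Returns the new overall selection probabilities q (q[i] >= p[i]).
    """
    def expected_extra_labels(level):
        return sum(max(level - pi, 0.0) for pi in p)

    # Enough budget to label everything.
    if expected_extra_labels(1.0) <= extra_budget:
        return [1.0] * len(p)

    low, high = min(p), 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if expected_extra_labels(mid) > extra_budget:
            high = mid
        else:
            low = mid
    return [max(pi, low) for pi in p]

# Scaled-down version of the motivating example: 10 old items sampled
# with p = 0.1 and 5 new items not yet sampled.
existing = [0.1] * 10 + [0.0] * 5
print(top_up_probabilities(existing, extra_budget=0.25))  # new items near 0.05
print(top_up_probabilities(existing, extra_budget=1.0))   # all near 2/15
```

With the small budget, everything goes to the new items, where p is smallest; with the larger one, all items end up at the same level, matching the two observations above. To realize q_i in practice, an item not yet selected can be sampled with probability (q_i - p_i) / (1 - p_i), which makes its overall selection probability equal to q_i.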

Extensions

◮ framework works more generally for supervised learning

◮ framework can work with a wide range of different metrics

◮ optimal sampling can use model posterior to reduce variance

◮ this framework can be used on the training side together with active learning
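As one illustration of the "different metrics" point, the same 1/p weights can be applied to the counts behind precision and recall. The talk does not show this; the sketch below is an assumption about how the idea might carry over, and the function name and the `(p, predicted, true_label)` record layout are hypothetical.

```python
def weighted_precision_recall(records, positive):
    """Precision and recall for one class, each labeled item weighted by 1/p.

    records  -- iterable of (p, predicted, true_label) for selected items
    positive -- the class treated as the positive label
    """
    true_pos = false_pos = false_neg = 0.0
    for p, predicted, true_label in records:
        weight = 1.0 / p
        if predicted == positive and true_label == positive:
            true_pos += weight
        elif predicted == positive:
            false_pos += weight
        elif true_label == positive:
            false_neg += weight
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else float("nan")
    recall = true_pos / actual_pos if actual_pos else float("nan")
    return precision, recall

# Made-up labeled sample with two different sampling rates.
records = [
    (0.5, "toys", "toys"),
    (0.25, "toys", "shoes"),
    (0.5, "shoes", "toys"),
    (0.5, "shoes", "shoes"),
]
print(weighted_precision_recall(records, positive="toys"))
```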