shop vertical classification - meetup presentation
TRANSCRIPT
Background
• Large ecommerce platform
• 240K+ current customers
• Many more shops created (churned or didn’t make it to customer status)
Problem
● No information about their industry in most cases
1st solution
● Ask them
2nd solution
● We have HTML product descriptions for each shop
● We have labelled data (Mechanical Turk)
-> Classifier
Context
• Started during a Shopify Hack Day
• Pursued as a side project at work
• Used sk-learn and
• Moved to Spark MLlib for full scale testing and production
• Now in production
Getting Label Data
• Asked Amazon Mechanical Turkers to assess 80K stores
• They had to choose among 15 verticals
• Involved hundreds of turkers
80K shops
Shop Aggregated product data
1 “Nice octopolo shirt !…”
2 “Nice hat and nice shirt …”
3 “Set of <b> tires </b> …”
4 “Beef and more beef…”
5 “Tire set for bikes”
... ...
Input
80K shops
Shop Text
1 “nice octopolo shirt…”
2 “nice hat nice shirt…”
3 “set tire…”
4 “beef beef…”
5 “tire set bike”
... ...
Cleaning
• HTML code removed
• Stop words removed
• Words stemmed
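The three cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the `crude_stem` helper are hypothetical stand-ins (a real pipeline would use a full stop-word list and a proper stemmer), and the talk does not show the actual cleaning code.

```python
import re
from html import unescape

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "more", "of", "the"}


def crude_stem(word):
    """Rough stand-in for a real stemmer: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(html_text):
    """Return the list of cleaned tokens for one product description."""
    text = re.sub(r"<[^>]+>", " ", html_text)  # 1. remove HTML tags
    text = unescape(text).lower()              # decode entities, lowercase
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    # 2. drop stop words, 3. stem what is left
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean("Set of <b> tires </b>"))
print(clean("Beef and more beef"))
```

Applied to the example descriptions from the input slide, this gives token lists such as `set tire` and `beef beef`.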
Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  ...  label
1      1     1         1      0    0    0      0     0     ...  Apparel
2      2     0         1      1    0    0      0     0     ...  Apparel
3      0     0         0      0    1    1      0     0     ...  Auto
4      0     0         0      0    0    0      2     0     ...  Food
5      0     0         0      0    1    1      0     1     ...  Auto
...    ...   ...       ...    ...  ...  ...    ...   ...   ...  ...
10K words (8 in the example), 80K shops
Term Frequency
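As a minimal sketch of how such a term-frequency row could be built, the snippet below counts the words of each cleaned shop text against a fixed vocabulary. The `cleaned` dictionary and the 8-word `vocabulary` simply restate the toy example from the slides; the function name is an assumption, not the production code.

```python
from collections import Counter

# Toy data restating the slide example (5 shops, 8-word vocabulary).
vocabulary = ["nice", "octopolo", "shirt", "hat", "set", "tires", "beef", "bike"]
cleaned = {
    1: "nice octopolo shirt",
    2: "nice hat nice shirt",
    3: "set tires",
    4: "beef beef",
    5: "tires set bike",
}


def term_frequencies(text, vocabulary):
    """Count how often each vocabulary word occurs in one shop's text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]  # missing words count as 0


tf = {shop: term_frequencies(text, vocabulary) for shop, text in cleaned.items()}
for shop, row in tf.items():
    print(shop, row)
```

With 10K words and 80K shops most cells are zero, so a sparse representation would be the natural choice at full scale.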
Joining Mechanical Turk labels
Shops  label    priors      word likelihoods (nice, octopolo, shirt, hat, set, tires, beef, bike)
1, 2   Apparel  P(apparel)  P(nice | apparel), P(octopolo | apparel), P(shirt | apparel), P(hat | apparel), P(set | apparel), P(tires | apparel), P(beef | apparel), P(bike | apparel)
3, 5   Auto     P(auto)     P(nice | auto), P(octopolo | auto), P(shirt | auto), P(hat | auto), P(set | auto), P(tires | auto), P(beef | auto), P(bike | auto)
4      Food     P(food)     P(nice | food), P(octopolo | food), P(shirt | food), P(hat | food), P(set | food), P(tires | food), P(beef | food), P(bike | food)
15 labels
Naïve Bayes Model
What and why
• These are the model parameters
• Needed as input to the prediction formula
$$\text{Predicted Class} = \arg\max_{cl} P(cl \mid doc)$$
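$$P(cl \mid doc) = \frac{P(cl) \cdot P(doc \mid cl)}{P(doc)} \quad \text{(Bayes' theorem)}$$
$$\propto P(cl) \cdot P(doc \mid cl)$$
The denominator P(doc) is the same for every class, so it is not needed to compare classes.
$$= P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
This last step relies on the conditional independence assumption, which is actually violated in practice.
$$\text{Predicted Class} = \arg\max_{cl} P(cl \mid doc) = \arg\max_{cl} \; P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$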
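To make the argmax rule concrete, here is a small sketch that scores a document with P(cl) times the product of P(wd | cl) and checks that dividing by P(doc) does not change the winning class. All numbers in `priors` and `likelihoods` are made up for illustration and are not taken from the talk.

```python
from math import prod

# Hypothetical model parameters for two classes and three words.
priors = {"apparel": 0.4, "auto": 0.6}
likelihoods = {
    "apparel": {"shirt": 0.5, "tire": 0.1, "set": 0.4},
    "auto": {"shirt": 0.1, "tire": 0.6, "set": 0.3},
}


def unnormalised_posterior(doc, cl):
    """P(cl) * P(wd_1 | cl) * ... * P(wd_n | cl), i.e. without dividing by P(doc)."""
    return priors[cl] * prod(likelihoods[cl][wd] for wd in doc)


def predict(doc):
    """Return the class with the largest unnormalised posterior."""
    return max(priors, key=lambda cl: unnormalised_posterior(doc, cl))


doc = ["tire", "set"]
scores = {cl: unnormalised_posterior(doc, cl) for cl in priors}

# Dividing by P(doc) rescales every class by the same factor.
evidence = sum(scores.values())
posteriors = {cl: score / evidence for cl, score in scores.items()}

print(predict(doc), posteriors)
```

The normalised posteriors sum to 1, but the ranking of the classes is identical to the ranking of the raw scores, which is why the denominator can be skipped.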
Numerical Limitation
• Multiplying many values close to 0 -> float underflow
$$P(cl \mid doc) \propto P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
Shops  label    priors      word likelihoods (nice, octopolo, shirt, hat, set, tires, beef, bike)
1, 2   Apparel  Log(P(..))  Log(P(..)) for each of the 8 words
3, 5   Auto     Log(P(..))  Log(P(..)) for each of the 8 words
4      Food     Log(P(..))  Log(P(..)) for each of the 8 words
Numerical limitation
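From before:
$$P(cl \mid doc) \propto P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
• Way around: take log -> leads to summation instead of multiplication
$$\log P(cl \mid doc) = \log P(cl) + \log P(wd_1 \mid cl) + \log P(wd_2 \mid cl) + \dots + \log P(wd_n \mid cl) + \text{const}$$
where const = -log P(doc) is the same for every class.
• No impact on comparisons across classes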
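The underflow and the log workaround can be reproduced with a short sketch. The probabilities below (100 words with likelihood 1e-5 or 2e-5, prior 0.5) are arbitrary values chosen only to trigger the problem in double precision.

```python
import math


def product_score(prior, word_probs):
    """Naive score: multiply the prior by every word likelihood."""
    score = prior
    for p in word_probs:
        score *= p
    return score


def log_score(prior, word_probs):
    """Same score in log space: a sum of logs instead of a product."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


# Two hypothetical classes; class B is clearly more likely than class A.
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100

# Both products underflow to 0.0, so the classes can no longer be compared.
print(product_score(0.5, probs_a), product_score(0.5, probs_b))

# The log scores stay finite and keep the ordering of the classes.
print(log_score(0.5, probs_a), log_score(0.5, probs_b))
```

Because the logarithm is monotonically increasing, the class with the largest log score is also the class with the largest product.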
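Getting cell probabilities
$$P(wd_i \mid cl) = \frac{n_{cl,wd_i}}{\sum_{wd \in vocab} n_{cl,wd}} \qquad P(cl) = \frac{n_{cl}}{n}$$
where n_{cl,wd} is the count of word wd over all shops labelled cl, n_{cl} is the number of shops labelled cl, and n is the total number of labelled shops.
Dealing with P(wd|cl) = 0, which makes P(cl|doc) = 0 regardless of other words:
$$P(wd_i \mid cl) \approx \frac{n_{cl,wd_i} + 1}{\sum_{wd \in vocab} (n_{cl,wd} + 1)} = \frac{n_{cl,wd_i} + 1}{\sum_{wd \in vocab} n_{cl,wd} + |vocab|}$$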
Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  label
1      1     1         1      0    0    0      0     0     Apparel
2      2     0         1      1    0    0      0     0     Apparel
3      0     0         0      0    1    1      0     0     Auto
4      0     0         0      0    0    0      2     0     Food
5      0     0         0      0    1    1      0     1     Auto
80K shops

Shops  label    priors  nice         octopolo     shirt        hat          set          tires        beef         bike
1, 2   Apparel  2/5     (3+1)/(7+8)  (1+1)/(7+8)  (2+1)/(7+8)  (1+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)
3, 5   Auto     2/5     (0+1)/(5+8)  (0+1)/(5+8)  (0+1)/(5+8)  (0+1)/(5+8)  (2+1)/(5+8)  (2+1)/(5+8)  (0+1)/(5+8)  (1+1)/(5+8)
4      Food     1/5     (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (2+1)/(2+8)  (0+1)/(2+8)
15 labels
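The add-one estimates can be recomputed directly from the term counts of the five example shops. The sketch below is a toy re-implementation for illustration, not the sk-learn or Spark MLlib code used in the project; the function names and the use of `Fraction` (to keep the results readable as exact ratios) are choices made for this example.

```python
import math
from collections import Counter, defaultdict
from fractions import Fraction

vocabulary = ["nice", "octopolo", "shirt", "hat", "set", "tires", "beef", "bike"]
# (cleaned words, label) for the five example shops.
shops = [
    (["nice", "octopolo", "shirt"], "Apparel"),
    (["nice", "hat", "nice", "shirt"], "Apparel"),
    (["set", "tires"], "Auto"),
    (["beef", "beef"], "Food"),
    (["tires", "set", "bike"], "Auto"),
]


def train(shops, vocabulary):
    """Estimate priors and add-one smoothed word likelihoods per class."""
    class_counts = Counter(label for _, label in shops)
    word_counts = defaultdict(Counter)
    for words, label in shops:
        word_counts[label].update(words)
    priors = {cl: Fraction(n, len(shops)) for cl, n in class_counts.items()}
    likelihoods = {}
    for cl in class_counts:
        total = sum(word_counts[cl][wd] for wd in vocabulary)
        likelihoods[cl] = {
            wd: Fraction(word_counts[cl][wd] + 1, total + len(vocabulary))
            for wd in vocabulary
        }
    return priors, likelihoods


def predict(words, priors, likelihoods):
    """Pick the class with the highest log score; unknown words are skipped."""
    def log_score(cl):
        known = [wd for wd in words if wd in likelihoods[cl]]
        return math.log(float(priors[cl])) + sum(
            math.log(float(likelihoods[cl][wd])) for wd in known
        )
    return max(priors, key=log_score)


priors, likelihoods = train(shops, vocabulary)
print(likelihoods["Apparel"]["nice"])   # (3 + 1) / (7 + 8)
print(predict(["tires", "bike"], priors, likelihoods))
```

Thanks to the smoothing, every word has a non-zero likelihood in every class, and the likelihoods of each class still sum to 1 over the vocabulary.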
class LabeledDataFilter(): ...
class Featurizer(): ...
class Trainer(): ...
class Evaluator(): ...
class Predictor(): ...
class verticalPredictor():
    use Featurizer()
    use Predictor()
    ...

Training job (every 7 days): product_data -> model, accuracy
Prediction job (every day): product_data, model -> shop+industry
Code
Change in Training Set
• Start of home card
• Allowed asking for industry in a voluntary way
• Quickly grew to 50K shops
• Advantage: growing over time
• Issue: training set is not fully random
Shop Name
Shop URL
Shop Address
Shop City
…
Shop Predicted Industry
…
Shop Dimension
In the Data Warehouse
Updated daily
Results
Shops    top category  turker 1  turker 2  turker 3
Chive    Apparel       Apparel   Apparel   Art
Lackers  Sports        Sports    Apparel   Sports
Tesla    Auto          Auto      Auto      Sports
...      ...           ...       ...       ...
60-80%
Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    Fashion    Auto       Electro
...      ...           ...       ...       ...       ...        ...        ...
60-80% ~65%
Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    unknown    Auto       Electro
...      ...           ...       ...       ...       ...        ...        ...
90%
~75%
Business Use
Management or product teams:
• What are the biggest industries per shop count, per sales made?
• How does that evolve over time?
Theme team:
• We want to develop new themes for a given vertical, can we see the top stores in this vertical to understand trends?
Event team:
• We want to be part of an event in the music business, can we get interesting shops in this field?
Could be improved
● More metrics: add multiclass precision/recall
○ Now available in MLlib
● Better performance: rerun for combinations of parameters
○ Also added recently to MLlib but missing some components
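As an illustration of the multiclass precision/recall metric mentioned above, the sketch below computes both values per class from two label lists in plain Python. The labels are invented for the example and do not come from the project's results, and the function is a hand-rolled stand-in rather than the MLlib implementation.

```python
from collections import Counter


def precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


# Invented labels, for illustration only.
y_true = ["Apparel", "Sports", "Auto", "Auto"]
y_pred = ["Apparel", "Sports", "Apparel", "Auto"]
for label, (precision, recall) in precision_recall(y_true, y_pred).items():
    print(label, precision, recall)
```

Per-class numbers like these show which verticals are confused with each other, something a single overall accuracy figure hides.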