
Recommending Similar Items at Large Scale

Jay Katukuri, Merchandising Team - eBay

07/25/2012

Similar Items Clustering Platform

• Introduction
• Merchandising Challenges
• Similar Item Clustering (SIC) Architecture
• Clustering Approach
  – Features
  – Method
• Cluster Assignment Service
• Applications
  – Replacement/Equivalent items on CVIP (non-winner)
  – Related/Complementary items on Checkout

Introduction

• Grouping items that are similar to each other is essential for recommendation algorithms.
• Two distinct items can be considered similar if their important features are similar:
  – Titles
  – Attributes
  – Images
• The Similar Item Clustering (SIC) platform creates clusters of items.
• These clusters are now used by various recommendation systems on the site.

Similar Recommendations: Before


Similar Recommendations: Before & After


Merchandising Challenges - Motivation for SIC

• Non-productized inventory, long tail
  – Product coverage exists for only a few categories.
  – The majority of items are ad hoc listings not covered by the catalog taxonomy.
  – Maintaining catalogs is a daunting task for the long tail.
  – Inventory is one-of-a-kind, and items are short-lived.
• Unstructured data
  – Attribute coverage is minimal.
• Sparsity in the transactional data
  – Very few purchases for certain kinds of items.

  – Item-item pairs are supported by even fewer users.
• We may not see users buying both a product and accessories on eBay.
• Large data
  – A much bigger data set, in both users and inventory, than other e-commerce sites.
• Scale
  – Several hundred million listings.
  – Several million new items every day.

Similar Item Recommendations

Item Signatures: a possibility?

[Figure: items grouped into a cluster by signature. Example signatures: “apple ipod touch 4g clear film protector screen”, “clarks women shoe pumps classics”]

Similar Items: Clustering Architecture

[Architecture diagram]
• Off-line (slow, periodic): Hadoop, Cluster Generation, Cluster Dictionary
• Run-time (fast): item, Cluster Assignment Service, Item Cluster Index
• Applications: Merchandising, Navigation, etc.

Cluster Generation

Query-Item Set

• Use one month of user behavior data to collect the initial query set.
• Filter queries by length and by category-specific demand/supply ratios.

[Pipeline diagram labels: Click-stream Log, Query Normalization, Filter Queries by Demand/Supply, Query Backend, Query to Items Data]

Query Selection

• Input data:
  – Click-stream logs
• Method for choosing the queries:
  – Minimum frequency
  – Average supply threshold
  – Minimum and maximum token constraints
  – Morphological constraints: queries that contain only numbers are not allowed, e.g. “10 5”
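The query selection rules listed above could be sketched as a simple filter. This is only an illustration in Python: the `QueryStats` record, the threshold values, and the choice to treat the average supply threshold as a minimum are all assumptions, since the slides give no concrete numbers.

```python
import re
from dataclasses import dataclass


@dataclass
class QueryStats:
    query: str         # normalized query text
    frequency: int     # how often the query was issued (demand)
    avg_supply: float  # average number of items returned for the query


# Hypothetical thresholds; the slides do not state actual values.
MIN_FREQUENCY = 50
MIN_AVG_SUPPLY = 10.0
MIN_TOKENS, MAX_TOKENS = 2, 6

# Morphological constraint: reject queries made only of numbers, e.g. "10 5".
NUMERIC_ONLY = re.compile(r"^\d+(\s+\d+)*$")


def keep_query(q: QueryStats) -> bool:
    """Return True if the query passes every selection rule."""
    tokens = q.query.split()
    if q.frequency < MIN_FREQUENCY:
        return False
    if q.avg_supply < MIN_AVG_SUPPLY:
        return False
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False
    if NUMERIC_ONLY.match(q.query.strip()):
        return False
    return True
```

For example, `keep_query(QueryStats("nike airmax", 100, 50.0))` passes, while a query such as “10 5” is rejected by the numeric-only rule regardless of its demand.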

K-Means Clustering

[Pipeline diagram labels: Query to Items Data, Base Cluster Generation, Generate Item Features, Scoring Models, K-Means Clustering of Base Clusters, Split Clusters]

• Use item title, category, and attributes as features for clustering.
• Applying k-means to each base cluster separately produces better-quality clusters and makes the process faster.
• Use cosine distance for item clustering.
• Cluster size is chosen as a tuning parameter.

Base Cluster Generation

• Base cluster ≡ query
• Find merge candidates based on query term overlap
  – E.g.: “nike airmax tennis shoes” -> “nike airmax”; “nike airmax tennis shoes” -> “nike shoes”
• Score candidates using cosine similarity
  – Term weight: TF-IDF in the query space (document = query)
    • TF: query demand
    • IDF: number of queries
• The most similar merge candidate wins
  – E.g.: “nike airmax tennis shoes” -> “nike airmax”
• Merge the corresponding recall sets
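A minimal Python sketch of the merge-candidate scoring described above. The slides do not give the exact formulas, so several details are assumptions: candidates are queries whose terms form a proper subset of the source query, TF for a term is `log(1 + total demand of queries containing it)`, and IDF is `log(N / number of queries containing it)`. The demand figures are made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical query -> demand table (illustrative numbers only).
DEMAND = {
    "nike airmax tennis shoes": 40,
    "nike airmax": 500,
    "nike shoes": 900,
    "adidas shoes": 700,
    "tennis racket": 300,
}


def term_weights(demand):
    """TF-IDF weight per term, treating each query as a document."""
    n_queries = len(demand)
    doc_freq = defaultdict(int)     # number of queries containing the term
    term_demand = defaultdict(int)  # summed demand of those queries
    for query, d in demand.items():
        for term in set(query.split()):
            doc_freq[term] += 1
            term_demand[term] += d
    return {
        t: math.log(1 + term_demand[t]) * math.log(n_queries / doc_freq[t])
        for t in doc_freq
    }


def cosine(a, b, weights):
    """Cosine similarity between two queries in the weighted term space."""
    ta, tb = set(a.split()), set(b.split())
    dot = sum(weights[t] ** 2 for t in ta & tb)
    norm_a = math.sqrt(sum(weights[t] ** 2 for t in ta))
    norm_b = math.sqrt(sum(weights[t] ** 2 for t in tb))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_merge(query, demand, weights):
    """Pick the most similar shorter query, or None if there is no candidate."""
    terms = set(query.split())
    candidates = [c for c in demand if c != query and set(c.split()) < terms]
    return max(candidates, key=lambda c: cosine(query, c, weights), default=None)


WEIGHTS = term_weights(DEMAND)
```

With this toy table, `best_merge("nike airmax tennis shoes", DEMAND, WEIGHTS)` selects “nike airmax” over “nike shoes”, because “airmax” is rarer in the query space than “shoes”.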

Base Cluster Merge

• Reduces the number of base clusters to half.
• Example (queries merged into the base cluster “phrase(hand,made) quilt”):
  – phrase(hand,made) phrase(king,s) queen quilt
  – phrase(hand,made) phrase(pink,s) quilt
  – phrase(hand,made) phrase(prae,owned) queen quilt
  – phrase(hand,made) queen quilt
  – phrase(hand,made) phrase(prae,owned) quilt
  – phrase(hand,made) quilt size twin
  – phrase(hand,made) quilt silk
  – phrase(hand,made) quilt twin
  – phrase(hand,made) phrase(patch,work) quilt
  – phrase(hand,made) quilt white
  – phrase(hand,made) phrase(king,size) quilt
  – phrase(hand,made) phrase(yo,yo,s) quilt
  – phrase(hand,made) quilt sale
  – phrase(hand,made) quilt red

Item Features Generation

• Item Title:
  3x clear screen protector film skin for apple ipod touch 4 4g
• Normalization:
  3-x clear screen protector film skin for apple ipod touch 4 4-g
• Concept Extractor:
  3-x color=clear type=‘screen protector’ film skin compatible brand = ‘for apple’ compatible product=‘ipod touch’ 4 model = 4-g
• Expansions:
  PHRASE(3,x) color=clear type=‘screen protector’ OR(film,films) OR(skin,skins) compatible brand = ‘for apple’ compatible product=‘ipod touch’ 4 model = 4-g
• Output: Normalized Item Features
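In the example above, normalization appears to hyphenate digit/letter boundaries (“3x” becomes “3-x”, “4g” becomes “4-g”). The following Python sketch reproduces only that one rule, assuming lowercase ASCII tokens; the function name `normalize_title` is hypothetical, and the real normalizer presumably does more than this.

```python
import re

# Zero-width match at every boundary between a digit and a letter.
_DIGIT_LETTER_BOUNDARY = re.compile(r"(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)")


def normalize_title(title: str) -> str:
    """Lowercase the title and hyphenate digit/letter boundaries."""
    return _DIGIT_LETTER_BOUNDARY.sub("-", title.lower())
```

The same rule also accounts for tokens such as “t3i” becoming “t-3-i” and “lp-e8” becoming “lp-e-8”.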

Item Features : Concept Extraction

• Problem: extract concepts from the item title.
• Purpose:
  – Attribute coverage is sparse in many categories.
  – Extracted concepts can be used as features.
• Approach:
  – Fast online service to extract entities from any eBay text (item title, product title, etc.)
  – Batch capability so that it can be used in Hadoop
  – Restricted to known and important (above a certain threshold) name/values
  – Unsupervised model
  – Uses a statistical approach based on a large amount of data

Examples

Unstructured item title -> extracted structured data:

• “Women’s black dress size 16 worn once”
  – Size: 16
  – Gender: Women
  – Color: Black
  – Style: Dress
• “Gucci medium ivory leather handbag”
  – Brand: Gucci
  – Size: Medium
  – Color: Ivory
  – Material: Leather
  – Style: Handbag
• “Black Leather Case Cover for Reader Amazon Kindle 3 3G”
  – Brand: Amazon Kindle 3
  – Model: 3G
  – Type: Leather Case
  – Color: Black

Items shown:
• Itemid: 380361729748, Meta: Computers & Networking
• Itemid: 300477503372, Meta: CSA
• Itemid: 300494995198, Meta: CSA

Dictionary Generation Method

[Diagram labels: Data-warehouse, Data Cleansing, Dictionary Generation, Co-occurrence Matrix of Name-values, TF-IDF scores of name-values in a category, Concept Dictionary]

Other dictionaries used:
• Units dictionary
• Synonym names
• Famous persons list

Item Features : Concept Extraction

• Co-occurrence of concepts is used to approximate the joint probability.
  – E.g.: brand=apple, model=iphone 4
• Using dictionaries at multiple levels reduces the ambiguity of the same value having multiple names.
  – “apple” is a “compatible brand” in the accessories category
  – “apple” is a “brand” in the devices category
• 'hp pavilion' and 'hp' are both valid values for brand; the ambiguity is resolved using the TF-IDF scores of the name-value pairs in the particular category.
• Regexes were added to extract size patterns in CSA.
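A rough Python sketch of category-specific dictionary lookup with score-based disambiguation, in the spirit of the points above. The dictionary contents, the scores, and the greedy non-overlapping selection are all assumptions made for illustration; the slides do not describe the actual extraction algorithm.

```python
# Hypothetical per-category dictionaries: value -> (concept name, TF-IDF score).
CONCEPTS = {
    "accessories": {
        "apple": ("compatible brand", 0.8),
        "ipod touch": ("compatible product", 0.9),
        "clear": ("color", 0.5),
    },
    "devices": {
        "apple": ("brand", 0.9),
        "hp": ("brand", 0.4),
        "hp pavilion": ("brand", 0.7),
    },
}


def extract_concepts(title, category, max_len=3):
    """Return {concept name: value} for the best non-overlapping matches."""
    tokens = title.lower().split()
    dictionary = CONCEPTS.get(category, {})

    # Collect every n-gram of the title that is a known value in this category.
    matches = []
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            end = start + n
            if end > len(tokens):
                break
            value = " ".join(tokens[start:end])
            if value in dictionary:
                name, score = dictionary[value]
                matches.append((score, start, end, name, value))

    # Greedily keep the highest-scoring matches that do not overlap.
    used, result = set(), {}
    for score, start, end, name, value in sorted(matches, key=lambda m: (-m[0], m[1])):
        span = set(range(start, end))
        if span & used or name in result:
            continue
        used |= span
        result[name] = value
    return result
```

With these made-up scores, “hp pavilion laptop” in the devices category yields brand = “hp pavilion” rather than “hp”, and “apple” maps to a different concept name depending on the category.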

Item Features : Term Scores

• Problem: given an item title in a leaf category, compute the significance of the terms in the title.
  – While assigning items to clusters, identify which terms in the item title are more important than others.
• Issues:
  – Existing scoring models are built as services.
  – They are inefficient to use in batch mode on Hadoop.
  – Unigram models

Mutual Information

• The score of a term ‘t’ for a given item ‘i’ is computed using the mutual information of term ‘t’ and category ‘c’.
  – ‘c’ is the L2 category of item ‘i’.
• Item titles from EDW are used as input data.
• Scores are computed for the normalized tokens.
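The slides do not give the exact formula, so the sketch below uses pointwise mutual information (PMI) computed from title-level counts as one plausible reading of “mutual information of term and category”. The function name and the toy input are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(titles_by_category):
    """Return {(term, category): PMI} using per-title document counts.

    PMI(t, c) = log( P(t, c) / (P(t) * P(c)) )
              = log( n(t, c) * N / (n(t) * n(c)) )
    """
    term_cat = Counter()  # titles in category c that contain term t
    term = Counter()      # titles containing term t (any category)
    cat = Counter()       # titles in category c
    total = 0
    for category, titles in titles_by_category.items():
        for title in titles:
            total += 1
            cat[category] += 1
            for t in set(title.split()):
                term[t] += 1
                term_cat[(t, category)] += 1
    return {
        (t, c): math.log(n_tc * total / (term[t] * cat[c]))
        for (t, c), n_tc in term_cat.items()
    }


# Toy input: normalized titles grouped by (hypothetical) L2 category.
SCORES = pmi_scores({
    "A": ["ipod touch case", "ipod touch screen protector"],
    "B": ["leather case", "leather pumps"],
})
```

In this toy data, “ipod” occurs only in category A and scores log 2, while “case” is spread evenly across both categories and scores 0, so “ipod” is treated as the more significant term for items in A.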

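K-Means 1/3

K-means is a well-known clustering algorithm.

Choose k initial cluster centroids: $m_1^{(1)}, \ldots, m_k^{(1)}$

Assignment step (standard formulation): $S_i^{(t)} = \{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \forall j,\ 1 \le j \le k \}$

Update step (standard formulation): $m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$

Optimize:
– Intra-cluster similarity
– Inter-cluster distortion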

K-Means 2/3

1. Choose random cluster centroids
2. Update centroids based on neighborhood
3. Final clusters

We use a version of k-means called “bisecting k-means”, which tends to produce better-quality results than standard k-means.
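For readers unfamiliar with the bisecting variant, here is a small pure-Python sketch that uses cosine distance. It is not the production Hadoop implementation: the rule of always splitting the largest cluster, the fixed iteration cap, and the function names are assumptions chosen to keep the example short.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def mean(vectors):
    """Unit-length mean direction of a non-empty list of vectors."""
    dim = len(vectors[0])
    return normalize([sum(v[d] for v in vectors) / len(vectors) for d in range(dim)])


def two_means(ids, vecs, rng, iterations=20):
    """Split the points in `ids` into two groups with plain k-means (k = 2)."""
    centroids = [vecs[i] for i in rng.sample(ids, 2)]
    groups = [list(ids), []]
    for _ in range(iterations):
        new_groups = [[], []]
        for i in ids:
            nearest = min((0, 1), key=lambda c: cosine_distance(vecs[i], centroids[c]))
            new_groups[nearest].append(i)
        if not new_groups[0] or not new_groups[1] or new_groups == groups:
            break  # degenerate split or converged
        groups = new_groups
        centroids = [mean([vecs[i] for i in g]) for g in groups]
    return groups


def bisecting_kmeans(vectors, k, seed=0):
    """Repeatedly bisect the largest cluster until there are k clusters."""
    rng = random.Random(seed)
    vecs = [normalize(v) for v in vectors]
    clusters = [list(range(len(vecs)))]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        largest = clusters[0]
        if len(largest) < 2:
            break
        left, right = two_means(largest, vecs, rng)
        if not left or not right:
            break  # could not split any further
        clusters = clusters[1:] + [left, right]
    return clusters
```

Each returned cluster is a list of indices into the input, so `bisecting_kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 2)` groups the first two vectors together and the last two together.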

K-Means 3/3

• Pros
  – Simple to understand and implement.
  – Easily parallelizable.
  – Generally produces good-quality clusters when K is small.
• Cons
  – Slow to converge when K is large.
  – Cluster quality degrades with large K.
  – K must be decided beforehand; finding a suitable K needs domain knowledge and tuning.

K-Means Clustering : Cluster Description

• Clusters are described using the centroids of the clusters.
• Cluster 1: “L1=293 L2=56169 L3=168096 compatible brand = apple compatible product = ipod touch Phrase(4,g) clear film protector screen”
• Cluster 2: “L1=11450 L2=3034 L3=55793 brand = indigo by clarks shoe style = pumps classics”
• There are about x million clusters for the US.
• These x million clusters cover more than 92% of the US inventory.

Shingling for cluster merging

• Problem: given a set of clusters, find a grouping of similar clusters.
• Approach:
  – Represent each cluster as a “document”
  – Compute 5 min 3-shingles
  – Check for an 80% match for belonging to the same group
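One way to read “5 min 3-shingles” is: hash every word 3-shingle of the cluster document and keep the five smallest hashes as its sketch. The Python below follows that reading and treats two clusters as belonging to the same group when at least 80% of their retained shingles match. The hash function (MD5) and the function names are assumptions.

```python
import hashlib


def shingles(text, size=3):
    """All contiguous word n-grams (shingles) of the given size."""
    tokens = text.split()
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def min_shingles(text, count=5, size=3):
    """The `count` smallest shingle hashes, used as a compact sketch."""
    def stable_hash(s):
        return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
    return set(sorted(stable_hash(s) for s in shingles(text, size))[:count])


def same_group(doc_a, doc_b, count=5, threshold=0.8):
    """True if the two sketches share at least `threshold` of their shingles."""
    sketch_a = min_shingles(doc_a, count)
    sketch_b = min_shingles(doc_b, count)
    if not sketch_a or not sketch_b:
        return False
    overlap = len(sketch_a & sketch_b) / max(len(sketch_a), len(sketch_b))
    return overlap >= threshold
```

Because only a handful of hashes per cluster are compared, the check stays cheap even with millions of cluster descriptions.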

Shingling basics 1/3

Cluster Assignment

[Diagram labels: Closed View Item; Item Title, Attributes, Leaf Category, Site; Assignment Service; Pre-processing; Meta-data Files; Cluster Dictionary; Cluster Assignment Inverted Index; Rank Clusters; Voyager Call for top N clusters; Rank top N similar Items*; Recommended Similar Items]

Implemented using Lucene

Cluster Assignment : Pre-processing

Input title:
  new 2x for canon lp-e8 battery + charger + lens hood eos 550d 600d digital rebel t3i
1. RTL Normalization:
  new,2-x,for,canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital,rebel,t-3-i
2. Concept Extraction:
  new,2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
3. Stop Word Filtering:
  2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
4. STL Expansion:
  2-x,for canon,PHRASE(lp,e,8),OR(batteries,battery,batterys),OR(charger,chargers),lens,OR(hood,PHRASE(hood,s),hoods),eos,PHRASE(550,d),PHRASE(600,d),digital rebel,t-3-i
5. Query Reduction / Unification:
  2-x,for canon,phrase(lp,e,8),batteries,charger,lens,phrase(hood,s),eos,phrase(550,d),phrase(600,d),digital rebel,t-3-i

Cluster Assignment : Scoring

• Indexing fields: title terms and categories
• Reward matching terms and penalize non-matching terms
  – Reward matching terms
    • Number of terms matching from the input
  – Importance of the term in the input
    • Query-time boost
  – Penalize non-matching terms from cluster ‘c’
    • Index-time boost: field length normalization
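The scoring idea above can be imitated in a few lines of plain Python. This is only an analogy and not the Lucene implementation: the term importances stand in for query-time boosts, the fraction of matched input terms rewards coverage, and `1 / sqrt(cluster length)` plays the role of field length normalization, so clusters with many non-matching terms score lower. All names and numbers are hypothetical.

```python
import math


def score_cluster(item_terms, cluster_terms):
    """Score one cluster for an item.

    item_terms: dict of term -> importance (acts like a query-time boost)
    cluster_terms: set of terms describing the cluster
    """
    matched = [t for t in item_terms if t in cluster_terms]
    if not matched:
        return 0.0
    reward = sum(item_terms[t] for t in matched)        # importance of matches
    coverage = len(matched) / len(item_terms)           # share of input matched
    length_norm = 1.0 / math.sqrt(len(cluster_terms))   # penalize extra terms
    return reward * coverage * length_norm


def rank_clusters(item_terms, clusters, top_n=5):
    """Return the ids of the top_n clusters with a non-zero score."""
    scored = [(score_cluster(item_terms, terms), cid) for cid, terms in clusters.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for score, cid in scored if score > 0][:top_n]


ITEM = {"ipod": 2.0, "touch": 1.5, "screen": 1.0, "protector": 1.0}
CLUSTERS = {
    "c1": {"ipod", "touch", "screen", "protector", "clear", "film"},
    "c2": {"ipod", "touch", "case", "leather"},
    "c3": {"pumps", "classics"},
}
```

Here `rank_clusters(ITEM, CLUSTERS)` places the screen-protector cluster first, the case cluster second, and drops the unrelated shoe cluster.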

Cluster Assignment Cross Validation

• Compute the precision of recommending items from the “correct” cluster(s)
  – Clusters that generate purchases (BIDs and/or BINs)
• Labeled data
  – View-Buy data generated from user session analysis
    • CVIP -> Bid/BIN in the same user session
    • Same category

Cluster Assignment Cross Validation : Method

• For each […] and […] in […], top k (5) clusters list […] and […]
• Ignoring the position, compute precision in top k
• Ignores
  – True […] dependent on ranking
  – Assume every item belonging to a cluster is equally likely to be recommended
• Normalized Precision […]
  – where […] is the smallest cluster in […]

Merchandising Applications


• Two kinds of recommendation systems use SIC:
  – Recommending similar items on the CVIP non-winner page
  – Collaborative Filtering (CF) algorithms:
    • “Buy-Buy” – on the Checkout page
    • “View-Buy” – on AVIP

Similar Item Recommendations

• User bid on an item but lost it
  – Show similar items as replacement items.
• User was watching an item that has ended
  – Show similar items as replacement items.
• User viewed an item but did not make a purchase
  – Show similar items to showcase more choices.
  – Inject diversity into the recommendations.
