Recommending Similar Items at Large Scale

Jay Katukuri, Merchandising Team - eBay

07/25/2012


Similar Items Clustering Platform

• Introduction
• Merchandising Challenges
• Similar Item Clustering (SIC) Architecture
• Clustering Approach
  – Features
  – Method
• Cluster Assignment Service
• Applications
  – Replacement/Equivalent items on CVIP – Non-winner
  – Related/Complementary items on Checkout

Introduction

• Grouping of items that are similar to each other is essential for recommendation algorithms.
• Two distinct items can be considered similar if important features are similar:
  – Titles
  – Attributes
  – Images
• The Similar Item Clustering (SIC) platform creates clusters of items.
• These clusters are used for various recommendation systems on the site now.


Similar Recommendations: Before


Similar Recommendations: Before & After


Merchandising Challenges - Motivation for SIC

• Non-productized inventory, long tail
  – Product coverage exists for only a few categories.
  – The majority of items are ad hoc listings not covered by the catalog taxonomy.
  – Maintaining catalogs is a daunting task for the long tail.
  – One-of-a-kind inventory; items are short-lived.
• Unstructured data
  – Attribute coverage is minimal.
• Sparsity in the transactional data
  – Very few purchases for certain kinds of items.

Merchandising Challenges - Motivation for SIC

• Sparsity in the transactional data (contd.)
  – Item-item pairs are supported by even fewer users.
  – We may not see users buying both a product and accessories on eBay.
• Large data
  – Much bigger data set in both users and inventory than other ecommerce sites.
• Scale
  – Several hundred million listings.
  – Several million new items every day.


Similar Item Recommendations


Item Signatures: a possibility?

Cluster signatures:
• apple ipod touch 4g clear film protector screen
• clarks women shoe pumps classics

Similar Items: Clustering Architecture

[Architecture diagram]
• Off-line (slow, periodic): Cluster Generation on Hadoop; Cluster Dictionary
• Run-time (fast): item; Cluster Assignment Service; Item Cluster Index
• Applications: Merchandising, Navigation, etc.


Cluster Generation

Query-Item Set

• Use 1 month of user behavior data to collect the initial query set.
• Filter queries by length and category-specific demand/supply ratios.

[Pipeline diagram] Stages shown: Click-stream Log, Query Backend, Query Normalization, Filter Queries by Demand/Supply, Query to Items Data

Query Selection

• Input data:
  – Click-stream logs
• Method for choosing the queries:
  – Minimum frequency
  – Average supply threshold
  – Min and max token constraint
  – Morphological constraints: queries that contain only numbers are not allowed, e.g. “10 5”
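The selection rules above can be sketched as a simple filter. This is only an illustration: the `QueryStats` record, its field names, and every threshold value are assumptions, since the slides do not give the actual values or data layout.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    """Aggregated click-stream statistics for one normalized query (hypothetical record)."""
    query: str
    frequency: int      # how often the query was issued in the period
    avg_supply: float   # average number of items returned for the query


def keep_query(stats: QueryStats,
               min_frequency: int = 10,       # assumed threshold
               min_avg_supply: float = 5.0,   # assumed threshold
               min_tokens: int = 2,           # assumed token bounds
               max_tokens: int = 6) -> bool:
    """Return True if the query passes all selection constraints."""
    tokens = stats.query.split()
    if stats.frequency < min_frequency:
        return False
    if stats.avg_supply < min_avg_supply:
        return False
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    # Morphological constraint: reject queries made up only of numbers, e.g. "10 5".
    if all(tok.isdigit() for tok in tokens):
        return False
    return True
```

For example, `keep_query(QueryStats("10 5", 100, 50.0))` is rejected by the numbers-only rule even though it passes the frequency and supply checks.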

K-Means Clustering

[Pipeline diagram] Stages shown: Query to Items Data, Base Cluster Generation, Generate Item Features, Scoring Models, K-Means Clustering of Base Clusters, Split Clusters

• Use item title, category and attributes as features for clustering.
• Applying k-means to each base cluster separately produces better-quality clusters and makes the process faster.
• Use cosine distance for item clustering.
• Cluster size is chosen as a tuning parameter.

Base Cluster Generation

• Base Cluster ≡ Query
• Find merge candidates based on query term overlap
  – E.g. “nike airmax tennis shoes” -> “nike airmax”
  – E.g. “nike airmax tennis shoes” -> “nike shoes”
• Score candidates using cosine similarity
  – Term weight: TF-IDF in the query space (document = query)
    • TF: Query Demand
    • IDF: Number of Queries
• Most similar merge candidate wins
  – E.g. “nike airmax tennis shoes” -> “nike airmax”
• Merge corresponding recall sets
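A minimal sketch of the merge-candidate scoring, under stated assumptions. The slide only says "TF: Query Demand, IDF: Number of Queries"; here that is read as: a term's TF is the total demand of the queries containing it, and its IDF is computed from the number of queries containing it. The function names, the `+ 1.0` smoothing, and the sample demand numbers are all made up for illustration.

```python
import math
from collections import Counter


def term_weights(demand):
    """TF-IDF weight per term, treating every query as one document.

    demand: dict mapping query string -> demand (e.g. search count).
    """
    n = len(demand)
    tf, df = Counter(), Counter()
    for query, d in demand.items():
        for term in set(query.split()):
            tf[term] += d   # assumed reading of "TF: Query Demand"
            df[term] += 1   # number of queries containing the term
    return {t: tf[t] * (math.log(n / df[t]) + 1.0) for t in tf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_merge_target(query, demand):
    """Pick the most similar shorter query whose terms all occur in `query`."""
    weights = term_weights(demand)
    terms = set(query.split())
    candidates = [q for q in demand if set(q.split()) < terms]
    if not candidates:
        return None
    q_vec = {t: weights[t] for t in terms}
    return max(candidates,
               key=lambda c: cosine(q_vec, {t: weights[t] for t in set(c.split())}))


# Hypothetical demand figures, chosen only to exercise the code.
demand = {
    "nike airmax tennis shoes": 50,
    "nike airmax": 500,
    "nike shoes": 300,
    "adidas shoes": 200,
}
```

With these numbers, `best_merge_target("nike airmax tennis shoes", demand)` picks "nike airmax" over "nike shoes", mirroring the slide's example; the recall sets of the two queries would then be merged.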

Base Cluster Merge

• Reduces the number of base clusters to half.
• Example: the following queries are merged into “phrase(hand,made) quilt”:
  – phrase(hand,made) phrase(king,s) queen quilt
  – phrase(hand,made) phrase(pink,s) quilt
  – phrase(hand,made) phrase(prae,owned) queen quilt
  – phrase(hand,made) queen quilt
  – phrase(hand,made) phrase(prae,owned) quilt
  – phrase(hand,made) quilt size twin
  – phrase(hand,made) quilt silk
  – phrase(hand,made) quilt twin
  – phrase(hand,made) phrase(patch,work) quilt
  – phrase(hand,made) quilt white
  – phrase(hand,made) phrase(king,size) quilt
  – phrase(hand,made) phrase(yo,yo,s) quilt
  – phrase(hand,made) quilt sale
  – phrase(hand,made) quilt red

Item Features Generation

Item Title -> Normalization -> Concept Extractor -> Expansions -> Normalized Item Features

• Item title: 3x clear screen protector film skin for apple ipod touch 4 4g
• After normalization: 3-x clear screen protector film skin for apple ipod touch 4 4-g
• After concept extraction: 3-x color=clear type=‘screen protector’ film skin compatible brand = ‘for apple’ compatible product=‘ipod touch’ 4 model = 4-g
• After expansions: PHRASE(3,x) color=clear type=‘screen protector’ OR(film,films) OR(skin,skins) compatible brand = ‘for apple’ compatible product=‘ipod touch’ 4 model = 4-g
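The normalization step in the example above (3x -> 3-x, 4g -> 4-g) can be approximated with a small regex-based tokenizer. This is a sketch inferred only from the slide's input/output pairs; the real normalizer surely handles more cases, and the function name is hypothetical.

```python
import re
from typing import List

# Zero-width boundary between a digit and a letter, in either order.
_DIGIT_LETTER_BOUNDARY = re.compile(r"(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)")


def normalize_title(title: str) -> List[str]:
    """Lowercase, tokenize, and hyphenate digit/letter boundaries inside tokens."""
    # Tokens start with a letter or digit; stray symbols such as "+" are dropped.
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", title.lower())
    return [_DIGIT_LETTER_BOUNDARY.sub("-", tok) for tok in tokens]
```

Applied to the slide's title, `normalize_title("3x clear screen protector film skin for apple ipod touch 4 4g")` yields the tokens of "3-x clear screen protector film skin for apple ipod touch 4 4-g".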

Item Features: Concept Extraction

• Problem: Extract concepts from item title.
• Purpose:
  – Attribute coverage is sparse in many categories.
  – Extracted concepts can be used as features.
• Approach:
  – Fast online service to extract entities from any eBay text (item title, product title, etc.)
  – Batch capability so that it can be used in Hadoop
  – Restricted to known and important (above a certain threshold) name/values
  – Unsupervised model
  – Uses a statistical approach based on a large amount of data

Examples

Unstructured Item Title -> Extracted Structured Data

• Women’s black dress size 16 worn once
  – Size: 16
  – Gender: Women
  – Color: Black
  – Style: Dress
• Gucci medium ivory leather handbag
  – Brand: Gucci
  – Size: Medium
  – Color: Ivory
  – Material: Leather
  – Style: Handbag
• Black Leather Case Cover for Reader Amazon Kindle 3 3G
  – Brand: Amazon Kindle 3
  – Model: 3G
  – Type: Leather Case
  – Color: Black

Items shown:
• Itemid: 380361729748, Meta: Computers & Networking
• Itemid: 300477503372, Meta: CSA
• Itemid: 300494995198, Meta: CSA

Dictionary Generation Method

[Flow diagram] Stages shown: Data-warehouse, Data Cleansing, Dictionary Generation, Concept Dictionary

• Co-occurrence matrix of name-values
• Tf-idf scores of name-values in a category
• Other dictionaries used:
  – Units dictionary
  – Synonym names
  – Famous persons list

Item Features: Concept Extraction

• Co-occurrence of concepts is used to approximate the joint probability.
  – E.g. brand=apple, model=iphone 4
• Using dictionaries at multiple levels reduces the ambiguity of the same value having multiple names.
  – “apple” is “compatible brand” in the accessories category
  – “apple” is “brand” in the devices category
• 'hp pavilion' and 'hp' are both valid values for brand; the ambiguity is resolved using the tf-idf scores of name-value pairs in the particular category.
• Regexes were added to extract size patterns in CSA.
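A toy sketch of category-specific dictionary lookup, assuming a dictionary keyed by category that maps a value phrase to a (name, score) pair. The dictionary layout, the scores, and the function name are hypothetical; the slides only state that per-category dictionaries and tf-idf scores are used to resolve ambiguity.

```python
# Hypothetical concept dictionary: category -> value phrase -> (name, score).
CONCEPTS = {
    "accessories": {
        ("apple",): ("compatible brand", 0.9),
        ("ipod", "touch"): ("compatible product", 0.8),
        ("clear",): ("color", 0.5),
    },
    "devices": {
        ("apple",): ("brand", 0.9),
    },
    "laptops": {
        ("hp",): ("brand", 0.4),
        ("hp", "pavilion"): ("brand", 0.7),
    },
}


def extract_concepts(tokens, category, dictionary=CONCEPTS, max_len=3):
    """Tag tokens with name=value concepts using the category's dictionary.

    At each position, all phrases up to `max_len` tokens are looked up and the
    highest-scoring match wins (standing in for the slide's tf-idf scores).
    """
    entries = dictionary.get(category, {})
    concepts, i = [], 0
    while i < len(tokens):
        best = None  # (score, phrase length, name, value)
        for n in range(1, max_len + 1):
            phrase = tuple(tokens[i:i + n])
            if len(phrase) == n and phrase in entries:
                name, score = entries[phrase]
                if best is None or score > best[0]:
                    best = (score, n, name, " ".join(phrase))
        if best is None:
            concepts.append(tokens[i])  # unmatched token is kept as a plain term
            i += 1
        else:
            _, n, name, value = best
            concepts.append(f"{name}={value}")
            i += n
    return concepts
```

In this sketch the same token "apple" becomes `compatible brand=apple` in the accessories category and `brand=apple` in the devices category, and "hp pavilion" wins over "hp" because of its higher (assumed) score.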

Item Features: Term Scores

• Problem: Given an item title in a leaf category, compute the significance of the terms in the title.
  – While assigning items to clusters, identify which terms in the item title are more important than others.
• Issues:
  – Existing scoring models are built as services.
  – They are inefficient to use in batch mode on Hadoop.
  – Unigram models.

Mutual Information

• Score of a term ‘t’ for a given item ‘i’ is computed using the mutual information of term ‘t’ and category ‘c’.
  – ‘c’ is the L2 category of item ‘i’.
• Item titles from EDW are used as input data.
• Scores are computed for the normalized tokens.
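The slide does not preserve the exact formula. One common choice is pointwise mutual information, PMI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from token counts. The sketch below assumes that form and a toy input layout (tokenized titles grouped by category); both are assumptions, not the deck's implementation.

```python
import math
from collections import Counter


def pmi_scores(titles_by_category):
    """Pointwise mutual information of (term, category) from tokenized titles.

    titles_by_category: dict mapping category -> list of token lists.
    """
    term_cat = Counter()  # occurrences of term t inside category c
    term = Counter()      # occurrences of term t overall
    cat = Counter()       # token occurrences per category
    for c, titles in titles_by_category.items():
        for tokens in titles:
            for t in tokens:
                term_cat[(t, c)] += 1
                term[t] += 1
                cat[c] += 1
    total = sum(cat.values())
    return {
        (t, c): math.log((n / total) / ((term[t] / total) * (cat[c] / total)))
        for (t, c), n in term_cat.items()
    }


# Toy data: "new" appears everywhere, "nike" only in shoes.
sample = {
    "shoes": [["nike", "shoes", "new"], ["clarks", "pumps", "new"]],
    "phones": [["apple", "iphone", "new"]],
}
scores = pmi_scores(sample)
```

A term spread evenly across categories (here "new") scores close to zero, while a category-specific term (here "nike" in shoes) scores higher, which is the behavior wanted when deciding which title terms matter for cluster assignment.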

K-Means 1/3

K-Means is a well-known clustering algorithm.

• Choose k initial cluster centroids: m1(1), …, mk(1)
• Assignment Step:
• Update Step:
• Optimize:
  – Intra-Cluster Similarity
  – Inter-Cluster Distortion
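The formulas on this slide were not preserved. For reference, the textbook k-means steps can be written with the slide's centroid notation as below. The symbols x_p and S_i and the squared Euclidean distance are the standard formulation, not taken from the slide; the deck states elsewhere that cosine distance is used for item clustering.

```latex
\begin{aligned}
\text{Assignment step:}\quad
  & S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2
    \ \ \forall j,\ 1 \le j \le k \right\} \\
\text{Update step:}\quad
  & m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j \\
\text{Objective:}\quad
  & \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \left\| x - m_i \right\|^2
\end{aligned}
```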

K-Means 2/3

1. Choose random cluster centroids
2. Update centroids based on neighborhood
3. Final clusters

We use a version of k-means called “Bisecting K-means”, which tends to produce better-quality results than standard k-means.
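A minimal sketch of bisecting k-means with cosine similarity, written in plain Python for small dense vectors. It illustrates the general technique (repeatedly split the largest cluster in two with k=2 k-means); the choice of which cluster to split, the iteration cap, and the seeding are assumptions, and the deck's Hadoop implementation is not shown in the slides.

```python
import math
import random


def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(points, rng, iters=20):
    """Split `points` into two groups with k-means (k=2) under cosine similarity."""
    centroids = rng.sample(points, 2)
    groups = [points, []]
    for _ in range(iters):
        new_groups = [[], []]
        for p in points:
            nearest = 0 if cosine_sim(p, centroids[0]) >= cosine_sim(p, centroids[1]) else 1
            new_groups[nearest].append(p)
        if new_groups == groups:  # assignments stopped changing
            break
        groups = new_groups
        # Keep the old centroid if a group happens to be empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return groups


def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until there are k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        largest = clusters.pop(0)
        if len(largest) < 2:
            clusters.append(largest)
            break
        left, right = two_means(largest, rng)
        if not left or not right:  # degenerate split; stop rather than loop forever
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = bisecting_kmeans(points, 2)
```

On the four toy vectors the two near-horizontal points end up in one cluster and the two near-vertical points in the other. Splitting one cluster at a time keeps each k-means run at k=2, which is the regime the next slide lists as a strength of k-means.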

K-Means 3/3

• Pros
  – Simple to understand and implement.
  – Easily parallelizable.
  – Generally produces good-quality clusters when K is small.
• Cons
  – Slow to converge when K is large.
  – Cluster quality degrades with large K.
  – K must be decided beforehand; finding a suitable K needs domain knowledge and tuning.

K-Means Clustering: Cluster Description

• Clusters are described using the centroids of the clusters.
• Cluster 1: “L1=293 L2=56169 L3=168096 compatible brand = apple compatible product = ipod touch Phrase(4,g) clear film protector screen”
• Cluster 2: “L1=11450 L2=3034 L3=55793 brand = indigo by clarks shoe style = pumps classics”
• There are about x million clusters for US.
• These x million clusters cover more than 92% of the US inventory.

Shingling for cluster merging

• Problem: Given a set of clusters, find a grouping of similar clusters.
• Approach:
  – Represent each cluster as a “document”
  – Compute 5 min 3-shingles
  – Check for 80% match for belonging to the same group
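One possible reading of "5 min 3-shingles" is: form word 3-shingles of the cluster document, hash each shingle, and keep the 5 smallest hash values as a sketch; two clusters fall in the same group when at least 80% of their sketches agree. The sketch below implements that reading. The hash function, the overlap definition, and the function names are assumptions.

```python
import hashlib


def shingles(tokens, size=3):
    """All contiguous word n-grams; a short document is kept as one shingle."""
    if len(tokens) < size:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def min_shingle_sketch(text, num_min=5, size=3):
    """Keep the `num_min` smallest hash values over the word shingles of `text`."""
    hashes = sorted(
        int(hashlib.sha256(s.encode("utf-8")).hexdigest(), 16)
        for s in shingles(text.split(), size)
    )
    return set(hashes[:num_min])


def same_group(text_a, text_b, threshold=0.8):
    """True if the two sketches overlap by at least `threshold`."""
    a, b = min_shingle_sketch(text_a), min_shingle_sketch(text_b)
    overlap = len(a & b) / max(len(a), len(b))
    return overlap >= threshold
```

With 5 values per sketch, an 80% threshold means at least 4 of the 5 minimum shingle hashes must coincide. Comparing small fixed-size sketches instead of full documents is what makes the pairwise check cheap at scale.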

Shingling basics 1/3

Shingling basics 2/3

Shingling basics 3/3

Cluster Assignment

[Flow diagram] Components shown:
• Input: Closed View Item (Item Title, Attributes, Leaf Category, Site)
• Cluster Dictionary, Inverted Index
• Assignment Service: Pre-processing, Meta-data Files, Rank Clusters
• Voyager Call for top N clusters
• Rank top N similar Items*
• Output: Recommended Similar Items
• Implemented using Lucene

Cluster Assignment: Pre-processing

• Input title: new 2x for canon lp-e8 battery + charger + lens hood eos 550d 600d digital rebel t3i
• RTL Normalization: new,2-x,for,canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital,rebel,t-3-i
• Concept Extraction: new,2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
• Stop Word Filtering: 2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
• STL Expansion: 2-x,for canon,PHRASE(lp,e,8),OR(batteries,battery,batterys),OR(charger,chargers),lens,OR(hood,PHRASE(hood,s),hoods),eos,PHRASE(550,d),PHRASE(600,d),digital rebel,t-3-i
• Query Reduction / Unification: 2-x,for canon,phrase(lp,e,8),batteries,charger,lens,phrase(hood,s),eos,phrase(550,d),phrase(600,d),digital rebel,t-3-i

Cluster Assignment: Scoring

• Indexing fields: title terms and categories
• Reward matching terms and penalize non-matching terms
  – Reward matching terms
    • Number of terms matching from input
  – Importance of term in input
    • Query time boost
  – Penalize non-matching terms from cluster ‘c’
    • Index time boost: field length normalization
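The three scoring ingredients above can be combined in a toy scorer. This is a from-scratch illustration of the idea, not Lucene's API or its exact formula: the coverage factor, the per-term boost lookup, and the 1/sqrt(length) normalization are assumptions chosen to mirror the bullets.

```python
import math


def cluster_score(input_terms, cluster_terms, boosts=None):
    """Score one cluster description against the terms of an input item."""
    boosts = boosts or {}
    input_set, cluster_set = set(input_terms), set(cluster_terms)
    matched = input_set & cluster_set
    if not matched:
        return 0.0
    # Reward: share of the input terms that the cluster matches.
    coverage = len(matched) / len(input_set)
    # Importance of each matched input term (query-time boost, default 1.0).
    importance = sum(boosts.get(t, 1.0) for t in matched)
    # Penalty: clusters with many terms (hence many non-matching ones) are normalized down.
    length_norm = 1.0 / math.sqrt(len(cluster_set))
    return coverage * importance * length_norm


def rank_clusters(input_terms, clusters, boosts=None, top_n=5):
    """Return the ids of the top-N clusters with a non-zero score."""
    scored = [(cluster_score(input_terms, terms, boosts), cid)
              for cid, terms in clusters.items()]
    scored.sort(reverse=True)
    return [cid for score, cid in scored[:top_n] if score > 0]


# Hypothetical cluster dictionary entries for illustration.
clusters = {
    "ipod_protector": ["apple", "ipod", "touch", "clear", "film", "protector", "screen"],
    "clarks_pumps": ["clarks", "shoe", "pumps", "classics"],
    "ipod_case_bundle": ["apple", "ipod", "touch", "case", "cover", "charger",
                         "cable", "bundle", "clear"],
}
item = ["clear", "screen", "protector", "apple", "ipod", "touch"]
```

Here `rank_clusters(item, clusters)` puts the screen-protector cluster first: it matches every input term and has few extra terms, while the bundle cluster matches fewer terms and is longer.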

Cluster Assignment Cross Validation

• Compute precision of recommending items from the “correct” cluster(s)
  – Clusters that generate purchases (BIDs and/or BINs)
• Labeled data
  – View-Buy data generated from user session analysis
    • CVIP -> Bid/BIN in same user session
    • Same category

Cluster Assignment Cross Validation: Method

• For each and in , top k(5) clusters list and
• Ignoring the position, compute precision in top k
• Ignores
  – True dependent on ranking
  – Assume every item belonging to a cluster is equally likely to be recommended
  – Normalized Precision
  – where is the smallest cluster in
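The symbols on this slide were lost, so the exact definition cannot be reproduced. A plausible, simplified reading is: for each view-buy pair, take the top k (5) clusters assigned to the viewed item and count a hit if the bought item's cluster is among them, ignoring position. The sketch below implements only that reading; the function and argument names are hypothetical, and the slide's normalized precision is not reproduced.

```python
def precision_at_k(view_buy_pairs, assign_clusters, cluster_of, k=5):
    """Fraction of view-buy pairs whose bought item falls in a top-k cluster.

    view_buy_pairs:  list of (viewed_item, bought_item) from the same session.
    assign_clusters: callable returning a ranked list of cluster ids for an item.
    cluster_of:      dict mapping an item to the cluster it belongs to.
    """
    if not view_buy_pairs:
        return 0.0
    hits = 0
    for viewed, bought in view_buy_pairs:
        top_clusters = assign_clusters(viewed)[:k]
        if cluster_of.get(bought) in top_clusters:  # position within top k is ignored
            hits += 1
    return hits / len(view_buy_pairs)


# Toy labeled data for illustration.
ranked = {"v1": ["c1", "c2"], "v2": ["c3"]}
bought_cluster = {"b1": "c2", "b2": "c9"}
pairs = [("v1", "b1"), ("v2", "b2")]
result = precision_at_k(pairs, lambda item: ranked[item], bought_cluster)
```

In the toy data one of the two pairs is a hit, so the precision is 0.5.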


Merchandising Applications

• There are two kinds of recommendation systems that use SIC:
  – Recommending similar items on the CVIP non-winner page
  – Collaborative Filtering (CF) algorithms:
    • “Buy-Buy” – On Checkout Page
    • “View-Buy” – On AVIP

Similar Item Recommendations

• User bid on but lost an item
  – Show similar items as replacement items.
• User was watching an item that has ended
  – Show similar items as replacement items.
• User viewed an item but did not make a purchase
  – Show similar items to showcase more choices.
  – Inject diversity in the recommendation.


Similar Item Recommendations - Example

Similar Item Recommendations (contd.)

Collaborative Filtering on SIC – “Buy-Buy”

• Once a user has purchased an item, what else can we recommend to go with the purchase?
• Drive incremental purchases
  – On check-out, recommend other items that “go-together” with the purchased item.
  – E.g. for a cell phone we may recommend a charger, case, screen protector.
  – For a dress shirt, we may recommend a tie, dress shoes or a jacket.


Collaborative Filtering on SIC – “Buy-Buy”

• Non-productized item inventory with short lifetime makes any CF based approach difficult.

• Map the items to a higher level abstraction (clusters) to handle data sparsity.

• Re-use the item clusters generated for Similar Item Recommendation.
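The idea of running "Buy-Buy" collaborative filtering at the cluster level can be sketched as a co-occurrence count over clusters rather than items. The data layout, the simple co-purchase count, and all names below are illustrative assumptions; the slides do not describe the actual CF algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cluster_cooccurrence(purchases_by_user, cluster_of):
    """Count how often two clusters are bought by the same user."""
    related = defaultdict(Counter)
    for items in purchases_by_user.values():
        clusters = {cluster_of[i] for i in items if i in cluster_of}
        for a, b in combinations(sorted(clusters), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def recommend_clusters(purchased_item, related, cluster_of, top_n=3):
    """Clusters most often co-purchased with the purchased item's cluster."""
    seed = cluster_of.get(purchased_item)
    if seed not in related:
        return []
    return [c for c, _ in related[seed].most_common(top_n)]


# Toy data: every item id is one-of-a-kind, but clusters repeat.
cluster_of = {"i1": "phone", "i2": "charger", "i3": "phone",
              "i4": "charger", "i5": "case", "i6": "phone"}
purchases = {"u1": ["i1", "i2"], "u2": ["i3", "i4", "i5"], "u3": ["i6"]}
related = cluster_cooccurrence(purchases, cluster_of)
```

Item "i6" has never been co-purchased with anything, yet `recommend_clusters("i6", related, cluster_of)` still returns the charger and case clusters, because the evidence is pooled at the cluster level. That pooling is how the higher-level abstraction handles the sparsity of short-lived, non-productized items.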

Related Recommendations: Before & After

Recommendations for Xbox 360 4GB on Checkout page

Conclusion

• SIC platform has proven its utility and is a critical component of merchandising algorithms.
• Future Work
  – Quality needs to be improved for long tail categories like Art, Collectibles, etc.
  – Better distinguish between CVIP loser/browser
  – End-to-End Cross-Validation framework

Cluster Assignment: Aspect Demand

• Historical (6-7 months) user behavior data
• Rank ordered lists of aspects used in
  – Search Queries
  – Left Navigation Filters
• Combined using rank aggregation
  – Importance of aspect in category
  – Used as query time boost during cluster index lookup
• Example:
  – Input: AIR JORDAN RETRO 4 IV MILITARY BLUE 2006 SIZE 9.5 USED
    » k:air jordan^2.0 k:retro^1.25 k:military^1.25 k:blue^1.2
  – Also used in Concept Landing Pages (CLPs) and Popular Watches w/ Aspects
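The slides do not say which rank aggregation method is used. As a sketch, the two ranked aspect lists can be combined with a Borda count, and the aggregated rank mapped to a boost that is rendered in the `k:term^boost` style of the example above. The Borda count, the linear boost scale between 1.0 and 2.0, and all function names are assumptions.

```python
def borda_aggregate(ranked_lists):
    """Combine ranked aspect lists with a Borda count (higher total = more important)."""
    universe = {a for lst in ranked_lists for a in lst}
    points = {a: 0 for a in universe}
    for lst in ranked_lists:
        for pos, aspect in enumerate(lst):
            points[aspect] += len(universe) - pos
    # Ties are broken alphabetically to keep the result deterministic.
    return sorted(universe, key=lambda a: (-points[a], a))


def aspect_boosts(aggregated, max_boost=2.0, min_boost=1.0):
    """Map aggregated rank to a boost, linearly from max_boost down to min_boost."""
    if len(aggregated) == 1:
        return {aggregated[0]: max_boost}
    step = (max_boost - min_boost) / (len(aggregated) - 1)
    return {a: round(max_boost - i * step, 2) for i, a in enumerate(aggregated)}


def boosted_query(values_by_aspect, boosts):
    """Render an item's aspect values as a boosted query string."""
    return " ".join(f"k:{value}^{boosts.get(aspect, 1.0)}"
                    for aspect, value in values_by_aspect.items())


# Hypothetical aspect rankings for one category.
search_rank = ["brand", "style", "color"]   # from search queries
nav_rank = ["brand", "color", "style"]      # from left navigation filters
boosts = aspect_boosts(borda_aggregate([search_rank, nav_rank]))
query = boosted_query({"brand": "air jordan", "color": "blue"}, boosts)
```

With these toy rankings the query string comes out as `k:air jordan^2.0 k:blue^1.5`, so the aspect users filter on most often carries the most weight during the cluster index lookup.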

Ranking

• Aspect demand data based on the input item is used in ranking
  – Ex: material=‘leather’ may not be there in the cluster description.
  – Clarks Women Shoes
• Format bias based on seed item’s format

Format Affinity

• X% of seed items are auctions for CVIP non-winner
• High affinity towards the seed item's format