scalable machine learning pipeline for meta data discovery from ebay listings

20
Spark for Metadata Discovery Who We Are Metadata Discovery and Challenges Spark Solution Summary Scalable Machine Learning Pipeline for Metadata Discovery from eBay Listings Qing Zhang, Rui Li eBay Spark Summit 2016, June 6-8 San Francisco

Upload: spark-summit

Post on 08-Jan-2017

520 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Scalable Machine Learning Pipeline for Metadata Discoveryfrom eBay Listings

Qing Zhang, Rui Li

eBay

Spark Summit 2016, June 6-8 San Francisco

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Table of Contents

1 Who We Are

2 Metadata Discovery and Challenges

3 Spark Solution

4 Summary

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

eBay Structured Data

Metadata discovery andmanagementListing classificationCatalog and mapping listingto productInventory insights

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

eBay Structured Data

Metadata discovery andmanagementListing classificationCatalog and mapping listingto productInventory insights

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

eBay Structured Data

Metadata discovery andmanagementListing classificationCatalog and mapping listingto productInventory insights

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Metadata Zoom In

Important name-value pairs: brand - DellSelling flow item specificsSearch navigationPowers internal applications

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Metadata Discovery

memory box weiss yamaha fisher-price genericmodway duluth trading mek usa dnm other sahara club

gokey longhorn outdoor gear trax wolverinesk spiderman vintage mixed orchard corset

Highly rely on manual reviewUnfamiliar candidatesThe same candidate appears in multiple categories

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Challenges in Metadata Discovery

Term Site Categoriesscott james US Men’s Clothing: Blazers & Sport Coats

Men’s Clothing: PantsMen’s Clothing: Casual ShirtsMen’s Clothing: Dress Shirts

tiella US Chandeliers & Ceiling FixturesLighting Parts & Accessories

turf US Sports Mem, Cards & Fan Shop: Cards: Footballturf UK Collectables: Cigarette/Tea/Gum Cards: Cigarette Cards:

Other Cigarette Cards

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Data Driven Approach for Brand Discovery

Utilize seller input item specificsUtilize supply demand signals from sellers and buyersTraining data available from previously reviewed candidates

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Supervised Machine Learning Approach

Data : 35,000 previously human reviewed metadata candidatesFeature : supply and demand signalsPrototypes with Python ScikitLogistic regression, gradient boosting trees, random forest etcRandom forest F1 0.878

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Production Development

In the past, train offline and implement prediction component on productionFile transferring and configurations are time-consuming

Dev

Local

Production

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Spark and Spark MLlib

MLlib GraphX

Spark provides powerful data processing APIsMLlib is a comprehensive machine learning package powered by SparkRegression, classification, clustering, dimensionality reduction etcEfficient development with local model, and flexible file access

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

The Machine Learning System

Training Prediction

Mllib

Spark

Feature

Spark

MLlib

Feature Training Prediction

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Model Training with MLlib

val pipeline = new Pipeline (). setStages ( Array ( labelIndexer , featureIndexer ,

rf , labelConverter ))

val evaluator = new MulticlassClassificationEvaluator (). setLabelCol (" indexedLabel "). setPredictionCol (" prediction ")

val cv = new CrossValidator (). setEstimator ( pipeline ). setEvaluator ( evaluator ). setEstimatorParamMaps ( paramGrid ). setNumFolds (5)

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Evaluations

Model F1Python Scikit prototype 0.878

MLlib local 0.865MLlib Hadoop (200 executors, production) 0.862

MLlib Hadoop (400 executors) 0.861MLlib Hadoop (50 executors) 0.857MLlib Hadoop (2 executors) 0.862

The performance variations among implementations are acceptable

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Speed

Stage Data Size TimeFeature Generation 1.73 Billion 6 min

Train 33,000 8 minPrediction Input 650,000 4 min

Capable of running the job dailySpeed up the metadata discovery process, from months to days

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Newly Discovered Brand

Brand ProbabilityMilkies 0.84BEABA 0.85

OXO 0.83Lorex 0.87

Plan Toys 0.85Safety 1st 0.82

Blabla 0.81Combi 0.88Graco 0.88

TotsBots 0.85Realtree 0.85

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Summary

Spark enables fast iterations of ML application developmentMLlib is comprehensive, and well integrated with Spark frameworkDev and test locally, straightforward production deploymentCompact code : 600 linesNeed better understanding of the ML algorithm implementations in MLlib

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Acknowledgement

Thejas DurgamAnu MandalamMeital Tahar Zahav & eBay SDO TeamJean-David Ruvini

Spark forMetadataDiscovery

Who We Are

MetadataDiscovery andChallenges

Spark Solution

Summary

Thank You!

Qing Zhang, [email protected] Li, [email protected]