scalable machine learning pipeline for meta data discovery from ebay listings
TRANSCRIPT
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Scalable Machine Learning Pipeline for Metadata Discoveryfrom eBay Listings
Qing Zhang, Rui Li
eBay
Spark Summit 2016, June 6-8 San Francisco
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Table of Contents
1 Who We Are
2 Metadata Discovery and Challenges
3 Spark Solution
4 Summary
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
eBay Structured Data
Metadata discovery andmanagementListing classificationCatalog and mapping listingto productInventory insights
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
eBay Structured Data
Metadata discovery andmanagementListing classificationCatalog and mapping listingto productInventory insights
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
eBay Structured Data
Metadata discovery andmanagementListing classificationCatalog and mapping listingto productInventory insights
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Metadata Zoom In
Important name-value pairs: brand - DellSelling flow item specificsSearch navigationPowers internal applications
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Metadata Discovery
memory box weiss yamaha fisher-price genericmodway duluth trading mek usa dnm other sahara club
gokey longhorn outdoor gear trax wolverinesk spiderman vintage mixed orchard corset
Highly rely on manual reviewUnfamiliar candidatesThe same candidate appears in multiple categories
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Challenges in Metadata Discovery
Term Site Categoriesscott james US Men’s Clothing: Blazers & Sport Coats
Men’s Clothing: PantsMen’s Clothing: Casual ShirtsMen’s Clothing: Dress Shirts
tiella US Chandeliers & Ceiling FixturesLighting Parts & Accessories
turf US Sports Mem, Cards & Fan Shop: Cards: Footballturf UK Collectables: Cigarette/Tea/Gum Cards: Cigarette Cards:
Other Cigarette Cards
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Data Driven Approach for Brand Discovery
Utilize seller input item specificsUtilize supply demand signals from sellers and buyersTraining data available from previously reviewed candidates
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Supervised Machine Learning Approach
Data : 35,000 previously human reviewed metadata candidatesFeature : supply and demand signalsPrototypes with Python ScikitLogistic regression, gradient boosting trees, random forest etcRandom forest F1 0.878
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Production Development
In the past, train offline and implement prediction component on productionFile transferring and configurations are time-consuming
Dev
Local
Production
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Spark and Spark MLlib
MLlib GraphX
Spark provides powerful data processing APIsMLlib is a comprehensive machine learning package powered by SparkRegression, classification, clustering, dimensionality reduction etcEfficient development with local model, and flexible file access
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
The Machine Learning System
Training Prediction
Mllib
Spark
Feature
Spark
MLlib
Feature Training Prediction
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Model Training with MLlib
val pipeline = new Pipeline (). setStages ( Array ( labelIndexer , featureIndexer ,
rf , labelConverter ))
val evaluator = new MulticlassClassificationEvaluator (). setLabelCol (" indexedLabel "). setPredictionCol (" prediction ")
val cv = new CrossValidator (). setEstimator ( pipeline ). setEvaluator ( evaluator ). setEstimatorParamMaps ( paramGrid ). setNumFolds (5)
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Evaluations
Model F1Python Scikit prototype 0.878
MLlib local 0.865MLlib Hadoop (200 executors, production) 0.862
MLlib Hadoop (400 executors) 0.861MLlib Hadoop (50 executors) 0.857MLlib Hadoop (2 executors) 0.862
The performance variations among implementations are acceptable
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Speed
Stage Data Size TimeFeature Generation 1.73 Billion 6 min
Train 33,000 8 minPrediction Input 650,000 4 min
Capable of running the job dailySpeed up the metadata discovery process, from months to days
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Newly Discovered Brand
Brand ProbabilityMilkies 0.84BEABA 0.85
OXO 0.83Lorex 0.87
Plan Toys 0.85Safety 1st 0.82
Blabla 0.81Combi 0.88Graco 0.88
TotsBots 0.85Realtree 0.85
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Summary
Spark enables fast iterations of ML application developmentMLlib is comprehensive, and well integrated with Spark frameworkDev and test locally, straightforward production deploymentCompact code : 600 linesNeed better understanding of the ML algorithm implementations in MLlib
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Acknowledgement
Thejas DurgamAnu MandalamMeital Tahar Zahav & eBay SDO TeamJean-David Ruvini
Spark forMetadataDiscovery
Who We Are
MetadataDiscovery andChallenges
Spark Solution
Summary
Thank You!
Qing Zhang, [email protected] Li, [email protected]