Image Classification and Retrieval on Spark
TRANSCRIPT
![Page 1: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/1.jpg)
SPARK MBUTO
Design & Engineering Machine Learning Pipelines
Gianvito Siciliano
Use Case: Image Classification and Retrieval
![Page 2: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/2.jpg)
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
![Page 3: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/3.jpg)
OUTLINE
1. Spark ‘Mbuto intro
• Abstractions
• Basic Examples
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
![Page 4: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/4.jpg)
SPARK MBUTO
• A Spark PoC (proof of concept) to easily create, run and test pipelines and workflows
• Pipelines are made of sequential steps in a SparkJobApp
• Each step is a SparkJob
• All jobs share the same Spark/SQL context
• Jobs are run consecutively by the JobRunner
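
The job abstraction described above can be sketched in a few lines. This is a minimal plain-Python illustration, not the Spark ‘Mbuto code itself: the class names `SparkJob` and `JobRunner` and the methods `.execute`, `.run` and `.main` come from the slides, while the dict standing in for the shared Spark/SQL context and the two example jobs (`LoadNumbers`, `SumNumbers`) are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class SparkJob(ABC):
    """One pipeline step; it receives the shared context and returns it."""

    @abstractmethod
    def execute(self, context: dict) -> dict:
        ...


class JobRunner:
    """Runs the jobs one after another, passing the same context along."""

    def __init__(self, jobs):
        self.jobs = list(jobs)

    def run(self, context: dict) -> dict:
        for job in self.jobs:
            context = job.execute(context)
        return context


class LoadNumbers(SparkJob):  # hypothetical example job
    def execute(self, context):
        context["numbers"] = [1, 2, 3, 4]
        return context


class SumNumbers(SparkJob):  # hypothetical example job
    def execute(self, context):
        context["total"] = sum(context["numbers"])
        return context


def main():
    # Plays the role of the app's main: one shared context, sequential jobs.
    # The empty dict is an assumed stand-in for the Spark/SQL context.
    return JobRunner([LoadNumbers(), SumNumbers()]).run({})


if __name__ == "__main__":
    print(main())
```

Keeping each step behind a single `execute` method is what lets the runner stay generic: adding a step to the pipeline means adding one more job to the list.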
![Page 5: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/5.jpg)
SPARKJOB
![Page 6: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/6.jpg)
JOBRUNNER
![Page 7: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/7.jpg)
SPARKJOBAPP
![Page 8: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/8.jpg)
PIPELINE
App.main => JobRunner.run => Job.execute => next job => Job.execute
![Page 9: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/9.jpg)
JOB READY TO USE
![Page 10: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/10.jpg)
READABLE APP
![Page 11: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/11.jpg)
PERFORMANCE LOOKUP
![Page 12: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/12.jpg)
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
• Classification
• Retrieval
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
![Page 13: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/13.jpg)
IMAGE CLASSIFICATION
• Multiclass image classification:
1. Choose model (NN, SVM, TREE…)
2. Train/test model (with labeled images)
3. Predict the label of new images
4. Tune the model
![Page 14: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/14.jpg)
IMAGE RETRIEVAL
• Image retrieval:
1. Choose metric (Euclidean, cosine…)
2. Build dictionary
3. Train/test the model
4. Query and search
5. Tune the model
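
The "choose metric" and "query and search" steps can be illustrated with a small sketch. The two distance functions are the standard Euclidean and cosine definitions; the in-memory `index` dict, the image ids and the three-dimensional vectors are hypothetical values chosen only for the example.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def query(index, q, metric, k=2):
    """Return the ids of the k indexed images closest to q under `metric`."""
    ranked = sorted(index, key=lambda image_id: metric(index[image_id], q))
    return ranked[:k]


# Hypothetical index: image id -> feature vector.
index = {
    "cat_1": (0.9, 0.1, 0.0),
    "cat_2": (0.8, 0.2, 0.0),
    "car_1": (0.0, 0.1, 0.9),
}

if __name__ == "__main__":
    print(query(index, (1.0, 0.0, 0.0), euclidean))
    print(query(index, (1.0, 0.0, 0.0), cosine_distance))
```

Because the metric is passed in as a function, swapping Euclidean for cosine distance does not change the search code, which makes it easy to compare metrics while tuning.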
![Page 15: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/15.jpg)
WHAT CHANGES?
• Pipeline architecture
• Classification logic
• How to update the model?
![Page 16: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/16.jpg)
CLASSIFICATION PIPELINE
DATA
TRAIN CLASSIFIER
MODEL
NEW DATA
PREDICTION
![Page 17: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/17.jpg)
RETRIEVAL PIPELINE
DATA
TRAIN CLASSIFIER
MODEL QUERY
PREDICTION
![Page 18: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/18.jpg)
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
![Page 19: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/19.jpg)
CLASSIFICATION & RETRIEVAL
• Keypoint extraction from each image
• Clustering over the universe of keypoints
• Represent each image as a weighted cluster vector
• Train & Test the model
• Query the model (finding the most similar images)
Features Engineering
Build the Dictionary
Build the classifier
Query the model
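
The step from keypoints to a cluster vector can be sketched as follows. The sketch assumes the cluster centroids have already been computed (for example by k-means) and that each keypoint is a numeric descriptor; the function names and the normalised-count weighting are illustrative choices, not the deck's implementation.

```python
def nearest_cluster(descriptor, centroids):
    """Index of the centroid closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[i])),
    )


def cluster_vector(descriptors, centroids):
    """Fraction of an image's keypoints that fall into each cluster."""
    counts = [0] * len(centroids)
    for descriptor in descriptors:
        counts[nearest_cluster(descriptor, centroids)] += 1
    total = len(descriptors) or 1  # avoid dividing by zero for empty images
    return [count / total for count in counts]


if __name__ == "__main__":
    centroids = [(0.0, 0.0), (10.0, 10.0)]                    # assumed, precomputed
    keypoints = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0), (1.0, 1.0)]
    print(cluster_vector(keypoints, centroids))
```

Every image ends up as a vector of the same length (one entry per cluster), whatever its number of keypoints, which is what makes the images comparable by a classifier or a similarity metric.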
![Page 20: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/20.jpg)
C. & R. JOBS
• Load the whole dataset
• Extract keypoints
• Reduce the keypoints universe
• Transform the features space
• Create the dictionary (aka Codebook)
• Train, test & evaluate the classifier
• Query and get prediction
DATA
TRAIN CLASSIFIER
MODEL
PREDICTION
![Page 21: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/21.jpg)
KMeansCLASSIFIER
ImageLOADER
.transform
SiftEXTRACTOR
KMeansQUANTISER
.fit
CLUSTERS
CfIifTRANSFORMER
ClusterVectorPIVOTER
CODEBOOK
Features Engineering
Build the Dictionary
DICTIONARY
TRANSFORMER
ESTIMATOR
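
The weighting applied by the CfIif transformer stage is not spelled out on the slide. The sketch below reads "CfIif" as cluster frequency times inverse image frequency, by analogy with TF-IDF; that reading, the function name `cf_iif` and the exact formula are assumptions.

```python
import math


def cf_iif(count_vectors):
    """Weight per-image cluster counts, TF-IDF style.

    count_vectors: one list of cluster counts per image, all the same length.
    Assumed formula: (count / image total) * log(n_images / images_with_cluster).
    """
    n_images = len(count_vectors)
    n_clusters = len(count_vectors[0])
    # In how many images does each cluster appear at least once?
    image_freq = [
        sum(1 for vector in count_vectors if vector[c] > 0)
        for c in range(n_clusters)
    ]
    weighted = []
    for vector in count_vectors:
        total = sum(vector) or 1
        weighted.append([
            (vector[c] / total) * math.log(n_images / image_freq[c])
            if image_freq[c] else 0.0
            for c in range(n_clusters)
        ])
    return weighted


if __name__ == "__main__":
    print(cf_iif([[2, 0, 1], [1, 1, 0]]))
```

Under this weighting, a cluster that occurs in every image gets weight zero, so only the clusters that help to tell images apart contribute to the codebook vectors.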
![Page 22: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/22.jpg)
VectorASSEMBLER
.transform
LabelINDEXER
KNNCLASSIFIER
.fit
.transform
.fit
KMeansCLASSIFIER
TRAIN TEST
.split
EVALUATOR
Train classifier
Evaluate classifier
INSAMPLE PREDICTION
OUTSAMPLE PREDICTION
CLASSIFIER
TRANSFORMER
ESTIMATOR
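
The split, fit and evaluate flow on this slide can be sketched with a brute-force kNN classifier. This is a plain-Python stand-in, not the deck's KNN classifier: the function names, the 75/25 split, the seed and the toy two-class data are assumptions.

```python
import math
import random
from collections import Counter


def knn_predict(train, x, k=3):
    """Majority label among the k training vectors closest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def accuracy(train, samples, k=3):
    """Share of samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in samples)
    return hits / len(samples)


# Toy data: two well-separated classes of (vector, label) rows.
rows = [((i * 0.1, i * 0.1), "a") for i in range(10)]
rows += [((10 + i * 0.1, 10 + i * 0.1), "b") for i in range(10)]

if __name__ == "__main__":
    train, test = split(rows)
    print("in-sample accuracy:", accuracy(train, train))
    print("out-of-sample accuracy:", accuracy(train, test))
```

Scoring the model on the training rows gives the in-sample figure and scoring it on the held-out rows gives the out-of-sample figure; a large gap between the two is the usual sign of overfitting.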
![Page 23: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/23.jpg)
KNN IMPLEMENTATION
• kNN is a comparison model: the similarity metric is crucial!
• Nearest-neighbour search (in the codebook) is the pain point:
• KDTree: not parallel (although…)
• LSH: hyperparameters are difficult to tune
• Metric tree: disjoint feature-point regions
• Spill tree: too many shared points
=> Hybrid Tree
![Page 24: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/24.jpg)
HYBRID TREE
• The top tree is a metric tree
• The sub-leaf trees are spill trees, trained in parallel
• Nodes can be:
• OVERLAP => defeatist search
• NON OVERLAP => backtracking
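
The two node types can be illustrated with a single-machine sketch: overlap nodes share the points near the split and are searched without backtracking (defeatist search), while non-overlap nodes keep disjoint children and are searched with backtracking. This is not the deck's parallel implementation. The parameters `tau` (overlap width), `rho` (balance threshold) and `leaf_size`, the pivot choice, and measuring the overlap on the difference of pivot distances are simplifying assumptions.

```python
import math
import random


class Node:
    """Tree node; `overlap` marks a spill-style split with shared points."""

    def __init__(self, points, tau=0.0, rho=0.7, leaf_size=8):
        self.points = points
        self.center = tuple(sum(c) / len(points) for c in zip(*points))
        self.radius = max(math.dist(self.center, p) for p in points)
        self.children = ()
        self.overlap = False
        if len(points) <= leaf_size:
            return
        # Two far-apart pivots define the split.
        a = max(points, key=lambda p: math.dist(points[0], p))
        b = max(points, key=lambda p: math.dist(a, p))
        margin = [math.dist(p, a) - math.dist(p, b) for p in points]
        # Try a spill split: points with |margin| <= tau go to both children.
        left = [p for p, m in zip(points, margin) if m <= tau]
        right = [p for p, m in zip(points, margin) if m >= -tau]
        self.overlap = tau > 0 and max(len(left), len(right)) <= rho * len(points)
        if not self.overlap:
            # Too many shared points: fall back to a disjoint metric split.
            left = [p for p, m in zip(points, margin) if m <= 0]
            right = [p for p, m in zip(points, margin) if m > 0]
        if not right:
            return  # all points coincide, keep this node as a leaf
        self.pivots = (a, b)
        self.children = (Node(left, tau, rho, leaf_size),
                         Node(right, tau, rho, leaf_size))


def nearest(node, q, best=(math.inf, None)):
    """Return (distance, point) of the neighbour found for query q."""
    if not node.children:
        for p in node.points:
            d = math.dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    a, b = node.pivots
    near, far = (node.children if math.dist(q, a) <= math.dist(q, b)
                 else node.children[::-1])
    best = nearest(near, q, best)
    # Overlap node: defeatist search, never visit the other child.
    # Non-overlap node: backtrack unless the far ball cannot hold a closer point.
    if not node.overlap and math.dist(q, far.center) - far.radius < best[0]:
        best = nearest(far, q, best)
    return best


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(300)]
    tree = Node(data, tau=0.05)
    print(nearest(tree, (0.5, 0.5)))
```

With `tau=0` every node is a non-overlap node and the search is exact; with `tau > 0` the defeatist steps make the search faster but approximate, which is the trade-off the hybrid design balances.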
![Page 25: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/25.jpg)
NEURAL NETWORK
• Convolutional networks work well with images
• Hyperparameter tuning is the pain point, but it can be automated (see the new algorithm)
• Training is not trivial, and updating the model is an easy thing to complain about
![Page 26: Image Classification and Retrieval on Spark](https://reader033.vdocuments.site/reader033/viewer/2022051709/587504ee1a28ab29208b6015/html5/thumbnails/26.jpg)
WHAT MORE?
• Feature engineering
• Hyperparameter tuning
• Parallel optimizations
• Persist/update steps
• Ensemble models
DATA
Combiner
PREDICTION
Normalizer
pipelineModel
Cross Validator
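
The cross-validation step in the diagram can be illustrated by tuning the number of neighbours of a kNN classifier with k-fold cross-validation. This is a plain-Python sketch of the idea, not Spark's cross-validator; the function names, the fold scheme, the candidate values and the toy data are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training vectors closest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validate(rows, k_neighbours, folds=4):
    """Mean accuracy over `folds` train/test partitions of the rows."""
    scores = []
    for f in range(folds):
        test = rows[f::folds]  # every folds-th row, starting at row f
        train = [row for i, row in enumerate(rows) if i % folds != f]
        hits = sum(knn_predict(train, x, k_neighbours) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def best_k(rows, candidates, folds=4):
    """Candidate with the highest cross-validated accuracy (first on ties)."""
    return max(candidates, key=lambda k: cross_validate(rows, k, folds))


# Toy data: two well-separated classes of (vector, label) rows.
rows = [((i * 0.1, i * 0.1), "a") for i in range(10)]
rows += [((10 + i * 0.1, 10 + i * 0.1), "b") for i in range(10)]

if __name__ == "__main__":
    for k in (1, 3, 15):
        print(k, cross_validate(rows, k))
    print("selected k:", best_k(rows, [1, 3, 15]))
```

Setting the number of neighbours to the full training-fold size makes every prediction the majority class, so its cross-validated score drops; this is the kind of poor setting the validation loop is meant to filter out.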